PREFACE


The 14th International Conference on Educational Data Mining (EDM 2021) was
held completely online. EDM is organized under the auspices of the
International Educational Data Mining Society, and this edition was originally
planned to take place in Paris, France. The conference, held June
29th through July 2nd, 2021, follows thirteen previous editions (Online 2020,
Montreal 2019, Buffalo 2018, Wuhan 2017, Raleigh 2016, Madrid 2015, London
2014, Memphis 2013, Chania 2012, Eindhoven 2011, Pittsburgh 2010, Córdoba
2009, and Montreal 2008).


The official theme of this year’s conference was Shifting Landscape of Education: 
Improving Blended and Distance Learning. This theme focused on identifying 
learning or teaching strategies that can be used to improve learning in various 
formats, such as partially or fully online, synchronous or asynchronous, and 
centralized or federated. In addition to the general topics, we welcomed research 
in the following areas: receiving implicit and explicit feedback from learners in
blended and distance learning (BDL) environments, interacting with students to
ensure no learner is left behind,
integrating and utilizing learning analytics in BDL environments to cope with 
switching between in-person and online modes, and addressing emerging privacy 
and ethical challenges in the new learning setting. This year’s conference featured 
three invited talks: Cristina Conati, Professor at University of British Columbia; 
Sidney D’Mello, Associate Professor at University of Colorado Boulder; and 
Pierre Dillenbourg, Professor at Ecole Polytechnique Fédérale de Lausanne. 


Building on the policy started in 2019, EDM 2021 continued using a
double-blind review process. Continuing the EDM 2020 efforts, the conference’s
Program Committee was significantly expanded by inviting new committee members
from among past authors in the EDM community and PC members of related
conferences. In total, 33 senior and 74 ordinary program committee members, in
addition to the conference and track chairs, contributed to the reviewing process. 
This year, we received a total of 100 full-paper submissions and 84 short-paper 
submissions. From the full-paper submissions, 22% were accepted as full papers, 
25% were accepted as short papers, and 16% were accepted as posters. From 
the short-paper submissions, 23% were accepted as short papers and 23% were 
accepted as posters. 


Review & Decision Processes: For transparency and possible benefit of
future EDM conferences, we are providing a detailed description of the paper
review and decision processes for the Full and Short paper tracks at EDM 2021:


1. After all papers were submitted, the Program Committee (PC) and Senior 
Program Committee (SPC) members directly bid on which papers they would 
like to review. 


2. If committee members did not bid on papers after several reminders, bids were 
assigned to them. This was done automatically via the EasyChair conference 
management system. 


3. Given the PC and SPC bids, the Program Chairs assigned papers to reviewers 
using EasyChair’s automatic assignment option. This assignment maximizes the 
total score of the assignment, with high weight on matches where the bid was a 
“yes”, medium weight on matches where the bid was a “maybe”, and low weight
on matches where the bid was a “no”. The Program Chairs checked the assignments
to ensure that they respected any review restrictions of the program committee
members and the number of reviews each member preferred, if stated. Each
paper was assigned to one SPC member and two PC members. Considering 
the increased number of submissions and the review limitations of PC members, 
each PC member received on average 5.2 papers, and each SPC member received 
on average 4.1 papers. The maximum number of papers assigned to a program 
committee member was 6 papers. The automated reviewing assignment was 
manually checked to ensure fairness to reviewers in being primarily assigned 
papers for which they had entered positive bids, fairness to papers in being 
primarily assigned reviewers who had bid positively on that paper, and that 
automatic conflict detection had accurately detected conflicts. A set of changes
was made based on this manual check, either because a paper had been assigned
only to reviewers who had bid “no” on it or because a reviewer had been
assigned multiple papers on all of which they had bid “no”.


4. In an effort to increase the mean and decrease the variance in review quality, 
the Program Chairs defined reviewing guidelines, both for the PC and the SPC. 
These guidelines were posted to the EDM 2021 website and also linked in emails 
sent to reviewers. 


5. At the end of the review period, the Program Chairs identified papers that 
received fewer than 3 reviews, as well as papers whose reviews were clearly 
lacking (e.g., just 1-2 sentences). Emergency reviewers (including the Program 
Chairs) were identified, and papers were assigned to them. 


6. The Program Chairs examined the meta-reviews and acceptance/rejection 
recommendations for all papers. For any papers lacking a meta-review, the 
Program Chairs read the reviews and the paper, wrote a meta-review, and 
arrived at a recommendation for acceptance/rejection. 


7. Papers were ranked by their weighted average review scores. The Program 
Chairs then manually identified and examined papers in “critical regions” of 
the ranking in which there was large variance in the meta-reviewers’ decision 
recommendations (Accept as Full, Accept as Short, Accept as Poster, Reject) or
there was a large difference between the weighted and unweighted average review
scores. The goal here was to ensure that, in the opinions of both Program 
Chairs, all papers accepted as either Full or Short exhibited sufficient rigor for 
publication as such. When in doubt, the more conservative outcome (i.e., Accept 
as Short rather than Full, or Accept as Poster rather than Short) was chosen. In 
particular: 


(a) For the Full paper track, the following range was calculated:
Let $m_f$ be the lowest score of any paper recommended by its
meta-reviewer for “Accept as full”, and let $n_s$ be the highest score of any
paper recommended by its meta-reviewer for “Accept as short”. For
any paper recommended for “Accept as full” whose score was in
$[m_f, n_s]$, the Program Chairs discussed the paper and decided jointly
whether to Accept as Full or Short. This deliberation focused on
the question: “Do the reviewers point out important methodological
or other fundamental problems that could significantly threaten
validity?”


(b) The analogous process (both for papers submitted as Full, and for
papers submitted as Short) was applied to papers whose weighted or
unweighted average review scores were in the range $[m_s, n_p]$, where
$m_s$ is the lowest score of any paper recommended for Accept as Short
and $n_p$ is the highest score of any paper recommended for Accept as
Poster.


(c) All other papers (i.e., those whose unweighted average review
scores were outside the ranges described above) were accepted/rejected
according to the recommendation of their assigned meta-reviewer.
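

The range computation in steps (a) and (b) can be sketched in a few lines of Python. This is only an illustration: the dictionary fields (`id`, `score`, `rec`), the function names, and the example scores are hypothetical, and the actual deliberation was carried out by the Program Chairs by hand.

```python
def critical_range(papers, upper="full", lower="short"):
    """Return (low, high) for the score range where the two recommendation
    categories overlap, or None when they do not overlap.

    low  = lowest score among papers recommended for `upper`
    high = highest score among papers recommended for `lower`
    """
    upper_scores = [p["score"] for p in papers if p["rec"] == upper]
    lower_scores = [p["score"] for p in papers if p["rec"] == lower]
    if not upper_scores or not lower_scores:
        return None
    low, high = min(upper_scores), max(lower_scores)
    return (low, high) if low <= high else None


def needs_discussion(papers, upper="full", lower="short"):
    """List the ids of `upper`-recommended papers whose score falls in the range."""
    bounds = critical_range(papers, upper, lower)
    if bounds is None:
        return []
    low, high = bounds
    return [p["id"] for p in papers
            if p["rec"] == upper and low <= p["score"] <= high]


# Hypothetical scores and recommendations, for illustration only.
example = [
    {"id": "A", "score": 2.4, "rec": "full"},
    {"id": "B", "score": 1.1, "rec": "full"},
    {"id": "C", "score": 1.6, "rec": "short"},
    {"id": "D", "score": 0.3, "rec": "short"},
]
print(critical_range(example))    # range for the Full/Short boundary
print(needs_discussion(example))  # papers the chairs would discuss
```

Calling the same functions with `upper="short"` and `lower="poster"` gives the analogous range for step (b).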


During all aspects of both the Review and Decision processes, no Program Chair 
examined or handled any paper on which she was a co-author; any such paper 
was seen and handled exclusively by the other Chair to avoid a conflict of interest. 
(No papers were co-authored by both Program Chairs.) 


Note that papers submitted to the Industry, Doctoral Consortium, Poster/Demo, 
and Workshop components of EDM 2021 had their own reviewing processes 
that were defined by the corresponding chairs in consultation with the Program 
Chairs. Papers published in the Poster/Demo track are the union of those 
submitted & accepted as Posters/Demos, and those submitted as either the Full 
or Short tracks that were accepted as Posters. 


Posters/Demos: In addition to the Full or Short paper submissions that were 
accepted as posters mentioned above, there was a dedicated Poster/Demo track 
to which papers could be submitted directly. This track accepted 10 contributions 
out of 20 submissions. 


JEDM: Together with the Journal of Educational Data Mining (JEDM), the 
EDM 2021 conference held a JEDM Track that provides researchers a venue to
deliver more substantial, mature work than is possible in a conference proceeding
and to present their work to a live audience. The papers submitted to this track 
followed the JEDM peer review process. Three JEDM papers are featured in 
the conference’s program. 


Industry: The main conference invited contributions to an Industry Track in 
addition to the main track. The EDM 2021 Industry Track received 5 submissions, 
of which 4 were accepted. 


Doctoral Consortium: The EDM conference continues its tradition of
providing opportunities for young researchers to present their work and receive
feedback from their peers and senior researchers. The doctoral consortium this
year features 4 such presentations.


Paper Topics: Tables 1 and 2 below list the top 20 keywords that the
EasyChair system associated with the papers. For all submitted papers, the
keywords were:


Keyword                        Weight
Educational Data Mining        1000
Knowledge Tracing              624
Machine Learning               494
Neural network                 426
Learning Analytics             397
Higher education               305
Student performance            299
Intelligent Tutoring System    284
Computer Science               278
Data Mining                    259
Self-regulated learning        235
Deep Knowledge Tracing         209
Deep Learning                  174
Natural Language Processing    167
Response time                  164
Research question              161
Automated Essay Scoring        155
Artificial Intelligence        154
Peer assessment                149
Learning environment           142

Table 1. Top 20 keywords associated with submitted papers 


Also, the table below lists the top 20 popular keywords associated with the 
accepted papers: 


Keyword                        Weight
Educational Data Mining        300
Knowledge Tracing              163
Causal inference               124
Machine Learning               96
Neural network                 90
Learning Analytics             88
Educational social network     74
Computer Science               70
Learning Agency                68
Hidden Markov Model            67
Intelligent Tutoring System    66
Knowledge State                66
Data Mining                    65
Student performance            58
Stack Overflow                 53
Student knowledge state        49
Computational linguistic       46
Problem solving                45
Curricular pattern             42
Student success                38


Table 2. Top 20 keywords associated with accepted papers


Best Paper, Presentation, and Reviewer Awards


Following past EDM traditions, one best full paper, one best short paper,
and one best student paper were selected and awarded. In the selection process,
the program chairs reviewed the papers that were praised and consistently
rated highly by the program committee, together with the papers recommended by
the senior program committee members. After selecting four nominees from all
the full and short papers, a committee of senior EDM members voted, met, and
conferred to select the awardees. The best student paper was selected from the
list of full paper nominees, since all of them had student first authors. The
best paper awardees are:


The best full paper: 


Just a Few Expert Constraints Can Help: Humanizing Data-Driven Subgoal 
Detection for Novice Programming. By Samiha Marwan, Yang Shi, Ian Menezes, 
Min Chi, Tiffany Barnes and Thomas Price 


The best student paper: 


Early Prediction of Conceptual Understanding in Interactive Simulations. By 
Jade Cock, Mirko Marras, Christian Giang and Tanja Kaser 


The best short paper: 


Do Common Educational Datasets contain Static Information? A Statistical 
Study. By Théo Barollet, Florent Bouchez-Tichadou and Fabrice Rastello 


As the conference moved to the online setting, the program chairs decided to add
best paper presentation and best poster presentation awards to encourage
high-quality presentations by paper authors and engagement and attendance of the
community. These awards were selected by the EDM conference community and
attendees in a rank-based voting system. The attendees could rank paper
and poster presentations separately via two online Google forms, by selecting
the best, second-best, and third-best presentations during the conference. To
avoid memory availability bias, the forms were extensively advertised in the daily
emails sent to the attendees by the general chairs, and at the end of each session.
The votes were tallied before the closing session to announce the awardees.


The best poster presentation: 


Are Violations of Student Privacy “Quick and Easy”? Investigating the Privacy 
of Students’ Images and Names in the Context of K-12 Educational Institution’s 
Posts on Facebook. By Macy Burchfield, Joshua Rosenberg, Conrad Borchers, 
Tayla Thomas, Benjamin Gibbons and Christian Fischer 


The best paper presentation: 


Early Prediction of Conceptual Understanding in Interactive Simulations. By 
Jade Cock, Mirko Marras, Christian Giang and Tanja Kaser 


In addition to the above, to thank the current program committee
members and to encourage service on the program committee and
high-quality reviews at future EDM conferences, the program chairs added
a best reviewer award to the list of awards. The recipients were selected
by the program chairs after carefully reading all the reviews and meta-reviews
of all papers, focusing on reviewers who provided exceptional reviews (for
example, offering suggestions on how to improve the authors’ work and pointing
out relevant literature) and on those who were on time or volunteered to review
more papers than average. The best reviewers are:


Agathe Merceron, Beuth University of Applied Sciences Berlin
Andrew Olney, University of Memphis
Anna Rafferty, Carleton College
Cheng-Yu Chung, Arizona State University
Christopher Brooks, University of Michigan
Dragan Gasevic, Monash University
Giora Alexandron, Weizmann Institute of Science
James Lester, North Carolina State University
Joshua Gardner, University of Michigan
Julio Guerra, Universidad Austral de Chile
Sébastien Lallé, The University of British Columbia, Department of Computer Science
Sergey Sosnovsky, Utrecht University
Shalini Pandey, University of Minnesota
Stephen Fancsali, Carnegie Learning, Inc.
Tanja Kaser, EPFL
Vincent Aleven, Human-Computer Interaction Institute, Carnegie Mellon University


Test of Time Award: Following in the footsteps of last year’s conference, 
EDM 2021 also includes an invited talk by the authors of the 2020 winner of the 
EDM Test of Time Award. This year’s talk is delivered by Cristóbal Romero.


Workshops: In addition to the main program, six workshops were accepted:
Reinforcement Learning for Education: Opportunities and Challenges;
Causal Inference in Educational Data Mining; Workshop for Undergraduates in
Educational Data Mining and Learning Engineering; A Workshop on Process
Analysis Methods For Educational Data; The Second Workshop of The Learner
Data Institute: Big Data, Research Challenges, & Science Convergence in
Educational Data Science; and The 5th Educational Data for Mining in Computer
Science Education (CSEDM).


Coronavirus: This year’s conference was originally arranged to take place in
Paris, France. Due to the SARS-CoV-2 (coronavirus) epidemic, EDM 2021, as
well as most other academic conferences in 2021, had to be changed to a purely
online format. This presented some difficulties, especially in how to engage and
encourage interaction among participants using just Zoom and other online tools
(e.g., Gather Town, SpeakUp, and Whova) rather than face-to-face meetings.
However, it also significantly reduced the costs of conducting and attending the
conference, since physical meeting spaces, air travel, and on-site lodging were
no longer necessary, which arguably increased our conference’s accessibility.
To facilitate efficient transmission of presentations, especially given that not
everyone’s Internet connection could be guaranteed to be stable, we required all
paper presenters to pre-record their presentation as a video and then to host it
online.


Thanks: We thank Direction du numérique pour l’éducation (French MoE), Pix,
CNRS, Inria, Eedi, EvidenceB, Educational Testing Service (ETS), and Duolingo,
the sponsors of EDM 2021, for their generous support, especially during this
financially difficult time of the coronavirus. We are also grateful to the
individual conference chairs, the senior program committee, regular program
committee members, subreviewers, emergency reviewers, and IEDMS board members,
without whose expert input and hard work this conference would not have been
possible. Finally, we thank the entire organizing team and all authors who
submitted their work to EDM 2021.


Abstract


Educational data mining involves the application of data mining techniques to student activity. However, in the context of 
computer programming, many data mining techniques cannot be applied because they require vector-shaped input whereas
computer programs have the form of syntax trees. In this paper, we present ast2vec, a neural network that maps Python 
syntax trees to vectors and back, thereby enabling about a hundred data mining techniques that were previously not applicable. 
Ast2vec has been trained on almost half a million programs of novice programmers and is designed to be applied across learning 
tasks without re-training, meaning that users can apply it without any need for deep learning. We demonstrate the generality 
of ast2vec in three settings: First, we provide example analyses using ast2vec on a classroom-sized dataset, involving two novel 
techniques, namely progress-variance-projection for visualization and a dynamical systems analysis for prediction. In these 
examples, we also explain how ast2vec can be utilized for educational decisions. Second, we consider the ability of ast2vec to 
recover the original syntax tree from its vector representation on the training data and two further large-scale programming 
datasets. Finally, we evaluate the predictive capability of a linear dynamical system on top of ast2vec, obtaining similar 
results to techniques that work directly on syntax trees while being much faster (constant- instead of linear-time processing). 
We hope ast2vec can augment the educational data mining toolbelt by making analyses of computer programs easier, richer, 
and more efficient.


Abstract


Human one-on-one coaching involves complex multimodal interactions. Successful coaching requires teachers to closely monitor 
students’ cognitive-affective states and provide support of optimal type, timing, and amount. However, most of the existing 
human tutoring studies focus primarily on verbal interactions and have yet to incorporate the rich aspects of multimodal 
cognitive-affective experiences. Meanwhile, the research community lacks principled methods to fully exploit the complex 
multimodal data to uncover the causal relationships between coaching supports and students’ cognitive-affective experiences 
and their stable individual factors. We explore an analytical framework that is explainable and amenable to incorporating 
domain knowledge. The proposed framework combines statistical approaches in Sparse Multiple Canonical Correlation, 
causal discovery and inference methods for observations. We demonstrate this framework using a multimodal one-on-one 
math problem-solving coaching dataset collected at naturalist home environments involving parents and young children. The 
insights derived from our analyses may inform the design of effective technology-inspired interventions that are personalized 
and adaptive.


Abstract


Adaptive spacing algorithms are powerful tools for helping learners manage their study time efficiently. By personalizing the 
temporal distribution of retrieval practice of a given piece of knowledge, they improve learners’ long-term memory retention 
compared to fixed review schedules. However, such algorithms are generally designed for the pure memorization of single 
items, such as vocabulary words. Yet, the spacing effect has been shown to extend to more complex knowledge, such as the 
practice of mathematical skills. In this article, we extend three adaptive spacing heuristics from the literature for selecting the 
best skill to review at any timestamp given a student’s past study history. In real-world educational settings, items generally 
involve multiple skills at the same time. Thus, we also propose a multi-skill version for two of these heuristics: instead of 
selecting one single skill, they select with a greedy procedure the most promising subset of skills to review. To compare these 
five heuristics, we develop a synthetic experimental framework that simulates student learning and forgetting trajectories with 
a student model. We run multiple synthetic experiments on large cohorts of 500 simulated students and publicly release the 
code for these experiments. Our results highlight the strengths and weaknesses of each heuristic in terms of performance, 
robustness and complexity. Finally, we find evidence that selecting the best subset of skills yields better retention compared 
to selecting the single best skill to review.


ABSTRACT

Research studies in Educational Data Mining (EDM) often 
involve several variables related to student learning activi- 
ties. As such, it may be necessary to run multiple statisti- 
cal tests simultaneously, thereby leading to the problem of 
multiple comparisons. The Benjamini-Hochberg (BH) pro- 
cedure is commonly used in EDM research to address this 
issue, and it has proven to be a useful method. However, the 
main limitation of the procedure is that it requires the statis- 
tical tests to either be independent or satisfy certain depen- 
dency conditions. The Benjamini-Yekutieli (BY) procedure 
is an alternative that can be applied under arbitrary depen- 
dence assumptions, but this extra flexibility comes with a 
loss of statistical power; hence, the BH procedure is pre- 
ferred whenever it can be properly applied. Based on these 
considerations, in this work we employ simulation studies to 
assess the validity of the BH procedure in two scenarios com- 
mon to EDM research. The first scenario considers the eval- 
uation and comparison of different classification models— 
such an analysis might occur, for instance, during the model 
tuning and validation stage of a study. Then, in the second 
scenario we look at experiments involving the study of state 
transitions in sequential data, examples of which occur in 
affect dynamics research. We find that the BH procedure 
performs as expected when used with simulated classifica- 
tion model predictions; however, when applied to simulated 
sequential data, it does not perform at the expected level. 
Based on these results, as well as previous studies evaluating 
the BH and BY methods, we discuss the appropriate usage 
of these procedures for the scenarios under examination. 


Keywords 


Multiple comparisons, false discovery rate, Benjamini-Hochberg,


1. INTRODUCTION


Consider a statistical analysis that tests several different null
hypotheses, either on a single data set, or on related data
sets. In such a scenario, the probability of making a discovery
(i.e., rejecting a null hypothesis) is higher than in
an analysis involving a single null hypothesis. Thus, it follows
that the probability of rejecting a true null hypothesis
increases as well; such errors are variously called false positives,
false discoveries, or type I errors. This is known in the
statistics literature as the multiple comparisons problem.
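

To give a sense of scale, consider the simplified case (an assumption made here purely for illustration) in which all m null hypotheses are true and the m tests are independent. The probability of at least one false positive at per-test level alpha is then 1 - (1 - alpha)^m. A minimal Python sketch, with a hypothetical function name:

```python
def prob_any_false_positive(m: int, alpha: float = 0.05) -> float:
    """Probability of at least one type I error among m independent tests,
    assuming every null hypothesis is true (illustrative assumption)."""
    return 1.0 - (1.0 - alpha) ** m


# The probability grows quickly with the number of tests.
for m in (1, 5, 20, 100):
    print(m, round(prob_any_false_positive(m), 3))
```

With 20 such tests at alpha = 0.05, the chance of at least one false positive already exceeds 60%.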


Studies in Educational Data Mining (EDM) and related 
fields are shaping the ongoing research and development of 
learning systems that are increasingly becoming part of ev- 
eryday classrooms—thus directly impacting student lives. 
Greater attention is needed to ensure that the conclusions 
drawn from these studies are reliable. Along these lines, 
controlling for multiple comparisons is an important consid- 
eration, as it has been argued that addressing the issue is a 
major factor in ensuring the replicability of scientific results 
[2]. Additionally, many exaggerated or even incorrect re- 
sults can be explained by the testing of multiple hypotheses 
without adjusting for the number of comparisons [34, 40]; 
while this issue commonly occurs with observational data, 
experimental studies are not immune to the problem [30]. 


The main focus of this study is the Benjamini-Hochberg 
(BH) procedure [3], a method that is commonly applied in 
EDM research to control the false discovery rate (FDR)— 
defined as the expected rate of false discoveries among all the 
discoveries made—when multiple statistical tests are used. 
One complication with using the BH procedure is that, in 
order for the theoretical guarantees on its performance to 
hold, the statistical tests must either be independent or sat- 
isfy certain dependency conditions [3, 4]. The Benjamini- 
Yekutieli (BY) procedure is an alternative method that can 
be used under arbitrary dependence assumptions among the 
statistical tests [4]. As the BY procedure is more gener- 
ally applicable than the BH procedure, it is by necessity 
more conservative and thus less likely to classify a result 
as being statistically significant; in turn, this causes it to 
have lower statistical power compared to the BH procedure. 
Thus, the BH procedure is to be preferred over the BY pro- 
cedure whenever it can be properly applied. 


However, the difficulty is that verifying the conditions for
applying the BH procedure is not always straightforward;
while some scenarios have been mathematically proven to
satisfy these conditions, many common examples have not
been. For instance, as of 2010 the case of pairwise comparisons
had not been mathematically proven to satisfy the
conditions for using the BH procedure [1], and to the best of
our knowledge that has not changed in the interim. Because
it is not always clear whether the conditions for applying the BH
procedure are satisfied, it is often used without any theoretical
guarantees on its performance [15]. In other situations,
researchers may resort to using both the BH and BY
procedures and comparing the results [28]. Motivated by
these issues, in this work we investigate two different scenarios
that occur within EDM research, with the goal of
understanding if the BH procedure is appropriate for each
situation. In both scenarios, we assume that a researcher
wants to control the FDR, ideally with the BH procedure,
but is unsure if it will work as desired. As we are unable to
provide mathematical proofs for these scenarios, we instead
turn to simulation studies, an approach that is commonly
used to investigate the performance of multiple comparison
procedures [1, 3, 14, 22, 31, 32, 38, 39].


The outline of the paper is as follows. We first discuss the 
specifics of the BH and BY procedures and how to apply 
them when performing multiple hypothesis tests; addition- 
ally, we also look at how multiple comparisons are handled 
in the EDM community by surveying the literature from the 
last five EDM conference proceedings. Then, in the remain- 
der of the paper we evaluate the BH and BY procedures for 
two scenarios that EDM researchers may encounter in their 
work. The first scenario concerns the usage of these proce- 
dures for evaluating and comparing the performance of clas- 
sification models. In this scenario, we make pairwise com- 
parisons of simulated classifiers, using both accuracy and the 
area under the receiver operating characteristic curve (AU- 
ROC) to evaluate their performance; such a situation can 
occur, for example, when trying to find the best performing 
combinations of model algorithms and hyperparameters. 


The next scenario we look at is the analysis of state tran- 
sitions in sequential data. In such an analysis, researchers 
typically run several hypothesis tests to try and determine 
the importance of the various transitions between states. 
Examples of this occur in affect dynamics research, where 
the BH procedure is commonly used [18, 29]. Here, we run 
analyses on simulated sequences of transitions using two dif- 
ferent statistical measures, and we then apply the BH and 
BY procedures and compare the results. Finally, based on 
the results of our numerical experiments, as well as the exist- 
ing literature on controlling the FDR, we discuss the usage 
of the BH and BY procedures in these scenarios. 


2. CONTROLLING FOR MULTIPLE COMPARISONS


2.1 Benjamini-Hochberg and Benjamini-Yekutieli Procedures

In this study we focus on procedures for controlling the false
discovery rate (FDR). The FDR was introduced in [3], and
it has since found widespread use in many scientific fields, including
education research [38], genetics [31, 35], and medical studies [4].
If we let $V$ be the number of false discoveries
and $S$ be the number of true discoveries, then, as done in [3], we
can define the quantity $Q$ as


$$
Q = \begin{cases} \dfrac{V}{V+S}, & \text{if } V + S > 0, \\ 0, & \text{otherwise.} \end{cases} \tag{1}
$$


Then, the FDR is equal to $E[Q]$, the expected proportion of
false discoveries among all the discoveries made.


The family-wise error rate (FWER), which is defined as the
probability of making at least one false discovery when performing
a set of hypothesis tests, is another measure commonly
associated with the problem of multiple comparisons.
Although the Bonferroni correction is probably the most
famous procedure used to control the FWER, there exist
many other alternatives. However, while such procedures
can be useful in situations in which a false discovery must
be avoided at all costs, such as clinical trials of new medical
treatments [16], the downside to these methods is a loss of
statistical power, resulting in an increased likelihood of missing
true discoveries. While procedures for controlling the
FWER are concerned with the occurrence of any false discoveries,
FDR controlling procedures are slightly more permissive,
as they allow a certain proportion of the discoveries
to be false. Thus, the advantage of FDR controlling procedures
is that they typically have greater statistical power
and, as such, a better chance of correctly identifying true discoveries;
the resulting trade-off is that false discoveries are
more likely. However, this trade-off can be beneficial when
a large number of hypothesis tests are being conducted,¹ or
when the research is of a slightly more exploratory nature.
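

For concreteness, the Bonferroni correction mentioned above rejects a null hypothesis only when its p-value is at most alpha / m, where m is the number of tests. The following minimal sketch uses a hypothetical function name and made-up p-values to show how strict this threshold becomes:

```python
def bonferroni_reject(p_values, alpha=0.05):
    """Reject H_i when p_i <= alpha / m; this controls the FWER at level alpha."""
    m = len(p_values)
    return [p <= alpha / m for p in p_values]


# Made-up p-values for illustration only.
p_vals = [0.001, 0.012, 0.02, 0.04, 0.3]
print(bonferroni_reject(p_vals))        # per-test threshold is 0.05 / 5 = 0.01
print([p <= 0.05 for p in p_vals])      # unadjusted decisions, for comparison
```

In this example only one hypothesis survives the correction, whereas four would be flagged without any adjustment, which illustrates the loss of power described above.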


In addition to introducing the FDR to the scientific litera- 
ture, the authors in [3] also outlined the BH procedure. As 
shown there, the BH procedure is mathematically proven 
to control the FDR at a given level when the statistical 
tests—or, equivalently, the test statistics—are independent. 
However, in many practical applications the statistical tests 
may have some underlying dependence between them. With 
these situations in mind, further important work on control- 
ling the FDR appeared in [4], where the authors proved that, 
in addition to the independent case, the BH procedure is 
valid under certain dependency conditions between the sta- 
tistical tests. Among other scenarios, it was shown that the 
BH procedure properly controls the FDR with multivariate 
normal test statistics having nonnegative correlations. Ad- 
ditionally, the authors in [4] introduced the BY procedure for 
situations in which the BH procedure is not valid, and they 
proved that the BY procedure controls the FDR regardless 
of the dependence between the tests. 


In the remainder of this section we discuss the application
of the BH and BY procedures. Consider a statistical analysis
that involves the testing of $m$ null hypotheses. Of these
null hypotheses, $m_0 \le m$ are true null hypotheses (these
correspond to the hypotheses that we expect a statistical
test to classify as not being significant), while the remaining
$m - m_0$ hypotheses are the false null hypotheses. Note that,
in practice, $m_0$ is an unknown value. Let $P_1, \ldots, P_m$ be the
p-values for the $m$ statistical tests, with these values being
listed in ascending order; the corresponding null hypotheses
are then represented by $H_1, \ldots, H_m$. The relationships
between these various terms can be summarized as follows.


              Significant   Not significant
True null     V             U
False null    S             T


• $m$ = total number of hypotheses being tested
• $m_0$ = number of true null hypotheses
• $V$ = number of false positives (i.e., false discoveries or
  type I errors)
• $S$ = number of true positives
• $T$ = number of false negatives (i.e., type II errors)
• $U$ = number of true negatives


¹ As a relatively extreme example, statistical analyses in
genetics research can involve thousands of hypothesis tests,
and in such cases FWER controlling procedures can be
overly restrictive [1].

Let $q$ represent our chosen threshold (or level) for controlling
the FDR, and define FDRmax $= \frac{m_0}{m} q$. If the statistical
tests are independent, or if they satisfy certain dependency
conditions, it was shown in [4] that the FDR resulting from
an application of the BH procedure is at most FDRmax. Such
an application works as follows. Assuming once again that
the p-values are in ascending order, we find the largest
integer $k$ such that $P_k \le \frac{k}{m} q$. Then, we simply reject all the
null hypotheses $H_i$ for which $i \le k$.


Next, as the BY procedure controls the FDR under arbitrary
dependence assumptions, it is necessarily more conservative
when rejecting a null hypothesis. This takes the form of a
lower threshold for the upper bound used to determine the
“significance” of the p-values. Specifically, we find the largest
integer $k$ such that $P_k \le \frac{k}{m \, c(m)} q$, where $c(m) = \sum_{i=1}^{m} \frac{1}{i}$.
Using this procedure, it was shown in [4] that the resulting
FDR is bounded above by FDRmax $= \frac{m_0}{m} q$, regardless of the
type of dependence between the statistical tests.


To see how these procedures work, we next look at an exam- 
ple. Suppose we run 10 separate statistical tests (m = 10) 
that return the following p-values. 


0.002, 0.008, 0.011, 0.013, 0.023, 
0.028, 0.092, 0.214, 0.647, 0.853 


Next, we compare these p-values to the formulas used for
the BH and BY thresholds, using a value of $q = 0.1$; for
added context, we also include the results for the Bonferroni
correction. For each method, the thresholds that correspond
to statistically significant p-values are marked with an
asterisk.


 k    P_k      BH: (k/m) q    BY: k q / (m c(m))    Bonferroni: q/m
 1    0.002    0.01*          0.003*                0.01*
 2    0.008    0.02*          0.007*                0.01*
 3    0.011    0.03*          0.010*                0.01
 4    0.013    0.04*          0.014*                0.01
 5    0.023    0.05*          0.017                 0.01
 6    0.028    0.06*          0.020                 0.01
 7    0.092    0.07           0.024                 0.01
 8    0.214    0.08           0.027                 0.01
 9    0.647    0.09           0.031                 0.01
10    0.853    0.1            0.034                 0.01


For the BH procedure, we can see that $k = 6$ is the largest
value for which $P_k \le \frac{k}{m} q$, as we have 0.028 < 0.06. Thus,
the BH procedure, using a value of $q = 0.1$, would reject the
null hypothesis for the statistical tests corresponding to the
lowest six p-values. Next, for the BY procedure we see that
$k = 4$ is the largest value for which $P_k$ is less than the
corresponding threshold; in this case, we have 0.013 < 0.014.
It’s worth noting that, in this example, even though both P2 
and P3 are not below the corresponding thresholds, the BY 
procedure still classifies them as being statistically signifi- 
cant. This is a feature of FDR controlling procedures that, 
in many cases, allows them to be more permissive than pro- 
cedures for controlling the FWER. 
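

The worked example above can be reproduced with a few lines of Python. The following is a minimal sketch, not the authors' code: the function names are hypothetical, and it assumes the step-up rule described in this section (reject the hypotheses with the k smallest p-values, where k is the largest index whose p-value is at or below its threshold).

```python
def harmonic(m):
    """c(m) = sum_{i=1}^{m} 1/i, the extra factor used by the BY thresholds."""
    return sum(1.0 / i for i in range(1, m + 1))


def step_up_rejections(p_values, q, dependence_factor=1.0):
    """Return k, the number of hypotheses rejected by a step-up procedure.

    dependence_factor = 1 gives the BH thresholds (k/m) q;
    dependence_factor = c(m) gives the BY thresholds k q / (m c(m)).
    """
    m = len(p_values)
    ordered = sorted(p_values)
    k = 0
    for i, p in enumerate(ordered, start=1):
        # Remember the largest index whose p-value is within its threshold.
        if p <= i * q / (m * dependence_factor):
            k = i
    return k


def bonferroni_rejections(p_values, q):
    """Count the p-values at or below the single Bonferroni threshold q/m."""
    m = len(p_values)
    return sum(p <= q / m for p in p_values)


p_values = [0.002, 0.008, 0.011, 0.013, 0.023,
            0.028, 0.092, 0.214, 0.647, 0.853]
q = 0.1

k_bh = step_up_rejections(p_values, q)
k_by = step_up_rejections(p_values, q, dependence_factor=harmonic(len(p_values)))
k_bonf = bonferroni_rejections(p_values, q)
print(k_bh, k_by, k_bonf)  # 6 4 2
```

Because the rule keeps the largest qualifying index, the second and third p-values are rejected by BY even though they exceed their own thresholds, which matches the observation in the text.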


2.2 Applications in EDM Research 


To understand how EDM research is controlling for multi- 
ple comparisons, we reviewed EDM conference proceedings 
from the last five years (2016-2020). We found that, among 
the 22 papers that reported controlling for multiple
comparisons,² half used the Bonferroni correction and half used
the BH procedure, with no studies using the BY procedure. 
Based on the method used to perform the comparisons, the 
studies can be partitioned as follows: group comparison (8), 
pairwise comparison (8; including pairwise model compari- 
son), correlation (4), and regression analysis (2). The studies 
involving group comparisons used statistical methods such 
as the Mann-Whitney U test, chi-squared test, t-test, and 
ANOVA. The studies employing pairwise comparisons used 
methods such as the Kruskal-Wallis test, Mann-Whitney U 
test, McNemar’s test, chi-squared test, and t-test. Overall, 
these 22 studies investigated diverse educational constructs 
in virtual learning environments including affect, student 
behavior in MOOCs, help-seeking, collaboration, and self- 
regulation. 


The choice between the Bonferroni correction and the BH 
procedure varied in the studies, as the selection was not com- 
pletely determined by the study methodology. For instance, 
an exploratory study used the more conservative Bonferroni 
method for a correlational analysis [61], while an experimen- 
tal study with group comparisons used the less conservative 
BH procedure [46]. For EDM research, selecting between the 
Bonferroni correction and the BH procedure may not be uni- 
versal and likely depends on the context of the study. As an 
example, consider that an analysis examining student demo- 
graphic differences on an important educational construct— 
such as self-efficacy, affect, or learning—likely has fewer data 
samples from underrepresented minorities [20]. In such a 
case, penalizing the statistical power with a more conser- 
vative method like the Bonferroni correction may lead to 
missed opportunities for critical discoveries related to eq- 
uity. On the other hand, contrast this with the evaluation 
of an expensive and large-scale educational technology inter- 
vention in a public school system; given the costs involved, 
both financially and otherwise, it could be argued that such 
an evaluation requires a more conservative approach to con- 
trol for false discoveries. 


² See Section 8 for the full list of references.


More broadly, EDM research may not always involve large
data sets. This is particularly true for educational constructs
that require resource-intensive data collection procedures,
e.g., training coders, gathering approvals, and conducting
classroom studies. Hence, using the Bonferroni correction 
to control for multiple comparisons at the expense of losing 
statistical power may not always be affordable. In contrast, 
using the BH procedure in scenarios that violate its sta- 
tistical assumptions may lead to invalid conclusions. Our 
review of EDM studies from the last five years also revealed 
that the field may not be taking advantage of the BY proce- 
dure, especially in scenarios where it is difficult to meet the 
assumptions of the BH procedure. These observations are 
what motivated us to investigate the use of the BH and BY 
procedures in research settings relevant to EDM. 


3. METHODS 


In this section we outline the general procedure we follow 
for our simulation studies. Since evaluating multiple com- 
parison procedures requires knowledge of whether a null hy- 
pothesis is true, and as this isn’t typically known with real 
data, simulations are commonly used for such analyses. In 
all of our experiments, we begin by generating simulated 
data according to a given probability distribution. While 
the specifics of this procedure vary for the two scenarios we 
consider, the common thread is that this must be done in
such a way that we control whether each null hypothesis
is true. For example, in our comparisons of simulated
classification models, the performance of each model is con- 
trolled by a single parameter; thus, when this parameter 
differs for two models, the null hypothesis that the models 
perform equally well is false. 


Another important detail is that, as we are focusing on two 
particular scenarios, we can generate simulated data specific 
to these scenarios. That is, for the model comparison exper- 
iments we simulate both the classifier predictions and the 
ground truth labels; then, for the state transition analysis 
we generate simulated sequences of states. By simulating 
the underlying data for each scenario, we are attempting to 
evaluate the BH and BY procedures in conditions that are 
as realistic as possible. In comparison, other studies that 
are more general in nature may simulate the distribution of 
the test statistics, rather than the underlying data, when 
evaluating multiple comparison procedures. 


After generating the data for a simulation run, we perform 
our statistical tests and compute the corresponding p-values. 
Once this is done, we then apply the BH and BY procedures 
for various threshold values q—specifically, we use 0.05, 0.1, 
and 0.15 in all our evaluations. While a value of 0.05 is 
commonly used, it’s been argued that this threshold may 
be too low for some applications [26]; thus, we evaluate a 
range of values in our simulations. Based on the statisti- 
cal significance results from our application of the BH and 
BY procedures, we can compute Q, the proportion of false 
discoveries among all the discoveries made, using (1). To 
obtain our estimate of the FDR, we then compute the av- 
erage of Q over a total of 10,000 simulation runs. For the 
various values of $q$, we compare these FDR estimates to the
values of FDRmax as defined in Section 2.1. 
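

The estimation loop described above can be sketched in a simplified setting. The snippet below is an illustration under stated assumptions, not the authors' simulation: it draws independent Uniform(0, 1) p-values (the distribution of a p-value under a true null with a continuous test statistic) instead of simulating underlying data, treats all m hypotheses as true nulls, uses 2,000 runs instead of 10,000 to keep it quick, and takes Q to be 0 when nothing is rejected. All function names are hypothetical.

```python
import random


def bh_rejected(p_values, q):
    """Indices of the hypotheses rejected by the BH procedure at level q."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    k = 0
    for rank, idx in enumerate(order, start=1):
        if p_values[idx] <= rank * q / m:
            k = rank
    return set(order[:k])


def false_discovery_proportion(rejected, true_nulls):
    """Q: share of rejected hypotheses that are true nulls (0 if none rejected)."""
    if not rejected:
        return 0.0
    return len(rejected & true_nulls) / len(rejected)


def estimate_fdr(m, q, n_runs, rng):
    """Average Q over n_runs, with all m hypotheses being true nulls."""
    true_nulls = set(range(m))
    total = 0.0
    for _ in range(n_runs):
        # Independent p-values under true nulls are uniform on (0, 1).
        p_values = [rng.random() for _ in range(m)]
        total += false_discovery_proportion(bh_rejected(p_values, q), true_nulls)
    return total / n_runs


rng = random.Random(0)
fdr_hat = estimate_fdr(m=15, q=0.1, n_runs=2000, rng=rng)
print(round(fdr_hat, 3))
```

In this idealized independent case the estimate should land near FDRmax = q, which gives a baseline for reading the dependent-test results later in the paper.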


At this point, it’s worth mentioning that the value of $Q$
(and, hence, the estimated FDR value) can be very different
from the false positive rate.³ Using the notation in (2),
the false positive rate can be written as $\frac{V}{V+U}$. In comparison,
$Q$ is computed with the formula $\frac{V}{V+S}$, which has a
different denominator. Thus, while the FDR is the expected
proportion of false discoveries among all the rejected null
hypotheses, the false positive rate is the (expected) proportion
of false discoveries among all the true null hypotheses.
Consider the following example. Assume we are testing
20 total hypotheses, all of which are true null hypotheses
($m_0 = m = 20$). Furthermore, assume that one false positive
is recorded. Then, the false positive rate for this set of
tests would be equal to $\frac{1}{1+19} = 0.05$. However, applying (1)
gives a value of $Q = \frac{1}{1+0} = 1$. This discrepancy is something
to keep in mind as we analyze the results from our
simulation studies in subsequent sections.
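

The arithmetic in this example can be checked directly. The sketch below uses hypothetical function names and the V, S, U counts from Section 2.1; the convention that Q is 0 when nothing is rejected is an assumption stated here for completeness.

```python
def false_positive_rate(v, u):
    """V / (V + U): false discoveries among all true null hypotheses."""
    return v / (v + u)


def false_discovery_proportion(v, s):
    """Q = V / (V + S): false discoveries among all rejected hypotheses."""
    return v / (v + s) if (v + s) > 0 else 0.0


# The example from the text: 20 true nulls, exactly one of them rejected.
v, s, u = 1, 0, 19
fpr = false_positive_rate(v, u)
fdp = false_discovery_proportion(v, s)
print(fpr, fdp)  # 0.05 1.0
```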


4. MODEL COMPARISONS 


The first scenario we study concerns the comparison of sev- 
eral classification models on a fixed set of test or validation 
data. A common example of this occurs during the model 
building process, where it may be necessary to evaluate the 
performance of many different combinations of classification 
models and hyperparameters. In such a case, it can be help- 
ful for the researcher to run statistical tests to more precisely 
quantify the differences in performance. To that end, we 
focus on the pairwise comparisons of the classifiers, where 
we assume that the classifiers could have different underly- 
ing algorithms—e.g., logistic regression vs. random forest— 
or the same algorithm with different hyperparameters. We 
evaluate each pair of classifiers by looking at both the accu- 
racy and the area under the receiver operating characteristic 
curve (AUROC). To measure the possible difference between 
the accuracy values of the models, we use McNemar’s test 
[13, 27]. When conducting pairwise comparisons of classi- 
fier accuracy on a fixed set of test data—as opposed to a 
procedure such as k-fold cross-validation, where the test set 
varies—using McNemar’s test is recommended [10]; for these 
evaluations we use the implementation in the statsmodels 
[33] Python library. Then, to compare the AUROC values 
we use DeLong’s test [9], a method developed to statistically 
test for differences in AUROC values; specifically, we apply 
the fast version of the algorithm outlined in [36].⁴
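

To make the accuracy comparison concrete, here is a minimal standard-library sketch of an exact (binomial) McNemar test on the discordant pairs. It is an illustration only and is not the statsmodels implementation used in the paper; the function names and the toy data are hypothetical.

```python
from math import comb


def discordant_counts(y_true, pred_1, pred_2):
    """Count cases where exactly one of the two classifiers is correct."""
    b = c = 0
    for y, a1, a2 in zip(y_true, pred_1, pred_2):
        ok1, ok2 = (a1 == y), (a2 == y)
        if ok1 and not ok2:
            b += 1  # classifier 1 right, classifier 2 wrong
        elif ok2 and not ok1:
            c += 1  # classifier 2 right, classifier 1 wrong
    return b, c


def mcnemar_exact(b, c):
    """Two-sided exact p-value: under the null, b ~ Binomial(b + c, 0.5)."""
    n = b + c
    if n == 0:
        return 1.0
    k = min(b, c)
    tail = sum(comb(n, i) for i in range(k + 1)) / 2 ** n
    return min(1.0, 2.0 * tail)


# Toy data: classifier 1 is always right, classifier 2 misses five points.
y_true = [1, 1, 0, 0, 1, 0, 1, 0]
pred_1 = [1, 1, 0, 0, 1, 0, 1, 0]
pred_2 = [0, 0, 1, 1, 0, 0, 1, 0]

b, c = discordant_counts(y_true, pred_1, pred_2)
p_value = mcnemar_exact(b, c)
print(b, c, p_value)  # 5 0 0.0625
```

Only the discordant pairs enter the test, which is why it suits two classifiers evaluated on the same fixed test set.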


Our simulations use the following procedure. We assume
that we are evaluating the performance of a binary classifier
on a test set containing $n$ data points; for these simulations
we use $n$-values of 500, 1000, and 5000. For each value of $n$,
we sample $n$ numbers uniformly at random from 0.01 to 0.99;
we refer to this set of numbers as $U_n$. In each simulation
run, the numbers in $U_n$ are used to generate the labels for
our data using the following procedure. Let $i$ be an integer
from 1 to $n$, and let $p_i \in U_n$. With probability $p_i$ we assign
a label of 1 to $y_i$; otherwise, with probability $1 - p_i$, it is
given a label of 0. Note that the set $U_n$ is generated once
for each value of $n$, and this same set is then used repeatedly
for all of our simulation runs with a test set of size $n$.


³ That is, while “false discovery” and “false positive” are used
interchangeably, the terms “false discovery rate” and “false
positive rate” have different definitions.

⁴ The code for our implementation of the algorithm in [36],
as well as for running all of our experiments, is available at
https://github.com/jmatayoshi/multiple-comparisons.


Table 1: Accuracy and AUROC values for an example simulation
run using a test set of size $n = 1000$.


$\sigma$     0.1     0.1     0.1     0.5     1       2
Accuracy     0.733   0.724   0.732   0.706   0.651   0.606
AUROC        0.824   0.821   0.824   0.787   0.721   0.655


We next describe our procedure for simulating the classifier
predictions. Let $c_{ij}$ represent the predicted probability
given by classifier $j$ for the $i$-th data point in our test set. To
generate $c_{ij}$, we begin by converting $p_i \in U_n$ to a z-score.
Then, to add noise to the classifier’s prediction we randomly
sample a value, $s_{ij}$, from a normal distribution with mean 0
and standard deviation $\sigma_j$, add this to the z-score, and then
convert everything back to a probability; the resulting value
is $c_{ij}$. The size of $\sigma_j$ controls the performance of the
classifier, with lower values giving predicted probabilities that are
less noisy and more likely to align with the class labels. Let
$\Phi$ denote the cumulative distribution function (CDF) of the
standard normal distribution. Our procedure for generating
the classifier predictions can be summarized as follows.


1. $z_i = \Phi^{-1}(p_i)$
2. Draw sample value $s_{ij}$ from $\mathcal{N}(0, \sigma_j^2)$
3. $c_{ij} = \Phi(z_i + s_{ij})$
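

The label and prediction steps above can be sketched with the Python standard library. This is an illustrative sketch, not the authors' released code: the function names are hypothetical, and the 0.5 cutoff used to turn a predicted probability into a class when computing accuracy is an assumption.

```python
import random
from statistics import NormalDist

STD_NORMAL = NormalDist()  # standard normal: mean 0, standard deviation 1


def make_base_probabilities(n, rng):
    """U_n: n values drawn uniformly from [0.01, 0.99], generated once per n."""
    return [rng.uniform(0.01, 0.99) for _ in range(n)]


def draw_labels(base_probs, rng):
    """y_i = 1 with probability p_i, otherwise 0 (redrawn in every run)."""
    return [1 if rng.random() < p else 0 for p in base_probs]


def draw_predictions(base_probs, sigma, rng):
    """c_ij = Phi(Phi^{-1}(p_i) + s_ij), with s_ij drawn from N(0, sigma^2)."""
    preds = []
    for p in base_probs:
        z = STD_NORMAL.inv_cdf(p)          # step 1: probability to z-score
        noise = rng.gauss(0.0, sigma)      # step 2: classifier-specific noise
        preds.append(STD_NORMAL.cdf(z + noise))  # step 3: back to probability
    return preds


def accuracy(labels, preds, cutoff=0.5):
    """Share of points whose thresholded prediction matches the label."""
    hits = sum((c >= cutoff) == (y == 1) for y, c in zip(labels, preds))
    return hits / len(labels)


rng = random.Random(1)
base_probs = make_base_probabilities(1000, rng)
labels = draw_labels(base_probs, rng)
acc_low_noise = accuracy(labels, draw_predictions(base_probs, 0.1, rng))
acc_high_noise = accuracy(labels, draw_predictions(base_probs, 2.0, rng))
print(acc_low_noise, acc_high_noise)
```

Two classifiers built with the same sigma have identical expected performance, which is what makes the corresponding pairwise null hypothesis true by construction.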


To get an idea of the effect of different values of $\sigma$ on the
performance of our simulated classifier predictions, in Table 1
we show the accuracy and AUROC values from one simulation
run, using different values of $\sigma$ and a test set size of
$n = 1000$. The three classifiers with $\sigma$ values of 0.1 have the
best performance, with accuracy values from 0.72 to 0.73
and AUROC values around 0.82. The other classifiers, to
varying degrees, perform worse, with the lowest accuracy
and AUROC values at roughly 0.61 and 0.66, respectively.


Our initial analysis simulates the pairwise comparison of six
different classification models, where all the classifiers are
assumed to perform equally; specifically, we use a value of
$\sigma = 0.5$ for each model. Using our previously described
procedure, we generate a total of 10,000 simulation runs for
each value of $n$. Our experimental setup results in $\binom{6}{2} = 15$
pairwise comparisons ($m = 15$), and as there are no underlying
differences between the simulated classifiers, we have 15
true null hypotheses ($m_0 = 15$). As such, if the conditions
for the BH procedure are satisfied, we expect the FDR to
be less than FDRmax $= \frac{15}{15} q = q$. The results are shown in
Figures 1 and 2, where we display the estimated FDR values
for the BH and BY procedures, for different combinations
of test set sizes and values of $q$. Using both McNemar’s test
and DeLong’s test, the BH procedure appears to control the
FDR by keeping it below the corresponding FDRmax value,
shown by the dashed line, in all cases; that is, for all
combinations of test set sizes and $q$. In comparison, the BY
procedure is much more conservative, with each estimated
FDR value far below the FDRmax line.


For our second set of simulations, we use the values of $\sigma$
that appear in Table 1 to generate six different models. As
there are three models with the same value of $\sigma = 0.1$, we
have $\binom{3}{2} = 3$ true null hypotheses ($m_0 = 3$) out of 15 total
comparisons ($m = 15$). Thus, under the appropriate conditions
the BH procedure should keep the FDR at or below
FDRmax $= \frac{3}{15} q = \frac{1}{5} q$. The results are given in Figures 3 and
4, where we can see that the estimated FDR values using the
BH procedure are at or below the value of FDRmax, given
by the dashed line, in all cases; that is, for all combinations
of test set sizes and $q$. As before, the estimated FDR values
from the BY procedure are very low, with each value again
appearing far below the corresponding FDRmax line.
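

The counts of comparisons and true null hypotheses in this second setup, and the resulting FDRmax values, can be verified with a short sketch. It assumes, as in the text, that a pairwise null hypothesis is true exactly when the two classifiers share the same noise level; the variable names are hypothetical.

```python
from itertools import combinations

sigmas = [0.1, 0.1, 0.1, 0.5, 1, 2]  # the six noise levels from Table 1

pairs = list(combinations(range(len(sigmas)), 2))
m = len(pairs)
# A comparison is a true null when both classifiers use the same sigma.
m0 = sum(sigmas[i] == sigmas[j] for i, j in pairs)

fdr_max = {q: m0 / m * q for q in (0.05, 0.1, 0.15)}
print(m, m0)
for q, bound in fdr_max.items():
    print(q, round(bound, 3))
```

With m = 15 and m0 = 3, the bound is one fifth of q, so the dashed FDRmax lines for this setup sit at 0.01, 0.02, and 0.03.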


These results are seemingly consistent with previous works 
analyzing the performance of the BH procedure with pair- 
wise comparisons [21, 38]. The findings from several of these 
studies are summarized in [39], where the author states that 
in “all the studies, for all configurations of true and false hy- 
potheses simulated, for balanced and for non-balanced de- 
signs, normal and non-normal distributions, the BH proce- 
dure controlled the FDR.” Thus, combining these previous 
results with our experiments from this section, there appears 
to be good evidence that the BH procedure properly controls 
the FDR in the case of pairwise comparisons of classification 
models. We return to this subject in the discussion. 


5. TRANSITIONS IN SEQUENTIAL DATA 


In our second scenario we look at data that are sequential 
in nature, as examples of such data appear in many areas of 
educational research. One particular focus with sequential 
data is the analysis of transitions between different states— 
or events—in these sequences. Researchers are often inter- 
ested in understanding if transitions between certain pairs 
of states are significant, either because they happen often or 
because they rarely appear. Typically in such cases, many 
pairs of states are evaluated with statistical tests, thus neces- 
sitating a correction for multiple comparisons. For example, 
past studies have analyzed logs of student actions in learn- 
ing systems, in an attempt to understand the differences 
between productive and unproductive transitions between 
activities within these systems [5, 6]. Another example is 
affect dynamics research, which studies sequences of student 
affective states, with the goal of understanding how students 
transition between these different states. Previous works in 
this area have used the BH procedure to control the FDR 
[18, 29], and as such the goal of our next analysis is to in- 
vestigate the appropriateness of using this procedure when 
analyzing state transitions. 


5.1 Experimental Setup 

Our numerical experiments for sequential data evaluate the 
BH and BY procedures on simulated sequences of states. 
Each of these sequences could represent, for example, a stu- 
dent’s affective states while working in a learning system. 
The states are randomly sampled according to the proba- 
bility distribution given in Table 2; each entry in the table 
gives the probability of sampling the next state (column) 
based on the value of the previous state (row). For exam- 
ple, suppose that C is the previous state. In this case, A 
has a probability of 0.2 of being the next state, B has a 
probability of $0.2 - \gamma$ of being the next state, and so on.


Figure 1: Comparison of the estimated FDR for the BH and BY procedures, using McNemar’s test and six classifiers with
the same value of $\sigma = 0.5$. Vertical lines represent the 99% confidence interval for each estimated FDR value.

Figure 2: Comparison of the estimated FDR for the BH and BY procedures, using DeLong’s test and six classifiers with the
same value of $\sigma = 0.5$. Vertical lines represent the 99% confidence interval for each estimated FDR value.

Figure 3: Comparison of the estimated FDR for the BH and BY procedures, using McNemar’s test and the $\sigma$ values in Table 1.
Vertical lines represent the 99% confidence interval for each estimated FDR value.

Figure 4: Comparison of the estimated FDR for the BH and BY procedures, using DeLong’s test and the $\sigma$ values in Table 1.
Vertical lines represent the 99% confidence interval for each estimated FDR value.


Table 2: Probability distribution used to generate the simulated
sequences of states. Each entry represents the probability
of making a transition to the next state (column),
given the previous state (row).


prev \ next   A      B                C      D                E
A             0.2    $0.2+\gamma$     0.2    $0.2-\gamma$     0.2
B             0.2    0.2              0.2    0.2              0.2
C             0.2    $0.2-\gamma$     0.2    $0.2+\gamma$     0.2
D             0.2    0.2              0.2    0.2              0.2
E             0.2    0.2              0.2    0.2              0.2


Table 3: Marginal model coefficient p-values from one simulation
run using $\gamma = 0.05$. With a threshold of $q = 0.05$,
both the BH and BY procedures give the same statistical
significance results for this example; namely, only the four
transition pairs with sample probabilities modified by $\gamma$ are
statistically significant.


prev \ next   A       B       C       D       E
A             0.252   0.000   0.335   0.000   0.703
B             0.496   0.365   0.327   0.864   0.252
C             0.035   0.000   0.527   0.000   0.569
D             0.260   0.652   0.080   0.980   0.889
E             0.581   0.099   0.800   0.869   0.179


For our simulations, we use two different values for $\gamma$: 0,
which results in all 25 hypotheses being true null hypotheses;
and 0.05, which results in 21 true null hypotheses, out of
the 25. For each value of $\gamma$, we generate $n$ sequences
consisting of 20 states each. To generate these sequences, the first
state in each sequence is sampled randomly from the five
choices, and then all subsequent states are sampled according
to the probability distribution in Table 2. For each set
of $n$ sequences we evaluate our statistical tests (described in
Sections 5.2 and 5.3) and then compute the resulting value
for $Q$; this constitutes one simulation run. We then perform
10,000 simulation runs for each value of $n$ in order to obtain
an estimate of the true FDR. For this analysis, we use the
following values of $n$: 50, 100, and 200.
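

The sequence-generation step of Section 5.1 can be sketched as follows. This is a minimal illustration using the standard library, not the authors' code: the function names are hypothetical, and the transition probabilities are those of Table 2, in which only the A and C rows depend on the perturbation parameter (called gamma in the code).

```python
import random

STATES = ["A", "B", "C", "D", "E"]


def transition_table(gamma):
    """Table 2 as a dict: previous state -> next-state probabilities (A..E)."""
    table = {s: [0.2] * 5 for s in STATES}
    table["A"] = [0.2, 0.2 + gamma, 0.2, 0.2 - gamma, 0.2]
    table["C"] = [0.2, 0.2 - gamma, 0.2, 0.2 + gamma, 0.2]
    return table


def generate_sequence(table, length, rng):
    """First state uniform over the five states, then follow the table."""
    seq = [rng.choice(STATES)]
    while len(seq) < length:
        seq.append(rng.choices(STATES, weights=table[seq[-1]])[0])
    return seq


def generate_run(gamma, n, rng, length=20):
    """One simulation run: n sequences of the given length."""
    table = transition_table(gamma)
    return [generate_sequence(table, length, rng) for _ in range(n)]


rng = random.Random(2)
sequences = generate_run(gamma=0.05, n=50, rng=rng)

# Transitions whose probability differs from 0.2 are the false nulls.
table = transition_table(0.05)
n_modified = sum(abs(p - 0.2) > 1e-12 for row in table.values() for p in row)
print(len(sequences), len(sequences[0]), 25 - n_modified)  # 50 20 21
```

Four entries are perturbed when the parameter is positive, which leaves the 21 true null hypotheses stated in the text.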


The L statistic, originally introduced in [12], is intended to
be used as a measure of the significance of different pairs
of transitions, and it has been widely applied in the study
of affect dynamics [11, 12, 18]. Given two states A and B,
it measures the likelihood of transitions from A to B while
taking into account the overall frequency at which B occurs.
However, several recent works have revealed issues with the
use of the L statistic for the analysis of state transitions
[7, 18, 19]. Thus, for our simulations we use two newer
methods that have been developed in response to the problems
with the L statistic. First, in Section 5.2 we look at the
performance of the BH procedure when used in combination
with the marginal model approach outlined in [25]. Then,
in Section 5.3 we evaluate the BH procedure when it is used
with the modified version of the L statistic from [24].


5.2 Marginal Model 

To estimate the influence that starting in state A has on the 
probability of making a transition to B, in this section we use 
the marginal model regression procedure from [25]. In this 
approach, the regression model has a binary response vari- 
able, where the value of this variable is one if the next state 
is equal to B, and it is zero otherwise. Based on the binary 
response variable, we use the logit as our link function. Our 
predictor—or, independent—variable is also binary, with a 
value of one if the previous state is equal to A and zero 
otherwise. We can summarize this procedure as follows. 


• $y = y_{it}$: one if B is the next state for student $i$ at time
  $t$; zero otherwise
• $x = x_{it}$: one if A is the previous state for student $i$ at
  time $t$; zero otherwise


Letting S represent the standard logistic function, the re- 
gression equation then has the form 


$$P(y_{it} = 1 \mid x_{it}) = S(\beta_0 + \beta_1 x_{it}) = \frac{1}{1 + e^{-(\beta_0 + \beta_1 x_{it})}}. \tag{3}$$


When $x = 1$ the regression model returns an estimate for
$P(B \mid A)$, the probability of a transition to B, given that
the starting state is A. Then, when $x = 0$ it returns an
estimate for $P(B \mid \bar{A})$, the probability of a transition to B,
given that the starting state is not A. Thus, to measure
the importance of starting in state A, we focus on testing
if the value of $\beta_1$ is significantly different from zero. This
is done using a two-tailed z-test on the value of $\beta_1$ for each
individual fit of the regression model.
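

To see what the coefficient being tested measures, note that with a single binary predictor an ordinary logistic regression fits the two observed transition rates exactly, so the slope equals the log odds ratio between them. The sketch below illustrates only that point estimate; it does not perform the z-test and ignores the GEE-based variance estimation described next. The function names and the counts are hypothetical.

```python
from math import exp, log


def logistic(x):
    """Standard logistic function S(x)."""
    return 1.0 / (1.0 + exp(-x))


def logit(p):
    """Inverse of the logistic function (log odds)."""
    return log(p / (1.0 - p))


def coefficients_from_counts(a_to_b, a_total, other_to_b, other_total):
    """beta_0 and beta_1 that reproduce the two observed transition rates."""
    p_b_given_a = a_to_b / a_total
    p_b_given_not_a = other_to_b / other_total
    beta_0 = logit(p_b_given_not_a)          # x = 0: previous state is not A
    beta_1 = logit(p_b_given_a) - beta_0     # shift in log odds when x = 1
    return beta_0, beta_1


# Hypothetical counts: 30 of 100 transitions out of A go to B,
# and 20 of 100 transitions out of any other state go to B.
beta_0, beta_1 = coefficients_from_counts(30, 100, 20, 100)
print(round(beta_1, 3))
```

A positive slope therefore means that a transition to B is more likely after A than after the other states, which is the interpretation used in Section 5.4.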


Finally, as the sequential data used in these analyses typically
take the form of repeated measurements on a student,
the result is a set of dependent (or correlated) data. To
account for this dependence, as outlined in [25] we use a
marginal model, based on generalized estimating equations
(GEE) [17, 23], to estimate the logistic regression coefficients;
in particular, we use the GEE implementation from
the statsmodels Python library.


As before, let $m$ denote the total number of statistical tests,
with $m_0 \le m$ representing the number of true null hypotheses.
Using the BH procedure with a value of $\gamma = 0$, we
have $m_0 = m$; as such, we would expect the FDR to be less
than FDRmax $= \frac{25}{25} q = q$ if the BH conditions are satisfied.
Then, for all values of $\gamma > 0$ we would expect the FDR to be
less than FDRmax $= \frac{21}{25} q$, assuming the BH conditions are
satisfied, as $m_0 = 21$ of the tests are true null hypotheses.


The first set of results, using a value of $\gamma = 0$, is shown in
Figure 5. Here, we can see that in all cases the estimated
FDR values from the BH procedure are above the theoretical
upper bound of FDRmax, shown by the dashed line.
The gap is particularly notable with smaller numbers of
sequences. On the other hand, the BY procedure offers much
more stringent control of the FDR, with all of the estimated
values appearing below the FDRmax line. Figure 6 then
shows the results from using a value of $\gamma = 0.05$. Overall,
the picture appears similar to the $\gamma = 0$ case, with the
estimated FDR values from the BH procedure always appearing
above the FDRmax line, and with the difference again being
more pronounced with smaller numbers of sequences.


5.3 Removing Self-Transitions

Our final set of experiments investigates a specific situa- 
tion in sequential data analysis that occurs when researchers 
want to remove the influence of repeated states. To do this, 
many researchers in the affect dynamics community remove 
self-transitions—i.e., transitions where the same state is re- 
peated for more than one step—before analyzing the data 
[18]. However, this procedure has been shown to overesti- 
mate the significance of transitions when used with the L 
statistic [19]. Thus, for this analysis we instead use a modi- 
fied version of the L statistic, named L* [24]. 


DEFINITION 1. Let A and B be two states, and let

$$T_{\bar{A}} = \{\text{transitions where the next state is not } A\}. \tag{4}$$

Then, we define

$$L^*(A \to B) = \frac{P(B \mid A, T_{\bar{A}}) - P(B \mid T_{\bar{A}})}{1 - P(B \mid T_{\bar{A}})}, \tag{5}$$

where $P(B \mid A, T_{\bar{A}})$ is the probability of a transition to B in
$T_{\bar{A}}$, given that the starting state is A, while $P(B \mid T_{\bar{A}})$ is the
overall probability of a transition to B in $T_{\bar{A}}$.


The base rate of the state B, given by $P(B \mid T_{\bar{A}})$ in (5),
can be computed either individually for each sequence, or
averaged over the entire set of sequences. For the computations
in the remainder of this work, we compute these rates
individually per sequence.
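

A per-sequence computation of L* can be sketched as follows. This is an illustration rather than the implementation from [24]: it assumes L* is the difference between the conditional rate P(B | A, T) and the base rate P(B | T), divided by one minus the base rate, where T is the set of transitions whose next state is not A. The function names, the collapsing of repeated states, and the choice to return None when the statistic is undefined are all assumptions.

```python
def remove_self_transitions(sequence):
    """Collapse each run of a repeated state into a single occurrence."""
    collapsed = []
    for state in sequence:
        if not collapsed or collapsed[-1] != state:
            collapsed.append(state)
    return collapsed


def l_star(sequence, a, b):
    """L*(a -> b) for one sequence, with the base rate taken per sequence."""
    transitions = list(zip(sequence, sequence[1:]))
    # Keep only the transitions whose next state is not a.
    kept = [(prev, nxt) for prev, nxt in transitions if nxt != a]
    from_a = [nxt for prev, nxt in kept if prev == a]
    if not kept or not from_a:
        return None  # no usable transitions: statistic undefined here
    p_b_given_a = from_a.count(b) / len(from_a)
    p_b = sum(nxt == b for _, nxt in kept) / len(kept)
    if p_b == 1.0:
        return None  # denominator would be zero
    return (p_b_given_a - p_b) / (1.0 - p_b)


raw = list("AABACCABDB")
sequence = remove_self_transitions(raw)
value = l_star(sequence, "A", "B")
print("".join(sequence), value)
```

In this toy sequence, two of the three retained transitions out of A go to B, against a base rate of three in five, so the statistic is small and positive.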


Our analysis using L* applies the statistic to the sequences
from our experiments in Section 5.2. Specifically, we take
each sequence and, for each pair of transition states, compute
(5). To test for statistical significance, we follow the
procedure outlined in [24] and apply a two-tailed t-test to
the L* values. The results for the $\gamma = 0$ and $\gamma = 0.05$
sequences are shown in Figures 7 and 8, respectively. While
perhaps not quite as prominent as with the marginal model
procedure, there are several examples where the estimated
FDR values from the BH procedure are clearly above the
FDRmax line. As with the marginal model procedure, the
worst cases occur with the smallest number of sequences.


5.4 Dependence of the Statistical Tests 

The experiments in this section provide evidence that, when 
used in combination with either the marginal model proce- 
dure or L*, the BH procedure does not always control the 
FDR at the desired level; in turn, this may indicate that the 
conditions for applying the BH procedure are not satisfied. 
In the remainder of this section, we outline two arguments 
that show the assumption of independence is violated be- 
tween the statistical tests used in these analyses. Note that 
these are not rigorous mathematical proofs; rather, our goal 
here is to simply give some intuition into the relationships 
between the statistical tests. 


Consider a set of sequential data consisting of possible states
A, B, C, D, and E. For states A and B, let $\beta_{A,B}$ represent
the value of $\beta_1$ in (3) for transitions of the form $A \to B$.
Suppose that the following inequalities hold.

$$\beta_{A,A} > 0, \quad \beta_{A,B} > 0, \quad \beta_{A,C} > 0, \quad \beta_{A,D} > 0 \tag{6}$$

Consider, for example, $\beta_{A,B}$. The corresponding marginal
model estimates the probability of a transition to B,
depending on whether or not the starting state is A; these
estimates correspond to $P(B \mid A)$ and $P(B \mid \bar{A})$, respectively.
The inequalities in (6) can then be interpreted as follows.

$$
\begin{aligned}
P(A \mid A) &> P(A \mid \bar{A}), & P(B \mid A) &> P(B \mid \bar{A}), \\
P(C \mid A) &> P(C \mid \bar{A}), & P(D \mid A) &> P(D \mid \bar{A})
\end{aligned}
\tag{7}
$$

Next, consider the following two equalities.

$$
\begin{aligned}
P(E \mid A) &= 1 - P(A \mid A) - P(B \mid A) - P(C \mid A) - P(D \mid A) \\
P(E \mid \bar{A}) &= 1 - P(A \mid \bar{A}) - P(B \mid \bar{A}) - P(C \mid \bar{A}) - P(D \mid \bar{A})
\end{aligned}
\tag{8}
$$

Combining (7) and (8), it follows that $P(E \mid A) < P(E \mid \bar{A})$,
or, equivalently, that $\beta_{A,E} < 0$. What this argument
illustrates is that it’s not possible (or, at least, it’s highly
unlikely) for $\beta_{A,E}$ to be positive when the other four
coefficients are positive, which means that the corresponding
statistical tests are not completely independent of each other.


Next, suppose we are in the situation of removing self-transitions 


and applying L*; thus, in what follows assume we are inter- 
ested in transitions from A to B and that, following (4) in 
Definition 1, all transitions to A have been removed from 
our sequence. Suppose the following inequalities hold. 

$$
P(B \mid A) > P(B), \quad P(C \mid A) > P(C), \quad P(D \mid A) > P(D)
\tag{9}
$$

Figure 5: Comparison of the estimated FDR for the BH and BY procedures, using a value of $\gamma = 0$ and the marginal model
method. Vertical lines represent the 99% confidence interval for each estimated FDR value.

Figure 6: Comparison of the estimated FDR for the BH and BY procedures, using a value of $\gamma = 0.05$ and the marginal model
method. Vertical lines represent the 99% confidence interval for each estimated FDR value.


Consider the equalities

$$
\begin{aligned}
P(E \mid A) &= 1 - P(B \mid A) - P(C \mid A) - P(D \mid A) \\
P(E) &= 1 - P(B) - P(C) - P(D),
\end{aligned}
\tag{10}
$$

where we’re using the fact that, as we are removing transitions
to A, $P(A \mid A) = 0$ and $P(A) = 0$. Combining (9) and
(10), it follows that $P(E \mid A) < P(E)$. Thus, it’s not possible
for all four of the conditional probabilities to be larger
than the base probabilities; in turn, this means that at least
one of the L* values must be negative. As such, it follows
that the corresponding statistical tests are not completely
independent of each other.


6. DISCUSSION 


In this paper, we investigated the validity of methods used to 
adjust for false discoveries when performing multiple com- 
parisons. In two scenarios relevant to EDM research, we 
evaluated the performance of the commonly used BH proce- 
dure in relation to an alternate method—the BY procedure— 
that is more general and is valid to use when the assump- 
tions of the BH procedure cannot be met. Our first set 
of experiments looked at the performance of these proce- 
dures when used with pairwise comparisons of classification 
models on a fixed set of test data. In all our experiments, 
using both accuracy and AUROC as our performance metrics,
the BH procedure controlled the FDR at the expected
level. These results are consistent with previous studies in- 
vestigating pairwise comparisons, where in all cases the BH 
procedure properly controlled the FDR [21, 38, 39]. Com- 
bining these previous results with the experiments in this 
study, our current view is that the use of the BH procedure
appears justified in this scenario; that is, one can
reasonably expect the BH procedure to properly control the 
FDR when performing pairwise comparisons of classifiers on 
a fixed set of test data. 


Contrast this with our investigation on sequential data, where 
we observed that the BH procedure, when combined with 
either the marginal model procedure or L*, did not control 
the FDR at the expected level—this happened with various 
experimental conditions and for various threshold values q. 
The results could be an indication that the theoretical condi- 
tions for applying the BH procedure might not be satisfied in 
these situations. Combined with the fact that various issues 
involving the analysis of state transitions have recently come 
to light [7, 18, 19, 24, 25], we believe that using the more 
conservative BY procedure is justified, particularly when the 
analysis involves a small number of sequences. To compensate
for the fact that it is more conservative, when applying
the BY procedure we suggest the use of a larger value of q,
such as 0.1.

Figure 7: Comparison of the estimated FDR for the BH and BY procedures, using a value of $\gamma = 0$ and the L* statistic.
Vertical lines represent the 99% confidence interval for each estimated FDR value.

Figure 8: Comparison of the estimated FDR for the BH and BY procedures, using a value of $\gamma = 0.05$ and the L* statistic.
Vertical lines represent the 99% confidence interval for each estimated FDR value.


More generally, it’s worth noting that there are many ex- 
amples where the BH procedure performs well without any 
theoretical guarantees [14, 22]. Thus, for situations in which 
the BH procedure has not been theoretically or empirically 
vetted, we offer a couple of suggestions. First, whenever pos- 
sible, conducting a simulation study may be helpful; as seen 
in this work, the results could give evidence for or against 
the usage of the BH procedure. Failing that, and if there 
is good reason to doubt the validity of using the BH proce- 
dure, we suggest that the BY procedure be considered as a 
possible alternative. In these cases, a higher value for q may 
be justified in order to compensate for the more restrictive 
nature of the BY procedure, and this decision could be made 
based on the context of the study. For instance, in studies 
that are exploratory in nature or have small sample sizes, 
the loss of statistical power might be a larger concern; thus, 
the BY procedure using a threshold of 0.1 or larger may 
be appropriate. Whereas, in an experimental study looking 
for conclusive evidence, it may be preferable to use the BY 
procedure with a smaller value of q. 
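
To make the difference between the two procedures concrete, the following is a minimal sketch of the standard Benjamini-Hochberg and Benjamini-Yekutieli step-up rules. The function name `fdr_rejections` and the toy p-values are illustrative assumptions, not taken from the experiments in this paper.

```python
from typing import List


def fdr_rejections(p_values: List[float], q: float, method: str = "BH") -> List[bool]:
    """Return, for each hypothesis, whether it is rejected at FDR level q.

    method="BH": Benjamini-Hochberg step-up procedure.
    method="BY": Benjamini-Yekutieli variant, which divides q by the
                 harmonic sum c(m) = 1 + 1/2 + ... + 1/m.
    """
    if method not in ("BH", "BY"):
        raise ValueError("method must be 'BH' or 'BY'")
    m = len(p_values)
    if m == 0:
        return []
    if method == "BY":
        q = q / sum(1.0 / i for i in range(1, m + 1))

    # Sort indices by p-value and find the largest rank k with p_(k) <= k*q/m.
    order = sorted(range(m), key=lambda i: p_values[i])
    k_max = 0
    for rank, idx in enumerate(order, start=1):
        if p_values[idx] <= rank * q / m:
            k_max = rank

    # Reject the k_max hypotheses with the smallest p-values.
    rejected = [False] * m
    for idx in order[:k_max]:
        rejected[idx] = True
    return rejected


# Toy p-values, invented for illustration only.
toy_p = [0.5, 0.005, 0.035, 0.011, 0.02]
bh_005 = fdr_rejections(toy_p, 0.05, "BH")
by_005 = fdr_rejections(toy_p, 0.05, "BY")
by_010 = fdr_rejections(toy_p, 0.10, "BY")
```

On these toy values BH at q = 0.05 rejects four hypotheses, BY at q = 0.05 rejects none, and BY at q = 0.1 again rejects four, which illustrates why a larger q may be reasonable when switching to the more conservative procedure.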


With regard to future work in this area, it would be of interest
to more completely understand why the BH procedure fails
to properly control the FDR in our simulations with sequential
data. While we presented an argument in Section 5.4
that showed the statistical tests are not independent, it’s
an open question whether this argument can be extended to
rigorously show that the assumptions of the BH procedure 
are violated—we are currently looking at this in more de- 
tail. Furthermore, it’s possible that other elements may also 
be at play. For example, as discussed previously there are 
known issues with several existing methods commonly used 
to evaluate state transitions. While the methods we used 
in this study were originally developed in response to these 
problems [24, 25], it’s possible that these existing issues, or 
perhaps even new ones, are a factor; thus, further adjust- 
ments to the marginal model and L* methods could lead to 
improved control of the FDR with the BH procedure. 


There exist other directions for future work that we are cur- 
rently exploring. First, as the literature on multiple com- 
parisons and controlling the FDR is actively growing, many 
methods have been developed over the years. Thus, while 
the BH and BY procedures are arguably the most notable of 
the FDR controlling procedures, it would be worthwhile to 
evaluate some of the newer alternatives, especially for the 
analysis of state transitions. Second, our analyses in this 
work focused exclusively on false discoveries (Type I errors) 
and did not consider false negatives (Type II errors). As 
such, in future work we aim to explicitly examine the interaction
between these two types of errors with respect to the
BH and BY procedures and EDM research.

ABSTRACT


We address the problem of predicting the correctness of 
the student’s response on the next exam question based on 
their previous interactions in the course of their learning 
and evaluation process. We model the student performance 
as a dynamic problem and compare the two major classes 
of dynamic neural architectures for its solution, namely the 
finite-memory Time Delay Neural Networks (TDNN) and 
the potentially infinite-memory Recurrent Neural Networks 
(RNN). Since the next response is a function of the knowl- 
edge state of the student and this, in turn, is a function of 
their previous responses and the skills associated with the 
previous questions, we propose a two-part network architec- 
ture. The first part employs a dynamic neural network (ei- 
ther TDNN or RNN) to trace the student knowledge state. 
The second part applies on top of the dynamic part and it 
is a multi-layer feed-forward network which completes the 
classification task of predicting the student response based 
on our estimate of the student knowledge state. Both input 
skills and previous responses are encoded using different em- 
beddings. Regarding the skill embeddings we tried two dif- 
ferent initialization schemes using (a) random vectors and 
(b) pretrained vectors matching the textual descriptions of 
the skills. Our experiments show that the performance of the 
RNN approach is better compared to the TDNN approach in 
all datasets that we have used. Also, we show that our RNN 
architecture outperforms the state-of-the-art models in four 
out of five datasets. It is worth noting that the TDNN approach
also outperforms the state-of-the-art models in four
out of five datasets, although it is slightly worse than our
proposed RNN approach. Finally, contrary to our expectations,
we find that the initialization of skill embeddings
using pretrained vectors offers practically no advantage over
random initialization.

1. INTRODUCTION


Knowledge is distinguished by the ability to evolve over 
time. This progression of knowledge is usually incremen- 
tal and its formation is related to the cognitive areas being 
studied. Knowledge Tracing (KT), defined as
the task of predicting students’ performance, has attracted
the interest of many researchers in recent decades [4]. The
Knowledge State (KS) of a student is the degree of his or
her mastery of the Knowledge Components (KC) in a certain
domain, for example “Algebra” or “Physics”. A knowledge
component generally refers to a learnable entity, such as a 
concept or a skill, that can be used alone or in combination 
with other KCs in order to solve an exercise or a problem 
[9]. Knowledge Tracing is the process of modeling and as- 
sessing a student’s KS in order to predict his or her ability 
to answer the next problem correctly. The estimation of the 
student’s knowledge state is useful for improving the educa- 
tional process by identifying the level of his/her understand- 
ing of the various knowledge components. By exploiting this 
information it is possible to suggest appropriate educational 
material to cover the student’s weaknesses and thus maxi- 
mize the learning outcome. 


The main problem of Knowledge Tracing is the efficient man- 
agement of the responses over time. One of the factors which 
add complexity to the problem of KT is the student-specific 
learning pace. The knowledge acquisition may differ from 
person to person and may also be influenced by already ex- 
isting knowledge. More specifically, KT is predominantly 
considered as a supervised sequence learning problem where
the goal is to predict the probability that a student will
answer future exercises correctly, given his or her history
of interactions with previous tests. Thus, the prediction of
the correctness of the answer is based on the history of the
student’s answers in combination with the skill that is
examined at the current time step.


Mathematically, the KT task is expressed as the probability
$P(r_{t+1} = 1 \mid q_{t+1}, X_t)$ that the student will offer the correct
response in the next interaction $x_{t+1}$, where the student’s
learning activities are represented as a sequence of interactions
$X_t = \{x_1, x_2, x_3, \ldots, x_t\}$ over time T. The interaction
$x_t$ consists of a tuple $(q_t, r_t)$ which represents the question
$q_t$ being answered at time t and the student response
$r_t$ to the question. Without loss of generality, we shall assume
that knowledge components are represented by skills
from a set $S = \{s_1, s_2, \ldots, s_m\}$. One simplifying assumption,
used by many authors [24], is that every question in the set
$Q = \{q_1, q_2, \ldots, q_T\}$ is related to a unique skill from S. Then
the knowledge levels of the student for each one of the skills
in S compose his or her knowledge state.
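
As a rough illustration of this notation, the sketch below represents an interaction as a (skill, response) pair and computes a deliberately naive estimate of the probability of a correct next response. The names `Interaction` and `naive_p_correct` are hypothetical, and the smoothed per-skill frequency is only a placeholder, not the model proposed in this paper.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class Interaction:
    """One interaction x_t = (q_t, r_t)."""
    skill: int      # q_t: id of the skill tested by the question
    response: int   # r_t: 1 if answered correctly, 0 otherwise


def naive_p_correct(history: List[Interaction], next_skill: int) -> float:
    """Toy estimate of P(r_{t+1} = 1 | q_{t+1}, X_t).

    Uses the Laplace-smoothed fraction of correct past responses on the
    same skill. This is a placeholder baseline for illustration only.
    """
    same = [x.response for x in history if x.skill == next_skill]
    return (sum(same) + 1) / (len(same) + 2)


# Hypothetical history X_t with three interactions.
X_t = [Interaction(3, 1), Interaction(3, 1), Interaction(5, 0)]
p_seen = naive_p_correct(X_t, next_skill=3)    # skill practised twice, both correct
p_unseen = naive_p_correct(X_t, next_skill=7)  # skill never seen before
```

A dynamic model replaces this frequency count with a learned function of the whole history.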


The dynamic nature of Knowledge Tracing leads to approa- 
ches that have the ability to model time-series or sequential 
data. In this work we propose two dynamic machine learning 
models that are implemented by time-dependent methods, 
specifically recurrent and time delay neural networks. Our 
models outperform the current state-of-the-art approaches 
in four out of five benchmark datasets that we have studied. 
The proposed models differ from the existing ones in two 
main architectural aspects: 


• we find that attention does not help improve the performance
and therefore we make no use of attention
layers


• we experiment with and compare between two different
skill embedding types: (a) initialized by pretrained
embeddings of the textual descriptions of the
skill names using standard methods such as Word2Vec
and FastText and (b) randomly initialized embeddings
based on skill ids


The rest of the paper is organized as follows. Section 2 re- 
views the related works on KT and the existing models for 
student performance prediction. In Section 3 we present our 
proposed models and describe their architecture and characteristics.
The datasets we prepared and used are presented
in Section 4, while the experimental setup and the results
are explained in Section 5. Finally, Section 6 concludes this
work and discusses future work and extensions of the
research.


2. RELATED WORKS 


The problem of knowledge tracing is dynamic as student 
knowledge is constantly changing over time. Thus, a variety 
of methods, highly structured or dynamic, have been proposed
to predict students’ performance. One of the earlier
methods is Bayesian Knowledge Tracing (BKT) [4], which
models the problem as a Hidden Markov Model in order to
predict the sequence of outcomes for a given learner. The
Performance Factors Analysis Model (PFA) [14] was proposed to
tackle the knowledge tracing task by modifying the Learning
Factor Analysis model. It estimates the probability that a
student will answer a question correctly by maximizing the 
likelihood of a logistic regression model. The features used 
in the PFA model, although interpretable, are relatively sim- 
ple and designed by hand, and may not adequately represent 
the students’ knowledge state [23]. 
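
For orientation, the Bayesian Knowledge Tracing update mentioned above can be sketched as follows. This assumes the commonly used four-parameter form (prior mastery, learn, guess, and slip probabilities), which is not spelled out in the text; the function names and parameter values are illustrative only.

```python
def bkt_predict(p_known: float, guess: float, slip: float) -> float:
    """Probability of a correct answer given the current mastery estimate."""
    return p_known * (1.0 - slip) + (1.0 - p_known) * guess


def bkt_update(p_known: float, correct: bool, learn: float,
               guess: float, slip: float) -> float:
    """Posterior mastery after one observation, followed by one learning step."""
    if correct:
        num = p_known * (1.0 - slip)
        den = num + (1.0 - p_known) * guess
    else:
        num = p_known * slip
        den = num + (1.0 - p_known) * (1.0 - guess)
    posterior = num / den
    # The student may also learn the skill during this practice opportunity.
    return posterior + (1.0 - posterior) * learn


# Illustrative parameter values, not fitted to any dataset.
p_correct = bkt_predict(p_known=0.5, guess=0.2, slip=0.1)
p_after = bkt_update(p_known=0.5, correct=True, learn=0.1, guess=0.2, slip=0.1)
```

The mastery estimate is a single hand-structured latent variable per skill, which is the contrast with the learned knowledge state used by the neural models discussed next.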


Deep Knowledge Tracing (DKT) [15] is the first dynamic 
model proposed in the literature utilizing recurrent neural 
networks (RNN) and specifically the Long Short-Term Memory
(LSTM) model [6] to track student knowledge. It uses
one-hot encoded skill tags and associated responses as inputs
and it trains the neural network to predict the next student
response. The hidden state of the LSTM can be considered 
as the latent knowledge state of a student and can carry the 
information of the past interactions to the output layer. The 
output layer of the model computes the probability of the 
student answering correctly a question relating to a specific 
Knowledge Component. 
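
A one-hot input of this kind can be sketched as below. The vector length of twice the number of skills and the index layout are assumptions made for this illustration; the text above does not fix them, and `one_hot_interaction` is a hypothetical helper name.

```python
from typing import List


def one_hot_interaction(skill: int, correct: int, num_skills: int) -> List[int]:
    """Encode (skill, response) as a one-hot vector of length 2 * num_skills.

    Index layout (an assumption for this sketch): positions
    [0, num_skills) mark an incorrect answer on the skill, positions
    [num_skills, 2 * num_skills) mark a correct one.
    """
    if not 0 <= skill < num_skills or correct not in (0, 1):
        raise ValueError("skill or response out of range")
    vec = [0] * (2 * num_skills)
    vec[skill + correct * num_skills] = 1
    return vec


correct_on_skill_1 = one_hot_interaction(skill=1, correct=1, num_skills=3)
wrong_on_skill_2 = one_hot_interaction(skill=2, correct=0, num_skills=3)
```

The vector grows linearly with the number of skills, which is one motivation for replacing it with learned embeddings.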


Another approach for predicting student performance is the 
Dynamic Key-Value Memory Network (DKVMN) [24] which 
relies on an extension of memory networks proposed in [12]. 
The model tries to capture the relationship between differ- 
ent concepts. The DKVMN model outperforms DKT us- 
ing memory slots as key and value components to encode 
the knowledge state of students. Learning or forgetting of
a particular skill is stored in those components and controlled
by read and write operations through the Least Recently
Used Access (LRUA) attention mechanism [16]. The
key component is responsible for storing the concepts and is 
fixed during testing while the value component is updated 
when a concept state changes. The latter means that when 
a student acquires a concept in a test the value component 
is updated based on the correlation between exercises and 
the corresponding concept. 


The Deep-IRT model [23] is the newest approach that ex- 
tends the DKVMN model. The author combined the capa- 
bilities of DKVMN with the Item Response Theory (IRT) 
[5] in order to measure both student ability and question dif- 
ficulty. At the same time, another model, named Sequential 
Key-Value Memory Networks (SKVMN) [1], tried to over- 
come the problem of DKVMN to capture long term depen- 
dencies in the sequences of exercises and generally in sequen- 
tial data. This model combines the DKVMN mechanism 
with the Hop-LSTM, a variation of LSTM architecture and 
has the ability to discover sequential dependencies among 
exercises, but it skips some LSTM cells to approach previ- 
ous concepts that are considered relevant. Finally, another 
newly proposed model is Self Attentive Knowledge Tracing 
(SAKT) [13]. SAKT utilizes a self-attention mechanism and 
mainly consists of three layers: an embedding layer for in- 
teractions and questions followed by a Multi-Head Attention 
layer [19] and a feed-forward layer for student response pre- 
diction. 


The above models either use simple features (e.g. PFA) 
or they use machine learning approaches such as key-value 
memory networks or attention mechanisms that may add 
significant complexity. However, we will show that similar
and often, in fact, better performance can be achieved by 
simpler dynamic models combining embeddings and recur- 
rent and/or time-delay feed-forward networks as proposed 
next. 


3. PROPOSED APPROACH 
3.1 Dynamic Models 


As noted in the relevant literature, knowledge change
over time is often modeled by dynamic neural networks. The 
dynamic models produce output based on a time window, 
called “context window”, that contains the recent history of 
inputs and/or outputs. 


There are two types of dynamic neural networks (Figure 1):
(a) Time-Delay Neural Networks (TDNN), with only feed-forward
connections and finite memory of length L equal to
the length of the context window, and (b) Recurrent Neural
Networks (RNN) with feedback connections that can
have potentially infinite memory although, practically, their
memory length is dictated by a forgetting factor parameter.
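
The distinction between finite and potentially infinite memory can be illustrated with two toy scalar filters. These are not neural networks and are not the models used in this paper; they are a hedged sketch in which `finite_memory` plays the role of a context window of length L and `recursive_memory` uses a hypothetical forgetting factor `forget`.

```python
from typing import List


def finite_memory(inputs: List[float], L: int) -> List[float]:
    """TDNN-like toy: each output depends on the last L inputs only."""
    out = []
    for t in range(len(inputs)):
        window = inputs[max(0, t - L + 1): t + 1]
        out.append(sum(window) / L)
    return out


def recursive_memory(inputs: List[float], forget: float) -> List[float]:
    """RNN-like toy: state_t = forget * state_{t-1} + (1 - forget) * input_t."""
    out, state = [], 0.0
    for x in inputs:
        state = forget * state + (1.0 - forget) * x
        out.append(state)
    return out


# Response of both toys to a single impulse at t = 0.
impulse = [1.0, 0.0, 0.0, 0.0, 0.0, 0.0]
fir = finite_memory(impulse, L=3)
iir = recursive_memory(impulse, forget=0.5)
```

The finite-memory output drops to exactly zero once the impulse leaves the window, while the recursive output decays geometrically at a rate set by the forgetting factor and never reaches zero.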


[Figure 1 diagram: (a) network with feed-forward connections only; (b) network with feed-forward connections and a feedback connection]

Figure 1: Dynamic model architectures: (a) Time-Delay
Neural Network (b) Recurrent Neural Network.


3.2 The Proposed Models 


We approach the task of predicting the student response
(0=wrong, 1=correct) on a question involving a specific skill
as a dynamic binary classification problem. In general, we
view the response $r_t$ as a function of the previous student
interactions:

$$
r_t = h(q_t, q_{t-1}, q_{t-2}, \ldots, r_{t-1}, r_{t-2}, \ldots) + \epsilon_t
\tag{1}
$$

where $q_t$ is the skill tested at time t and $\epsilon_t$ is the prediction
error. The response is therefore a function of the current and
the previous tested skills $\{q_t, q_{t-1}, q_{t-2}, \ldots\}$, as well as the
previous responses $\{r_{t-1}, r_{t-2}, \ldots\}$ given by the student.


We implement h as a dynamic neural model. Our proposed
general architecture is shown in Figure 2. The inputs are
the skill and response sequences $\{q_t\}$, $\{r_t\}$ collected during
a time window of length L prior to time t. Note that the
skill sequence includes the current skill $q_t$ but the response
sequence does not contain the current response, which is
actually what we want to predict. The architecture consists of
two main parts:


• The Encoding sub-network. It is used to represent
the response and skill input data using different embeddings.
Clearly, embeddings are useful for encoding
skills since skill ids are categorical variables. We found
that using embeddings to encode responses is also very
beneficial. The details of the embeddings initialization
and usage are described in the next section.


• The Tracing sub-network. This first estimates the
knowledge state of the student and then uses it to predict
his/her response. Our model function consists of
two parts: (i) the Knowledge-Tracing part, represented
by the dynamic model f, which predicts the student
knowledge state $v_t$, and (ii) the classification part g,
which predicts the student response based on the estimated
knowledge state:

$$
v_t = f(q_t, q_{t-1}, q_{t-2}, \ldots, r_{t-1}, r_{t-2}, \ldots)
\tag{2}
$$

$$
\hat{r}_t = g(v_t)
\tag{3}
$$

Depending on the memory length, we obtain two categories
of models:

(a) models based on RNN networks which can potentially
have infinite memory. In this case the KT
model is recurrent:

$$
v_t = f(v_{t-1}, q_t, q_{t-1}, \ldots, q_{t-L}, r_{t-1}, \ldots, r_{t-L})
$$

(b) models based on TDNN networks which have finite
memory of length L. In this case the KT
model has finite impulse response L:

$$
v_t = f(q_t, q_{t-1}, \ldots, q_{t-L}, r_{t-1}, \ldots, r_{t-L})
$$


Although RNNs have been used in the relevant literature, it 
is noteworthy that TDNN approaches have not been investi- 
gated in the context of knowledge tracing. The classification 
part is modeled by a fully-connected feed-forward network 
with a single output unit. 


[Figure 2 diagram: two Encoding blocks feed the Tracing Sub-net, which is followed by Classification]

Figure 2: General proposed architecture. The dynamic
model can be either a Recurrent Neural Network
(with a feedback connection from the output
of the dynamic part into the model input) or a Time
Delay Neural Network (without feedback connection).


We investigated two different architectures: one based on 
recurrent neural networks and another based on time delay 
neural networks. The details of each proposed model archi- 
tecture are described below. 


3.3 Encoding Sub-network

The first part in all our proposed models consists of two
parallel embedding layers with dimensions $d_q$ and $d_r$,
respectively, which encode the tested skills and the responses
given by the student. During model training the weights of 
the Embedding layers are updated. The response embed- 
ding vectors are initialized randomly. The skill embedding 
vectors, on the other hand, are initialized either randomly 
or using pretrained data. In the latter case we use pre- 
trained vectors corresponding to the skill names obtained 
from Word2Vec [11] or FastText [7] methods. 


A 1D spatial dropout layer [18] is added after each Embedding
layer. We added spatial dropout because overfitting was
observed on each validation set during the first epochs.
We postulated that correlations among the skill name
embeddings, which might not actually exist, confused the model.


3.4 Tracing Sub-network 

We experimented with two types of main dynamic sub-net- 
works, namely Recurrent Neural Networks and Time Delay 
Neural Networks. These two approaches are described next. 


3.4.1 RNN Approach: Bi-GRU Model 
The model architecture based on the RNN method for the 
knowledge tracing task is shown in Figure 3. 


[Figure 3 diagram: questions $q_t, \ldots, q_{t-L}$ → Skill Embeddings → Spatial Dropout → Convolutional; responses $r_{t-1}, \ldots, r_{t-L}$ → Response Embeddings → Spatial Dropout → Convolutional; both branches → Bidirectional-GRU → Gaussian Dropout → $v_t$ → Dense → output $\hat{r}_t$]

Figure 3: Bi-GRU model


The Spatial Dropout rate following the input embedding
layers is 0.2 for most of the datasets used. Next, we feed the
skill and response input branches into a Convolutional
layer consisting of 100 filters, with kernel size 3, stride 1,
and ReLU activation function. The Convolutional layer acts
as a projection mechanism that reduces the input dimensions
from the previous Embedding layer. This is found to
help alleviate the overfitting problem. To the best of our
knowledge, Convolutional layers have not been used in
previously proposed neural models for this task. The two input
branches are then concatenated to feed a Bidirectional
Gated Recurrent Unit (GRU) layer with 64 units [3]. Batch
normalization and ReLU activation layers are applied between
the convolutional and concatenation layers. This structure
resulted from extensive experiments with other popular
recurrent models such as LSTM, plain GRU and also
bi-directional versions of those models, and we found
the proposed architecture to be the most efficient one.


On top of the RNN layer we append a fully connected sub-network
consisting of three dense layers with 50 units, 25 units,
and one output unit, respectively. The first two dense layers
have a ReLU activation function while the last one has a
sigmoid activation, which is used to make the final prediction
($0 < \hat{r}_t < 1$).


3.4.2 TDNN Approach


In our TDNN model (Figure 4) we add a Convolutional layer 
after each embedding layer with 50 filters and kernel size 
equal to 5. 


[Figure 4 diagram: questions $q_t, \ldots, q_{t-L}$ → Skill Embeddings → Spatial Dropout → Convolutional; responses $r_{t-1}, \ldots, r_{t-L}$ → Response Embeddings → Spatial Dropout → Convolutional; both branches → Gaussian Dropout → $v_t$ → output $\hat{r}_t$]

Figure 4: TDNN model


Batch normalization is used before the ReLU activation is 
applied. As with the RNN model, the two input branches 
are concatenated to feed the classification sub-network. It 
consists of four dense layers with 20, 15, 10, and 5 units 
respectively, using the ReLU activation function. This fun- 
nel schema of hidden layers (starting with wider layers and 
continuing with narrower ones) has helped achieve better 
results for all datasets we have experimented with. In the 
beginning of the classification sub-network we insert a Gaussian
Dropout layer [17] which multiplies neuron activations
with a Gaussian random variable of mean value 1. This has
been shown to work as well as the classical Bernoulli noise
dropout, and in our case even better.
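
The multiplicative noise described here can be sketched in a few lines. The variance rate / (1 - rate) is chosen so that it matches the variance of classical inverted Bernoulli dropout with drop probability `rate`; treating the rate this way, and the helper name `gaussian_dropout`, are assumptions of this sketch rather than details given in the text.

```python
import random
from typing import List, Optional


def gaussian_dropout(activations: List[float], rate: float,
                     rng: Optional[random.Random] = None) -> List[float]:
    """Multiply each activation by noise drawn from N(1, rate / (1 - rate)).

    The noise has mean 1, so the expected activation is unchanged and no
    rescaling is needed at inference time.
    """
    if not 0.0 <= rate < 1.0:
        raise ValueError("rate must be in [0, 1)")
    rng = rng or random.Random()
    std = (rate / (1.0 - rate)) ** 0.5
    return [a * rng.gauss(1.0, std) for a in activations]


# With rate 0 the layer is the identity; with rate 0.2 the mean stays near 1.
unchanged = gaussian_dropout([1.0, 2.0], rate=0.0, rng=random.Random(0))
noisy = gaussian_dropout([1.0] * 20000, rate=0.2, rng=random.Random(0))
noisy_mean = sum(noisy) / len(noisy)
```

Unlike Bernoulli dropout, no activation is set exactly to zero; every unit is perturbed around its original value.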


4. DATASETS 


We tested our models using four popular datasets from the
ASSISTments online tutoring platform. Three of them,
“ASSISTment09”, “ASSISTment09 corrected”, and
“ASSISTment12”¹,² were provided by the above platform. The fourth
dataset, named “ASSISTment17”, was obtained from the 2017
Data Mining competition page³. Finally, a fifth dataset,
“FSAI-F1toF3”, provided by “Find Solution Ai Limited”, was
also used in our experiments. It is collected using data
from the 4LittleTrees⁴ adaptive learning application.

Table 1: Datasets Overview.

Dataset                | Skills | Students | Responses | Baseline Accuracy
ASSISTment09           | 110    | 4,151    | 325,637   | 65.84%
ASSISTment09 corrected | 101    | 4,151    | 274,590   | 66.31%
ASSISTment12           | 196    | 28,834   | 2,036,080 | 69.65%
ASSISTment17           | 101    | 1,709    | 864,713   | 62.67%
FSAI-F1toF3            | 99     | 310      | 51,283    | 52.98%


4.1 Datasets Descriptions 

The ASSISTments datasets contain data from student tests
on mathematical problems [2] and the content is organized in
columns. Each line records one student interaction, and
there are one or more interactions recorded for each
student. We take into account the information concerning
the responses of students to questions related to a skill.
Thus, we use the following columns: “user_id”, “skill_id”,
“skill_name”, and “correct”. The “skill_name” column contains a
verbal description of the skill tested. The “correct” column
contains the values of the students’ responses, which are either
1 (for correct) or 0 (for wrong).
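
Reading these four columns into per-student sequences might look like the sketch below. It uses only the standard `csv` module; the sample rows are invented for illustration, the column name `skill_name` is written with an underscore as an assumption, and `load_interactions` is a hypothetical helper rather than the authors' loader.

```python
import csv
import io
from collections import defaultdict
from typing import Dict, List, Tuple


def load_interactions(csv_file) -> Dict[str, List[Tuple[str, int]]]:
    """Group (skill_id, correct) pairs by user_id, preserving row order."""
    per_student: Dict[str, List[Tuple[str, int]]] = defaultdict(list)
    for row in csv.DictReader(csv_file):
        per_student[row["user_id"]].append((row["skill_id"], int(row["correct"])))
    return dict(per_student)


# Hypothetical sample rows, invented for illustration only.
SAMPLE = """user_id,skill_id,skill_name,correct
u1,10,Addition,1
u1,10,Addition,0
u2,12,Subtraction,1
"""
sequences = load_interactions(io.StringIO(SAMPLE))
```

Row order is preserved within each student, which matters because the models consume the interactions as a time-ordered sequence.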


The original “ASSISTment09” dataset contains 525,534 student
responses. It has been used extensively for the KT task
by several researchers but, according to [2], data quality
issues have been detected concerning duplicate rows. In
our work we used the “preprocessed ASSISTment09” dataset
found in the DKVMN⁵ and Deep-IRT⁶ GitHub repositories. In this
dataset the duplicate rows and the empty field values were
cleaned, leaving 4,151 unique students
with 325,637 total responses and 110 unique skills (see Table 1).


Even after this cleaning there are still some problems, such as
duplicate skill ids for the same skill name. These problems
have been corrected in the “ASSISTment09 corrected” dataset.
This dataset contains 346,860 student interactions and has
recently been used in [21].


The “ASSISTment12” dataset contains students’ data up to
the school year 2012-2013. The initial dataset contains
6,123,270 responses and 198 skills. Some of the skills have
the same skill name but different skill ids. The total number
of skill ids is 265. The “ASSISTment17” dataset contains
942,816 student responses and 101 skills.


¹https://sites.google.com/site/assistmentsdata/home/assistment-2009-2010-data/skill-builder-data-2009-2010

²https://sites.google.com/site/assistmentsdata/home/2012-13-school-data-with-affect

³https://sites.google.com/view/assistmentsdatamining/data-mining-competition-2017

⁴https://www.4littletrees.com

⁵https://github.com/jennyzhang0215/DKVMN

⁶https://github.com/ckyeungac/DeepIRT


Finally, the “FSAI-F1toF3” dataset is the smallest dataset
we used. It involves responses to mathematical problems
from 7th grade to 9th grade Hong Kong students and consists
of 51,283 student responses from 310 students on 99
skills and 2,266 questions. As is commonly the case in
most studies using this dataset, we have used the question
tag as the model input $q_t$.


4.2 Data Preprocessing 

No preprocessing was performed on the “ASSISTment09”
and “FSAI-F1toF3” datasets. For the remaining datasets
we followed three preparation steps.


First, the skill ids were repaired by replacement. In particular,
the “ASSISTment09 corrected” dataset contained
skill ids of the form “skill1_skill2” and “skill1_skill2_skill3”
which correspond to the same skill names, so we merged
them into the first skill id, found before the underscore. In
other words, the skill “10_138” was replaced with skill “10”
and so on. Moreover, a few misspellings were corrected,
and the punctuation found in three skill
names was converted to the corresponding words. For example,
in the skill name “Parts of a Polnomial Terms Coefficient
Monomial Exponent Variable” we corrected
“Polnomial” to “Polynomial”. Also, in the skill name “Order
of Operations +,-,/,*() positive reals” we replaced the
symbols “+,-,/,*()” with the words that express these symbols,
i.e. “addition subtraction division multiplication parentheses”.
This was preferred over removing the punctuation
because the datasets refer to
mathematical methods and operations, and without these symbols
we would lose the meaning of each skill. A similar procedure
was followed for the “ASSISTment12” dataset. Furthermore,
trailing spaces after some skill names were removed, e.g.
the skill name “Pattern Finding ” became “Pattern Finding”.
In the “ASSISTment17” dataset we came across skill
names such as “application: multi-column subtraction” and
corrected them by replacing the punctuation marks, giving
“application multi column subtraction”. These text preparation
operations were made to ease the generation of word embeddings
from the skill name descriptions. In addition, in the
“ASSISTment17” dataset, problem ids are used instead of
skill ids. We matched and replaced the problem ids with
the corresponding skill ids for uniformity across the
datasets.
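
The cleaning steps in this paragraph could be approximated as follows. This is a sketch under stated assumptions: the helper names, the operator-to-word map, and the rule that a hyphen between two letters is a word joiner rather than a minus sign are all choices made here to reproduce the examples in the text, not the authors' actual code.

```python
import re

# Operator symbols and the words used for them in the example above.
OPERATOR_WORDS = {"+": "addition", "-": "subtraction",
                  "/": "division", "*": "multiplication"}


def merge_skill_id(skill_id: str) -> str:
    """Keep only the first id of a compound id such as 'a_b' or 'a_b_c'."""
    return skill_id.split("_")[0]


def normalize_skill_name(name: str) -> str:
    """Turn operator symbols into words and strip remaining punctuation."""
    text = name.replace("()", " parentheses ")
    # A hyphen between two letters joins words ("multi-column"): use a space.
    text = re.sub(r"(?<=[A-Za-z])-(?=[A-Za-z])", " ", text)
    for symbol, word in OPERATOR_WORDS.items():
        text = text.replace(symbol, f" {word} ")
    text = re.sub(r"[^\w\s]", " ", text)   # drop any other punctuation
    return " ".join(text.split())          # collapse and trim whitespace


ops = normalize_skill_name("Order of Operations +,-,/,*() positive reals")
app = normalize_skill_name("application: multi-column subtraction")
trimmed = normalize_skill_name("Pattern Finding ")
```

Spelling corrections such as “Polnomial” would need a separate, manually curated lookup and are not covered by this sketch.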


Secondly, all rows containing missing values were discarded.
After the preprocessing, the statistics of the datasets
are as given in Table 1.


Finally, we split the datasets so that 70% was used for training
and 30% for testing. Then, the training subset was further
split into five train-validation subsets using 80% for
training and 20% for validation.


5. EXPERIMENTS 


In this section we experimentally validate the effectiveness
of the proposed methods by comparing them with each other
and also with other state-of-the-art performance prediction
models. The Area Under the ROC Curve (AUC) [10] metric
is used to compare how well the models predict the
probability of a correct student response.
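
For readers unfamiliar with the metric, AUC can be computed directly from its pairwise definition, as in the sketch below. The function name `roc_auc` and the toy labels and scores are illustrative; in practice a library implementation would normally be used, and this quadratic-time version is only meant to show what is being measured.

```python
from typing import Sequence


def roc_auc(labels: Sequence[int], scores: Sequence[float]) -> float:
    """AUC as the probability that a random positive outranks a random negative.

    Ties between a positive and a negative count as half a win.
    Quadratic in the number of examples, so only suitable for small inputs.
    """
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("AUC needs at least one positive and one negative")
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


# Toy example: responses (1 = correct) and predicted probabilities.
toy_auc = roc_auc([0, 0, 1, 1], [0.1, 0.4, 0.35, 0.8])
```

Because AUC depends only on the ranking of the predicted probabilities, it is insensitive to the class imbalance reflected in the baseline accuracies of Table 1.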


The state-of-the-art knowledge tracing models we compare
against are DKT, DKVMN and Deep-IRT. We performed
the experiments for our proposed models, Bi-GRU and TDNN,
as well as for each of the previous models on all datasets,
using the code provided by the authors in their GitHub
repositories. It is worth noting that the Python GitHub code⁷
used for the DKT model experiments requires the entire dataset
file, and the train/test splitting is performed during the code
execution.


All the experiments were performed on a workstation with
the Ubuntu operating system, an Intel i5 CPU and a 16GB Titan Xp
GPU card.


5.1 Skill embeddings initialization 

As mentioned earlier, skill embeddings are initialized either
randomly or using pretrained vectors. For the initialization
with pretrained vectors we used the two methods described next.
In the first method we used the text files from Wikipedia2Vec⁸ [22],
which is based on the Word2Vec method and provides pretrained
word representation vectors for the English language in
100 and 300 dimensions. In the second method we used the
“SISTER” (SImple SenTence EmbeddeR)⁹ library to prepare the
skill name embeddings based on 300-dimensional pretrained
FastText word embeddings. Each skill name consists of one
or more words. Thus, for the Word2Vec method, the skill
name embedding vector is created by adding the word embedding
vectors, while in the case of FastText, the skill name
embedding is created by taking the average of the word
embeddings.
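The two ways of composing a skill name embedding from word vectors can be sketched as below. The three-dimensional toy vectors, the function name `skill_embedding` and the choice to skip out-of-vocabulary words are assumptions made for illustration; the real vectors come from the pretrained files and have 100 or 300 dimensions.

```python
def skill_embedding(skill_name, word_vectors, mode="sum"):
    """Compose a skill name embedding from the vectors of its words.

    mode="sum"  adds the word vectors (the Word2Vec variant in the text)
    mode="mean" averages them (the FastText variant in the text)
    Words missing from word_vectors are skipped (an assumption).
    """
    words = skill_name.lower().split()
    vecs = [word_vectors[w] for w in words if w in word_vectors]
    if not vecs:
        raise KeyError(f"no known words in skill name: {skill_name!r}")
    total = [sum(component) for component in zip(*vecs)]
    if mode == "mean":
        return [t / len(vecs) for t in total]
    return total


# toy vectors, for illustration only
toy_vectors = {
    "multi": [1.0, 0.0, 2.0],
    "column": [0.0, 1.0, 2.0],
    "subtraction": [2.0, 2.0, 2.0],
}
summed = skill_embedding("multi column subtraction", toy_vectors, mode="sum")
averaged = skill_embedding("multi column subtraction", toy_vectors, mode="mean")
```

Summation makes the norm of the embedding grow with the number of words in the skill name, whereas averaging keeps it on the scale of a single word vector.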


Specifically for the FSAI-F1toF3 dataset, the question embeddings
are initialized either randomly or using the pretrained
word representations of the corresponding skill descriptions,
employing the Wikipedia2Vec and SISTER methods as
described above. Since many questions belong to the same
skill, the corresponding rows of the embedding
matrix are initialized with the same vector.
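A minimal sketch of this initialization is given below. The mapping `question_to_skill`, the toy vectors and the function name are hypothetical and only illustrate how questions of the same skill receive identical initial rows.

```python
def init_question_matrix(question_to_skill, skill_vectors):
    """Build the initial question embedding matrix, one row per question id.

    Rows are ordered by question id. Each row is a copy of the embedding
    of the skill the question belongs to, so questions of the same skill
    start from the same vector but can diverge during training.
    """
    return [
        list(skill_vectors[question_to_skill[q]])
        for q in sorted(question_to_skill)
    ]


# hypothetical example: questions 0 and 2 belong to the same skill
question_to_skill = {0: "fractions", 1: "subtraction", 2: "fractions"}
skill_vectors = {"fractions": [0.1, 0.2], "subtraction": [0.3, 0.4]}
matrix = init_question_matrix(question_to_skill, skill_vectors)
```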


5.2 Experimental Settings 

We performed cross-validation over the 5 training and validation
set pairs in order to choose the best architecture
and parameter settings for each of the proposed
models. We then evaluated the chosen architectures on all
datasets using the train and test sets.


⁷ https://github.com/lccasagrande/Deep-Knowledge-Tracing
⁸ https://wikipedia2vec.github.io/wikipedia2vec/
⁹ https://pypi.org/project/sister/


One of the basic hyperparameters of our models, which affects
the inputs, is L, the length of the student’s interaction
history window. Each input consists of a sequence of L questions
and a sequence of L − 1 responses. We obtained the best results
with L = 50 for both the Bi-GRU and TDNN models. The batch sizes
used during training are 32 for Bi-GRU and 50 for TDNN.
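The construction of inputs from a history window of length L can be sketched as follows. The sliding window with stride 1 and the skipping of histories shorter than L are assumptions of this example (padding of short histories is not shown), and `make_windows` is a hypothetical helper name.

```python
def make_windows(questions, responses, L=50):
    """Turn one student's interaction history into training samples.

    Each sample holds L question ids (the last one is the question whose
    response is to be predicted), the L-1 responses given to the earlier
    questions, and the target response to the last question.
    """
    samples = []
    for t in range(L - 1, len(questions)):
        q_window = questions[t - L + 1 : t + 1]  # L questions
        r_window = responses[t - L + 1 : t]      # L-1 previous responses
        target = responses[t]                    # response to predict
        samples.append((q_window, r_window, target))
    return samples


# toy history with L = 3 for readability
samples = make_windows([10, 11, 12, 13], [1, 0, 1, 1], L=3)
```

The response window is one element shorter than the question window because the response to the last question is the prediction target.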


Since the pretrained word embeddings are provided in specific
dimensions, we used the same dimensions for the randomly
initialized embeddings in order to obtain comparable results. Skill
embeddings and response embeddings are set to the same
dimension.


A learning rate schedule is used for Bi-GRU, starting
from 0.001 and decreasing over the course of training,
which runs for 30 epochs. During training we
applied the following learning rate schedule depending on
the epoch number n:


$$
lr(n) = \begin{cases} r_{init} & \text{if } n < 10 \\ r_{init} \times e^{0.1\,(10 - n)} & \text{otherwise} \end{cases}
$$


For the TDNN-based model, the learning rate is
0.001 and remains constant during the whole training process of
30 epochs. We used the cross-entropy optimization criterion and
the Adam or AdaMax [8] learning algorithms.
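The two learning rate policies can be sketched as plain functions of the epoch number. The exponential form and the decay constant 0.1 used for Bi-GRU are assumptions of this sketch (they follow a commonly used scheduler pattern), and the function names are hypothetical. Only the initial rate 0.001, the 10-epoch threshold and the 30 training epochs are taken from the text.

```python
import math

R_INIT = 0.001   # initial learning rate (from the text)
N_EPOCHS = 30    # number of training epochs (from the text)


def bigru_lr(epoch, r_init=R_INIT, hold_epochs=10, decay=0.1):
    """Keep the initial rate for the first epochs, then decay exponentially.

    The exponential form and decay=0.1 are assumptions of this sketch.
    """
    if epoch < hold_epochs:
        return r_init
    return r_init * math.exp(decay * (hold_epochs - epoch))


def tdnn_lr(epoch, r_init=R_INIT):
    """Constant learning rate during the whole training process."""
    return r_init


bigru_rates = [bigru_lr(n) for n in range(N_EPOCHS)]
tdnn_rates = [tdnn_lr(n) for n in range(N_EPOCHS)]
```

Under these assumptions the Bi-GRU rate is continuous at epoch 10 and has dropped by a factor of e after ten further epochs.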


Dropout with a rate of 0.2 or 0.9 is also applied to the Bi-GRU
model, while the TDNN uses a Gaussian dropout layer with a
rate of 0.2, 0.4, 0.6 or 0.9. We observed that overfitting during model
training was reduced by adjusting the Gaussian dropout rate relative to
the dataset’s size: the smaller the dataset, the
larger the dropout rate used.
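To make the role of the Gaussian dropout rate concrete, the sketch below applies multiplicative Gaussian noise with mean 1 and standard deviation sqrt(rate / (1 - rate)), which is the usual definition of Gaussian dropout. Whether our layer uses exactly this parameterization is an assumption of the example, and the function names are hypothetical.

```python
import math
import random


def noise_std(rate):
    """Standard deviation of the multiplicative noise for a dropout rate."""
    return math.sqrt(rate / (1.0 - rate))


def gaussian_dropout(values, rate, rng):
    """Multiply every activation by noise drawn from N(1, noise_std(rate)).

    Applied only during training; at inference the activations are used
    unchanged because the noise has mean 1.
    """
    std = noise_std(rate)
    return [v * rng.gauss(1.0, std) for v in values]


rng = random.Random(0)
noisy = gaussian_dropout([1.0] * 20000, rate=0.2, rng=rng)
stds = {rate: noise_std(rate) for rate in (0.2, 0.4, 0.6, 0.9)}
```

Under this definition the noise grows quickly with the rate: a rate of 0.2 gives a standard deviation of 0.5, while a rate of 0.9 gives a standard deviation of about 3, i.e. much stronger regularization for the smaller datasets.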


The parameter settings applied during the experiments
for the proposed models are summarized in Table 2.


5.3 Experimental Results

The experimental results of our models are shown in Table 3.
Comparing our models with each other, we can see that the
RNN-based Bi-GRU model outperforms the TDNN-based
model on all datasets. It achieved its best results with
100-dimensional embeddings, using either pretrained or random
initialization.


We observed that for both Bi-GRU and TDNN, the embedding
type is not a significant factor affecting
model performance. The results also showed that the embedding
dimension did not contribute notably to the final result: the
difference in model performance across settings was small.


Besides our models, we ran experiments on all
datasets with the previous models we compare against. For three
of the datasets, specifically “ASSISTment09 corrected”,
“ASSISTment12” and “ASSISTment17”, no results were
available in the corresponding papers. In this paper, we
present the results of the experiments we ran using the code
of those models.
Table 2: Models experiments settings

| Parameters | Bi-GRU | TDNN |
|---|---|---|
| Learning rate | 0.001 | 0.001 |
| Learning rate schedule | yes | no |
| Training epochs | 30 | 30 |
| Batch size | 32 | 50 |
| Optimizer | Adam | AdaMax |
| History window length | 50 | 50 |
| Skill embeddings dim. | 100 & 300 | 100 & 300 |
| Skill embeddings type | Random, W2V, FastText | Random, W2V, FastText |
| Responses embeddings dim. | Same as skill dim. | Same as skill dim. |
| Responses embeddings type | Random | Random |


Table 3: Comparison between our proposed models - AUC (%). (R) = random skill embedding initialization,
(W) = skill embedding initialization using W2V, (F) = skill embedding initialization using FastText. Datasets:
(a) ASSISTment09, (b) ASSISTment09 corrected, (c) ASSISTment12, (d) ASSISTment17, (e) FSAI-F1toF3

(a)

| Model | d_q = 100 (R) | d_q = 300 (R) | d_q = 100 (W) | d_q = 300 (W) | d_q = 300 (F) |
|---|---|---|---|---|---|
| Bi-GRU | 82.55 | 82.45 | 82.52 | 82.55 | 82.39 |
| TDNN | 81.54 | 81.67 | 81.59 | 81.50 | 81.53 |

(b), (c): [sub-table values not recoverable from the extracted text]

(d)

| Model | d_q = 100 (R) | d_q = 300 (R) | d_q = 100 (W) | d_q = 300 (W) | d_q = 300 (F) |
|---|---|---|---|---|---|
| Bi-GRU | 73.62 | 73.58 | 73.76 | 73.54 | 73.58 |
| TDNN | 71.68 | 71.75 | 71.52 | 71.81 | 71.33 |

(e)

| Model | d_q = 100 (R) | d_q = 300 (R) | d_q = 100 (W) | d_q = 300 (W) | d_q = 300 (F) |
|---|---|---|---|---|---|
| Bi-GRU | 70.47 | 69.34 | 70.24 | 69.80 | 69.51 |
| TDNN | 70.03 | 69.80 | 69.80 | 70.11 | 70.06 |


The best experimental results of our models in comparison
with the previous models for each dataset are presented
in Table 4. The Bi-GRU model has the best performance
on four of the datasets. In addition, the
TDNN-based model performs better than
the previous models on four datasets. The only dataset
on which the previous models outperformed our models is
“ASSISTment12”.


5.4 Discussion 

Our model architecture is loosely based on the DKT model 
and offers improvements in the aspects discussed below. First, 
we employ embeddings for representing both skills and re- 
sponses. It is known that embeddings offer more useful rep- 
resentations compared to one-hot encoding because they can 
capture the similarity between the items they represent [20]. 
Second, we thoroughly examined dynamical neural models 
for estimating the student knowledge state by trying both 
infinite-memory RNNs and finite-memory TDNNs. To our
knowledge, TDNNs have not been well studied in the literature
with respect to this problem. Third, we used convolutional
layers in the input encoding sub-net. We found that
this layer functioned as a reducing mechanism of the embed- 
ding dimensions and in conjunction with the dropout layer 
mitigated the overfitting problem. The use of Convolutional 
layers is a novelty in models tackling the knowledge tracing 
problem. Fourth, unlike DKT, we used more hidden layers 
in the classification sub-net. Our experiments demonstrate
that this gives more discriminating capability to the classifier
and improves the results. Finally, our experiments with
key-value modules and attention mechanisms did not
further improve our results, so they are not
reported here. In the majority of the datasets we examined,
our model outperforms state-of-the-art models employing
key-value mechanisms, such as DKVMN and Deep-IRT.


Table 4: Comparison test results of evaluation measures - the AUC metric (%)

| Dataset | DKT | DKVMN | Deep-IRT | Bi-GRU | TDNN |
|---|---|---|---|---|---|
| ASSISTment09 | 81.56% | 81.61% | 81.65% | 82.55% (1) | 81.67% (3) |
| ASSISTment09 corrected | 74.27% | 74.06% | 73.41% | 75.27% | 74.40% (3) |
| ASSISTment12 | 69.40% | 69.26% | 69.73% | 68.40% | 67.99% |
| ASSISTment17 | 66.85% | 70.25% | 70.54% | 73.76% | 71.83% |
| FSAI-F1toF3 | 69.42% | 68.40% | 68.69% | 70.47% (1) | 70.11% (4) |

(1) d_q = d_r = 100, Random; (2) d_q = d_r = 100, W2V; (3) d_q = d_r = 300, Random;
(4) d_q = d_r = 300, W2V; (5) d_q = d_r = 300, FastText


Table 5: Statistical significance testing results of Bi-GRU and TDNN

| Dataset | P-value |
|---|---|
| ASSISTment09 | 7.34e-59 |
| ASSISTment09 corrected | 2.31e-52 |
| ASSISTment12 | 1.45e-203 |
| ASSISTment17 | 7.96e-44 |
| FSAI-F1toF3 | 1.38e-84 |


In addition to the AUC metric, which is typically used for
evaluating the performance of machine learning models,
we applied statistical significance testing to check the
similarity between our Bi-GRU and TDNN models. Specifically,
we performed a t-test between the outcomes of the
two models on all training data, using the best configuration
settings shown in Table 4. The results reported in Table 5
show that the p-value is practically zero in all cases,
which supports the hypothesis that the two models are
significantly different.
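A significance test of this kind can be sketched as follows. The example uses a paired t-test on the per-sample outputs of the two models and a normal approximation for the two-sided p-value, which is reasonable only for large samples. Both choices, as well as the function name `paired_t_test` and the toy data, are assumptions of this sketch and not a description of the exact test variant we ran.

```python
import math


def paired_t_test(outputs_a, outputs_b):
    """Paired t-test between the outputs of two models on the same samples.

    Returns (t, p), where p is a two-sided p-value computed with a normal
    approximation to the t distribution (adequate for large samples).
    """
    diffs = [a - b for a, b in zip(outputs_a, outputs_b)]
    n = len(diffs)
    mean = sum(diffs) / n
    var = sum((d - mean) ** 2 for d in diffs) / (n - 1)
    t = mean / math.sqrt(var / n)
    p = math.erfc(abs(t) / math.sqrt(2.0))
    return t, p


# toy predicted probabilities of two models on the same 200 samples
model_a = [0.9, 0.8, 0.7, 0.6] * 50
model_b = [0.7, 0.75, 0.5, 0.55] * 50
t_stat, p_value = paired_t_test(model_a, model_b)
```

With very large samples even small systematic differences between two models yield p-values close to zero, so the test indicates that the outputs differ, not how large the practical difference is.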


6. CONCLUSION AND FUTURE WORK 


In this paper we propose a novel two-part neural network 
architecture for predicting student performance in the next 
exam or exercise based on their performance in previous ex- 
ercises. The first part of the model is a dynamic network 
which tracks the student knowledge state and the second 
part is a multi-layer neural network classifier. For the dy- 
namic part we tested two different models: a potentially 
infinite memory recurrent Bidirectional GRU model and a 
finite memory Time-Delay neural network (TDNN). The ex- 
perimental process showed that the Bi-GRU model achieves 
better performance compared to the TDNN model. Although
TDNN models have not been used for
this problem in the past, our results show that they
can perform as well as or even better than previous
state-of-the-art RNN models, and only slightly worse than
our proposed RNN model. The model inputs are the
student’s skills and responses history which are encoded using
embedding vectors. Skill embeddings are initialized either 
randomly or by pretrained vectors representing the textual 
descriptions of the skills. A novel feature of our architec- 
ture is the addition of spatial dropout and convolutional 
layers immediately after the embeddings layers. These ad- 
ditions have been shown to reduce the overfitting problem. 
We found that the choice of initialization of the skill embed- 
dings has little effect on the outcome of our experiments. 
Moreover, noting that the same datasets are used differently
in different studies, we described the dataset pre-processing
in detail, and we provide the train, validation and test splits
of the data used in our experiments on our GitHub repository¹⁰.
Extensive experimentation with more benchmark datasets, as well
as the study of variants of the proposed models, will be the subject
of our future work, with the aim of further improving
the prediction performance of the models.
ABSTRACT


We investigated the feasibility of using automatic speech 
recognition (ASR) and natural language processing (NLP) to 
classify collaborative problem solving (CPS) skills from recorded 
speech in noisy environments. We analyzed data from 44 dyads 
of middle and high school students who used videoconferencing 
to collaboratively solve physics and math problems (35 and 9 
dyads in school and lab environments, respectively). Trained 
coders identified seven cognitive and social CPS skills (e.g., 
sharing information) in 8,660 utterances. We used a state-of-the- 
art deep transfer learning approach for NLP, Bidirectional 
Encoder Representations from Transformers (BERT), with a 
special input representation enabling the model to analyze 
adjacent utterances for contextual cues. We achieved a micro- 
average AUROC score (across seven CPS skills) of .80 using 
ASR transcripts, compared to .91 for human transcripts,
indicating a decrease in performance attributable to ASR error. 
We found that the noisy school setting introduced additional ASR 
error, which reduced model performance (micro-average AUROC 
of .78) compared to the lab (AUROC = .83). We discuss 
implications for real-time CPS assessment and support in 
schools. 


Keywords 
Collaborative problem solving; natural language processing; 
collaborative interfaces 


1. INTRODUCTION 


The modern world will increasingly require teams of 
heterogeneous individuals to coordinate their efforts, share skills 
and knowledge, and communicate effectively in order to solve 
complex and pressing problems like the global pandemic and 
climate change. Accordingly, collaborative problem solving 
(CPS) — defined as two or more people engaging in a coordinated 
attempt to construct and maintain a joint solution to a problem 
[57] — has been identified as a critical skill for the 21st century 
workforce [23, 27]. Despite its increasing importance, the most 
recent 2015 Programme for International Student Assessment 
(PISA) assessment revealed troubling deficiencies in CPS 
competency worldwide [49]. As a result, improving CPS 
proficiency has become a priority in educational research and 
policy [7, 8, 16, 37, 49]. 


Technology has fundamentally transformed both the modern
workplace and classroom. Co-located teams in shared spaces are 
becoming less common, while distributed teams that work and 
collaborate remotely through virtual interfaces are on the rise 
[22, 36]. In 2020, the COVID-19 pandemic thrust this issue to
the forefront of our attention, as workers and students across the 
globe were forced to adapt to a remote environment for extended 
periods of time. Accordingly, educational practitioners have 
emphasized the importance of providing students with the skills 
necessary to effectively collaborate in virtual settings [60]. 


The rise of videoconferencing in both workplace and learning 
environments brings with it the exciting opportunity to develop 
next-generation collaborative interfaces that can aid in teaching, 
assessing, and supporting CPS. Here we focus on the task of 
assessing CPS skills from spoken language with an eye for 
downstream applications including reflective feedback and 
dynamic interventions to improve CPS skills. 


Like any latent construct (e.g., intelligence, knowledge), 
assessment of CPS skills entails identifying objective evidence 
for those constructs. Because collaboration inherently involves 
communication, one promising approach is to analyze 
communication between team members [58]. Indeed, the content 
of communication during CPS provides information about a 
team’s cognitive and affective states, knowledge, information 
sharing, and coordination [27], and can serve as evidence of 
relevant CPS skills [3, 4]. 


However, analyzing the large amounts of data generated during 
open-ended collaboration is time consuming and costly, requiring 
trained human coders to review large corpora and hand-code
individual items for indicators of CPS. Previous work [24, 29,
58, 65] has attempted to automate this coding process using 
natural language processing (NLP) techniques. However, with 
the exception of [65], this has been limited to restricted forms of 
communication such as text chat, rather than open-ended verbal 
communication, which is characteristic of most real world CPS. 
As we elaborate below, the one study [65] that successfully 
analyzed spoken communications for evidence of CPS skills used 
data collected in a highly controlled lab environment, leaving 
open the question as to whether this approach will succeed in the 
wild, such as in noisy classroom environments. 


In this work, we address the challenge of using speech 
recognition and NLP to automatically analyze open-ended 
student speech during videoconferencing-enabled collaborative 
problem solving in both real-world schools and in lab 
environments. Pursuing technologies capable of automatically 
capturing and analyzing spoken language during open-ended 
verbal CPS in authentic environments, whether face-to-face or 
via videoconferencing, is an important avenue of research. These 
technologies hold the potential for significantly improving real-time
assessment and support of CPS [58], whether by providing
teachers with feedback on CPS in student groups or enabling 
just-in-time interventions to steer groups of problem solvers in 
the right direction. 


1.1 Background and Related Work 

We first present a brief discussion on theoretical frameworks of 
CPS to situate the CPS skills modeled in this study within the 
CPS literature. Then, we discuss prior work on computational 
models of CPS, specifically focusing on language-based models. 


1.1.1 Frameworks of CPS 

CPS has been defined as problem solving activities that involve 
interactions among a group of individuals [47]. One early attempt 
to conceptualize CPS was by Roschelle and Teasley [57] who 
proposed a joint problem space model that emphasized shared 
understanding of the task as a central aspect of CPS. More 
recently, the Assessment and Teaching of Twenty-First Century 
Skills (ATC21S) framework [28, 30] described CPS through a 
measurable and teachable set of social and cognitive skills based 
on interaction, self-evaluation and goal setting. Relatedly, the 
PISA 2015 [49] framework conceptualized CPS as a complex 
process involving three collaborative dimensions that overlap 
with four problem-solving processes resulting in 12 CPS skills. 
Building on these frameworks, Sun et al. [68] proposed a 
generalized competency framework for CPS skills based on 
interactions among triads, which defines a hierarchical CPS 
model involving three high-level facets of CPS, each composed 
of sub-facets and associated behavioral indicators. Another 
approach, and the framework adopted in this work, is the in-task 
assessment framework [34]. Informed by principles of evidence- 
centered design [41], this framework characterizes CPS through a 
hierarchical ontology [3], which lays out theoretically-grounded, 
generalizable CPS skills along with behavioral indicators of 
these skills. 


1.1.2 Computational Models of CPS

The stream of interactions generated during problem solving is 
considered the richest source of information about a team’s 
knowledge, skills, and abilities [27, 38]. Accordingly, prior 
research has used non-verbal behavioral signals like facial 
expressions to detect rapport loss in small groups during open- 
ended discussions [43]. Multimodal combinations of facial 
expressions, acoustics and prosody, eye gaze, and task context 
have been explored to predict CPS outcomes like task 
performance [42, 67]. Additionally, learning gains [32, 50], 
subjective performance [72] and CPS competence [13, 14] have 
been modelled using multimodal signals. 


Focusing our review on studies that explored the use of language 
and speech based data, researchers have successfully used 
language to model CPS processes like idea sharing [24, 29], 
negotiation [65], and argumentation [58], as well as CPS 
outcomes such as task performance [10, 44, 51] and learning 
gains [55]. A common NLP approach involves quantifying the 
frequency of words and word phrases (n-grams) [24, 29, 44, 54, 
58]. Further, some research has experimented with the use of 
additional lexical features like punctuation [24, 29, 58], part-of- 
speech tags [21, 44, 58], or emoticons [29]. In addition to using 
lexical features from language itself, researchers have derived
features from conversational data which index team and
conversational dynamics (e.g., turn taking). This approach has
been used to provide feedback on collaboration [59], identify
sociocognitive roles [20], and model intra- and interpersonal
dynamics [19] during CPS.
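The n-gram frequency features mentioned above can be illustrated with a short sketch. The whitespace tokenization, the lowercasing and the function name `ngram_counts` are simplifying assumptions of the example and do not reproduce the feature extraction of any of the cited studies.

```python
from collections import Counter


def ngram_counts(utterance, n_max=2):
    """Count all word n-grams of length 1..n_max in one utterance."""
    tokens = utterance.lower().split()
    counts = Counter()
    for n in range(1, n_max + 1):
        for i in range(len(tokens) - n + 1):
            counts[" ".join(tokens[i : i + n])] += 1
    return counts


features = ngram_counts("Okay and now put a weight down on that")
```

Each utterance is thus represented by a sparse vector of unigram and bigram frequencies, which can then be passed to a standard classifier.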


Closely related to our work, Hao et al. [29] used pre-selected n- 
grams and emoticons to model four CPS facets of sharing ideas, 
negotiating, regulating problem-solving activities, and 
maintaining communication. Their study involved data collected 
from 1000 participants with at least one year of college 
experience randomly grouped into dyads. They used a linear 
chain conditional random field and extracted lexical features 
from sequential text chats between dyads. They found that 
sequential modeling achieved an average accuracy of 73.2%, 
which outperformed a majority-class baseline accuracy of 29%, 
and slightly outperformed standard classifiers (accuracies of 
66.9% to 71.9%). 


Whereas the Hao study analyzed text-chats among dyads, Stewart 
et al. [65] modeled the three CPS facets of construction of shared 
knowledge, negotiation and coordination, and maintaining team 
function from spoken trialogues (conversations among triads). 
The study involved 32 triads of undergraduate students from a 
medium-sized private university, engaged in a 20-minute 
computer programming task using video conferencing software in 
a lab setting. They used ASR to generate transcripts of the 
team’s speech during problem solving, from which they derived 
n-gram features for modeling. They obtained area under the 
receiver operating characteristic curve (AUROC) scores of .85, 
.77 and .77 for the three CPS facets using random forest 
classifiers, exceeding chance baselines of 0.5. In a follow-up 
study [66], they investigated whether including additional 
modalities (facial expression, acoustic-prosodic features, task 
context) in addition to language improved classification accuracy. 
They found that a combination of language and task context 
yielded slight improvement over unimodal language models. 


1.2 Current Study and Novelty 

There are several novel aspects of this work. First, although 
recent work [65, 66] has successfully used ASR and NLP to 
automatically analyze speech during CPS in the lab, it is 
currently unknown whether this approach can be effective in the 
wild, for example in noisy real-world classrooms where CPS 
interactions would occur. Lab environments have the advantage 
of being free from ambient noises, distractions from other 
students, and various other complicating factors present in school 
environments. 


Further, previous work has been limited to adults, namely 
undergraduate students. However, given the importance of CPS, 
it is imperative that technologies be developed that can help 
instruct and support CPS in middle and high school-aged 
students. Therefore, a second important question is whether this 
approach can be applied to children, who may have differing CPS 
abilities and communication styles. An accompanying question is 
whether ASR can provide sufficiently accurate transcripts of 
children’s speech, as research has documented the degradation of 
ASR performance on children’s speech due to ASR systems 
primarily trained on adult speech, and age-dependent spectral 
and temporal variability in speech signals [26, 45, 53]. 


We address these questions by recording audio of remote CPS 
among middle and high school students in both the lab and 
computer-enabled classrooms with multiple teams interacting.
We show for the first time that in noisy school environments, 
ASR can provide transcripts of sufficient accuracy to model CPS 
skills. Additionally, we quantify the decrease in predictive 
accuracy that can be attributed to ASR error (vs. NLP error) by
comparing with models trained on human transcripts, and by
comparing lab and classroom environments.


Finally, an open question in this domain is which NLP 
algorithms should be used to automatically analyze CPS 
language. We explore the use of deep transfer learning for this 
NLP problem. Recent advances in state-of-the-art NLP have been 
attained by adapting attention-based language models [71], pre- 
trained on large amounts of unlabeled data, to specific NLP tasks 
(e.g., text classification) [31]. We demonstrate the efficacy of this 
approach, using the popular Bidirectional Encoder 
Representations from Transformers (BERT) model [18] for our 
NLP task, and compare results with a more traditional n-gram 
approach using random forest classifiers. We also investigate 
whether a sequential classifier, which considers adjacent (i.e. 
previous, subsequent) utterances for contextual cues, yields 
improved performance over single utterance classifiers. We 
present a method, similar to the approaches used in [12, 69], to 
capture adjacent utterances for context by constructing a special 
input representation for the BERT model, which improves 
classification accuracy. 
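The general idea of giving the model adjacent utterances as context can be sketched as below. The layout used here (previous, current and next utterance separated by [SEP] markers) is an assumption made purely for illustration and may differ from the input representation used in our experiments. The function name is hypothetical, and in practice a BERT tokenizer inserts the special tokens itself.

```python
def build_context_input(utterances, index):
    """Concatenate an utterance with its neighbours into one input string.

    Layout (an assumption of this sketch):
        [CLS] previous [SEP] current [SEP] next [SEP]
    A missing neighbour at the start or end of a session is left empty.
    """
    prev_utt = utterances[index - 1] if index > 0 else ""
    next_utt = utterances[index + 1] if index + 1 < len(utterances) else ""
    segments = [prev_utt, utterances[index], next_utt]
    return " ".join(["[CLS]"] + [f"{seg} [SEP]".strip() for seg in segments])


session = ["which one do you think is the best one", "the second one", "okay"]
model_input = build_context_input(session, 1)
```

A single-utterance classifier would see only "the second one", which carries little information on its own. With its neighbours the same utterance can be recognized as a response to a request for information.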


2. METHOD 
2.1 Data Collection 


2.1.1 Contexts 

Our primary data collection occurred in one United States east 
coast public middle school and one public high school from the 
same district. The study was run over two data collection 
periods. The first period included 61 students in the high school 
and 44 students in the middle school. Here, students participated 
in two 43-minute class periods. The second collection included
18 students from the same middle school. Because we did not
have control over the acoustic environment in the school context,
we also collected supplementary data from 18 students in the lab.
In the second collection, students completed one 90-minute
session. In both collections, students in the school environment 
completed the study from a computer lab in the school in which 
other students were also participating in the study. Data 
collection occurred prior to the COVID-19 pandemic, and as such 
classrooms were at normal capacity. Students in both 
environments were equipped with a personal headset and 
microphone (MPOW 071 USB Headset). 


2.1.2 Participants

In all, 141 middle and high school students (age range: 12-15) 
completed some or all of the study. However, only a subset of 74 
sessions (a session entails one dyad completing one of the tasks) 
were included in this analysis. Participants were excluded for the 
following reasons: we experienced technical challenges on the 
first day of data collection, either team member did not complete 
a consent form, one team member did not show up, or there were 
quality issues with the recorded audio stream. Our analyzed 
dataset consisted of 88 students (65% female; mean age = 13.6, 
SD = 0.90). The lab subset contained 18 students (50% female;
mean age = 13.6, SD = 1.01) and the school subset contained 70
students (69% female; mean age = 13.6, SD = 0.87). The sample
of 88 students was quite diverse with 26.1% self-reporting as 
Black/African American, 19.3% Hispanic/Latino, 15.9% 
Multiracial, 13.6% Asian/Asian American, 12.5% White, 2.3% 
American Indian/Alaska Native, 6.8% reported “Other”, and 
3.4% did not report ethnicity. 


2.1.3 CPS Tasks 

The study involved two separate CPS tasks. In one task on linear 
functions and argumentation (T-Shirt Math Task [1]), students 
worked together through a series of task items in which they 
sought to determine which of three t-shirt companies was the 
best choice for a student council to purchase t-shirts for 
classmates. They compared three companies with differing 
variable costs (price per shirt) and fixed costs (upfront fee) to 
determine which company should be chosen given the number of 
t-shirts to be purchased. Individual questions included populating 
the cost equation y = mx + b according to the costs of each 
company (see Figure 1B), identifying the correct graph for a 
given company’s cost equation, and providing a recommendation 
as to which company was the best deal. During this task, only 
one student controlled the screen at a time (i.e. to enter responses 
to the questions), and the two students could alternate control as 
they chose. 
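To illustrate the kind of reasoning the task requires, the sketch below evaluates the cost equation y = mx + b for the three companies using the prices shown in Figure 1B. The function names are hypothetical, and leaving Shirts For Less out for orders above 350 shirts is an assumption, since the task text only states its fee for up to 350 shirts.

```python
def costs(n_shirts):
    """Total cost per company for an order of n_shirts (prices from Figure 1B)."""
    offers = {
        "EZ Tees": 8 * n_shirts + 200,           # y = 8x + 200
        "Perfect Printing": 4 * n_shirts + 500,  # y = 4x + 500
    }
    if n_shirts <= 350:  # flat fee is only stated for up to 350 shirts
        offers["Shirts For Less"] = 1500
    return offers


def cheapest(n_shirts):
    """Name of the company with the lowest total cost for this order size."""
    offers = costs(n_shirts)
    return min(offers, key=offers.get)


recommendations = {n: cheapest(n) for n in (50, 100, 300)}
```

The first two companies cost the same at 75 shirts (8x + 200 = 4x + 500 gives x = 75), so the best choice depends on the order size, which is what students have to argue about.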


[Figure 1. Panel A: a level in Physics Playground. Panel B: a question from the T-Shirt Math Task, with the following text:
1. EZ Tees charges $8 per shirt, and has a one-time upfront fee of $200.
2. Perfect Printing charges $4 per shirt, has a one-time upfront fee of $500.
3. Shirts For Less charges a fee of $1,500 for up to 350 shirts.
Question 3: Please discuss with your partner and use the options below to enter the cost equation values for EZ Tees.
y = [Choose...] x + [Choose...] (options: 8, 500, 4, 200, 350, 300, 0, 1, 1500)]

Figure 1. Screenshot examples of the videoconferencing setup 
and two CPS tasks. (A) Shows a level in Physics Playground, 
(B) shows a question from the T-Shirt Math Task 
(reproduced with permission from ETS). 


The second task (Physics Playground [62]) was an educational 
physics game designed to help students learn concepts in 
Newtonian physics. In this task, students completed a series of 
six game levels in which they were tasked with drawing objects 
(e.g., lever, ramp, springboard) to guide a ball to hit a balloon 
target (see Figure 1A in which students are drawing a weight 
attached to the springboard to launch the ball towards the 
balloon). During this task, only one student controlled the game 
at a time. One student was selected to control first, and after 
three levels had been completed (or half of the allotted time had
elapsed), control was switched to the other student for the 
following three levels. Whereas the math task resembles more 
traditional school work and is more constrained by prior 
knowledge, the physics game provides more opportunities for 
creative exploration [35]. 


2.1.4 Procedure 

Students were randomly assigned to pairs (27 mixed-gender, 17 
same-gender pairs) and each student first individually completed 
a series of pre-surveys; details are not relevant here. Once both 
students in the pair completed the pre-surveys, a researcher 
enabled audio and video recording on each student’s computer 
using Zoom video conferencing software (https://zoom.us) to 
record students’ computer screens, faces, and voices. The student 
teams then worked together to complete the two CPS tasks, 
either on a different day or the same day (see above). The order 
of the tasks was counterbalanced so that half of the teams 
completed Physics Playground first and the other half completed 
the T-Shirt Math Task first. After completing each task students 
individually completed additional questionnaires not analyzed 
here. 


2.2 CPS Ontology and CPS Skills 
2.2.1 CPS Ontology (Framework) 


We used a competency model represented as an ontology [3, 4] 
(similar to a concept map), which lays out the components of 
CPS and their relationships, along with indicators of CPS skills. 
The development of the ontology was based on discussions with 
subject matter experts as well as a literature review in relevant 
areas such as computer-supported collaborative learning, 
individual problem solving, communication, and linguistics [30, 
39, 46, 48, 49, 64]. 


Our CPS ontology [3] includes nine high-level CPS skills across 
social and cognitive dimensions and sub-skills that correspond to 
each high-level skill. The social dimension includes four CPS 
skills: (1) Maintaining communication corresponds to content 
irrelevant social communications among teammates (e.g., 
greeting teammates or engaging in off-topic conversations); (2) 
Sharing information corresponds to task-relevant communication 
that is useful for solving the problem (e.g., sharing one’s own 
knowledge, sharing the state of one’s understanding); (3) 
Establishing shared understanding includes communication used 
to learn the perspectives of others and ensure that what has been 
said is understood by teammates (e.g., requesting information
from teammates, providing responses that indicate
comprehension); and (4) Negotiating corresponds to
communication used to express agreement, express
disagreement, or resolve conflicts that arise.


The cognitive dimension includes five CPS skills: (1) Exploring 
and understanding corresponds to communication and actions 
used to explore the environments in which teammates are 
working or understand the problem at hand (e.g., rereading 
problem prompts); (2) Representing and formulating includes 
communication used to build a mental representation of the 
problem and formulate hypotheses; (3) Planning corresponds to 
communication used to develop a plan for solving the problem 
(e.g., determining goals or establishing steps for carrying out a 
plan); (4) Executing corresponds to actions and communication 
used to carry out a plan (e.g., taking steps to carry out a plan, 
reporting to teammates what steps you are taking, or making 
suggestions to teammates about what steps they should take to 
carry out the plan); and (5) Monitoring includes communication 
used to monitor progress towards the goal or monitor teammates 
(e.g., checking the progress or status of teammates). 


Table 1. The 7 CPS skills modeled, ordered from highest to lowest prevalence 

CPS Skill | Dimension | Base Rate | Example Human Transcript | Corresponding ASR Transcript 
Sharing Information | Social | .26 | (Math) “Okay so first I think we should create like three equations to for each company” | “Okay Sir thank first we should create like three D creations for each arm company” 
Establishing Shared Understanding | Social | [illegible] | (Math) “Which one do you think is the best one” | “Twenty it’s the best” 
Negotiating | Social | .16 | (Physics) “Umm no let’s just do another idea I don’t think it’s gonna work anymore” | “Let's just do it another day I don't think it's going to work anymore” 
Executing | Cognitive | .14 | (Physics) “Okay and now put a weight down on that” | “Okay and now put a weight down on the” 
Maintaining Communication | Social | .07 | (Physics) “(laughs) Oh no this game is funny bro yeah I don't know what to do” | “This came funny I would like to do” 
Monitoring | Cognitive | .06 | (Physics) “That didn't work oh no” | “That didn't recall about” 
Planning | Cognitive | .05 | (Math) “Alright now we have to find a graph for this one now” | “Now we have to find a crusher this one now” 


Proceedings of The 14th International Conference on Educational Data Mining (EDM 2021) 

2.2.2 CPS Coding 

Video recordings of student task sessions were segmented at the 
turn (or utterance) level and then coded by three trained raters 
using Dedoose qualitative analysis software [17]. For the coding, 
raters viewed each turn for each individual in a team and then 
labeled the turn as one of the CPS skills from the CPS ontology. 
To establish reliability, the three trained raters triple-coded 20% 
of the videos. Intraclass correlations (ICCs) were used to 
estimate interrater reliability across rater judgments, as they 
provide information about the consistency of the judgments 
among raters. The median ICC across the CPS skill ratings was 
.93, corresponding to excellent agreement [11]. 


Once reliability was established, the remaining videos were split 
among the three raters and coded independently. A total of 
10,239 turns were coded across 80 CPS sessions with an average 
of 128 turns per session (SD = 70.5). Two CPS skills (exploring 
and understanding, and representing and formulating) occurred 
very infrequently (base rate < 1%) and were excluded from our 
analysis. The remaining seven CPS skills, with their base rate, 
cognitive/social dimension, and a sample utterance from the 
dataset, are shown in Table 1. 


2.3 ASR and Human Transcript Generation 
After segmenting and coding each utterance, we used the IBM 
Watson speech-to-text service [33] to generate ASR transcripts 
for each video. The service outputs transcripts with word-level 
start and stop times, as well as word-level confidence (between 0 
and 1) for each word recognized. We constructed the transcript 
for each coded utterance by concatenating transcribed words 
within the utterance’s human segmented time window. The 
confidence for each utterance was computed by taking the mean 
word confidence over all words in the utterance transcript. 
Utterances in which no words were recognized were assigned a 
confidence of 0. Because a single audio stream of each session 
was recorded (rather than individual audio streams from each 
student), the ASR transcripts can contain words from both 
speakers if there was overlap (elaborated below). 
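
The construction of utterance-level ASR transcripts described above can be sketched as follows. This is a minimal illustration rather than the authors' code: the Word record, the function name utterance_from_words, and the rule that a word belongs to an utterance when its start time falls inside the human-segmented window are all assumptions, since the text does not say how words straddling a window boundary were handled.

```python
from dataclasses import dataclass
from statistics import mean


@dataclass
class Word:
    """One recognized word with timing and confidence (hypothetical record)."""
    text: str
    start: float       # seconds from the start of the session recording
    end: float
    confidence: float  # word-level ASR confidence in [0, 1]


def utterance_from_words(words, window_start, window_end):
    """Return (transcript, confidence) for one human-segmented time window."""
    # Assumption: a word is assigned to the window if it starts inside it.
    inside = [w for w in words if window_start <= w.start < window_end]
    if not inside:
        # No recognized words: empty transcript with confidence 0.
        return "", 0.0
    transcript = " ".join(w.text for w in inside)
    # Utterance confidence is the mean of the word-level confidences.
    return transcript, mean(w.confidence for w in inside)
```

Because the session audio is a single stream, every recognized word in the window is kept regardless of who spoke it, which is how words from both speakers can end up in one utterance transcript.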


We also manually transcribed each utterance from the CPS 
videos. Human transcribers viewed the video segment (with 
audio) of each coded utterance and transcribed the words spoken 
by the indicated speaker (each utterance was coded for an 
individual student). Speech from the other student, if present in 
the segment, was not transcribed. Prior to transcription, 
guidelines were established among the human transcribers to 
ensure consistency in transcribing informal words or phrases 
(e.g., gonna, c’mon). 


Because the segmented utterances sometimes contained speech 
from both speakers, we had alignment inconsistencies, as the 
ASR transcribed all words in a segment while the human 
transcripts only contained words spoken by the indicated student. 
To better assess ASR accuracy, we randomly sampled 10 
utterances from each CPS session (8.5% of the data) and re- 
transcribed the utterances to include all words spoken in the 
segment, regardless of speaker. We refer to this as the Human 
Transcript Subset. We then computed a word error rate (WER) 
[9] for each utterance in this subset, defined as (substitutions + 
insertions + deletions) / (words in human transcript), using the 
Python package jiwer [70]. 
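
To make the WER definition concrete, the sketch below computes it with a word-level edit distance. The reported values were computed with jiwer; this dependency-free function is only an illustration, and the whitespace tokenization and the name word_error_rate are assumptions.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """(substitutions + insertions + deletions) / (words in the reference)."""
    ref, hyp = reference.split(), hypothesis.split()
    if not ref:
        raise ValueError("reference transcript must contain at least one word")
    # prev[j] holds the edit distance between ref[:i-1] and hyp[:j].
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr[j] = min(
                prev[j] + 1,         # deletion of a reference word
                curr[j - 1] + 1,     # insertion of a hypothesis word
                prev[j - 1] + cost,  # substitution, or match when cost is 0
            )
        prev = curr
    return prev[-1] / len(ref)
```

For the Monitoring example in Table 1 (lowercased), the reference "that didn't work oh no" and the hypothesis "that didn't recall about" differ by two substitutions and one deletion over five reference words, giving a WER of 0.6. An empty hypothesis yields a WER of 1, and insertions can push the value above 1.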


2.4 Analyzed Dataset 

Our dataset contains 74 CPS task sessions from 44 teams. This 
includes 30 teams with both the math and physics tasks in the 
dataset, nine teams with only the math task and five teams with 
only the physics task. Eighteen of the 74 sessions occurred in the lab, 
and the remaining 56 sessions occurred in school environments. 
The dataset consists of 8,660 utterances coded with CPS skills, 
and corresponding transcripts. Of these utterances, 2,751 (32%) 
were from lab sessions and the other 5,909 (68%) were from 
school sessions. 


2.5 Machine Learning 

We adopted a supervised classification approach to predict the 
ground truth CPS skill for each utterance. We first implemented 
a bag-of-n-grams approach using a Random Forest Classifier, as 
recent literature [65] has shown this method to be effective for 
the classification of CPS utterances. Next, we explored deep 
transfer learning as a means to improve upon this method. In 
particular, we leveraged pre-trained language models and 
employed the popular Bidirectional Encoder Representations 
from Transformers (BERT) model [18]. Additionally, we tested a 
method (BERT-seq) which takes a sequence of utterances as 
input (the utterance to classify plus the previous and subsequent 
utterances) to capture contextual information, in order to 
determine if including adjacent utterances improves 
classification accuracy. We trained separate models (RF, BERT, 
and BERT-seq) using the ASR transcripts and human transcripts 
as input. 


2.5.1 Random Forest N-Grams 

We first followed the approach outlined in [65] and trained 
Random Forest Classifiers to predict the CPS skill for each 
utterance using n-gram features. We used unigrams (words) and 
bigrams (two-word phrases) as the features for our Random 
Forest classifiers. Trigrams and beyond were not used since very 
few unique trigrams (only 6) occurred in >1% of utterances. We 
explored excluding n-grams that occurred at less than a minimum 
frequency in the training dataset, testing values of 0% (no 
filtering), 1% and 2% as hyperparameters. We used the scikit- 
learn [52] library’s implementation of the Random Forest 
Classifier with 200 estimators. 
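
The n-gram feature extraction with a minimum-frequency filter can be sketched in plain Python as below. This is an illustration under assumptions: whitespace tokenization, lowercasing, and the helper names build_vocabulary and vectorize are not taken from the paper, and "frequency" is interpreted here as the proportion of training utterances containing the n-gram. With scikit-learn, a roughly comparable setup would be CountVectorizer(ngram_range=(1, 2), min_df=0.01) feeding RandomForestClassifier(n_estimators=200), though the text does not state which vectorizer was used.

```python
from collections import Counter


def ngrams(tokens, n):
    """All contiguous n-grams of a token list, joined with spaces."""
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def build_vocabulary(utterances, min_df=0.01):
    """Unigrams and bigrams found in at least min_df of the training utterances."""
    doc_freq = Counter()
    for text in utterances:
        tokens = text.lower().split()
        # Count each n-gram once per utterance (document frequency).
        doc_freq.update(set(ngrams(tokens, 1) + ngrams(tokens, 2)))
    threshold = min_df * len(utterances)
    return sorted(g for g, df in doc_freq.items() if df >= threshold)


def vectorize(text, vocabulary):
    """Count vector of one utterance over a fixed vocabulary."""
    tokens = text.lower().split()
    counts = Counter(ngrams(tokens, 1) + ngrams(tokens, 2))
    return [counts[g] for g in vocabulary]
```

The vocabulary must be built from the training folds only, so that the filter never sees test utterances.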


2.5.2 BERT 

We used a transfer learning approach and fine-tuned pre-trained 
BERT models to predict the CPS skill for each utterance. This 
entailed starting with a BERT model pre-trained on a large 
amount of unlabeled data, then fine-tuning it on our dataset of 
transcribed utterances and corresponding labels (CPS skills). We 
first processed the transcribed utterances using WordPiece 
tokenization [61]. This process entailed splitting an utterance 
into a sequence of words, or parts of words. Each unique word or 
word piece was then converted to an integer (called a token) 
according to BERT’s pre-specified vocabulary. Finally, special 
tokens ([CLS] and [SEP]) were added to the beginning and 
end of this sequence of integers, and the sequence was provided 
as input to BERT (see Figure 2A). BERT mapped each input 
token to a 768-dimensional embedding, which serves as a 
semantic representation of the input token (the embeddings of the 
special [CLS] and [SEP] tokens capture a semantic 
representation of the entire sequence of input tokens). 

Figure 2. (A) The traditional BERT model used for text classification. (B) Our BERT-seq model which captures contextual 
information from the previous and subsequent utterances during classification. 


For classification, the embedding of the [CLS] token was used as 
input to a fully connected layer (classifier), which output 
predicted probabilities for the seven CPS skills. We used 
multiclass learning, meaning that all seven CPS skills were 
predicted by one model. 
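
The input preparation for a single utterance can be illustrated with a toy version of WordPiece tokenization. The sketch below uses the greedy longest-match-first rule with "##" marking word-internal pieces; the tiny vocabulary TOY_VOCAB and the function names are hypothetical, and the real BertTokenizer additionally handles punctuation and text normalization, which are omitted here.

```python
# Hypothetical miniature vocabulary; BERT's real vocabulary has about 30,000 entries.
TOY_VOCAB = {"[PAD]": 0, "[UNK]": 1, "[CLS]": 2, "[SEP]": 3,
             "okay": 4, "put": 5, "a": 6, "weight": 7, "##s": 8, "down": 9}


def wordpiece(word, vocab):
    """Split one word into the longest matching vocabulary pieces, left to right."""
    pieces, start = [], 0
    while start < len(word):
        end, piece = len(word), None
        while end > start:
            candidate = word[start:end] if start == 0 else "##" + word[start:end]
            if candidate in vocab:
                piece = candidate
                break
            end -= 1
        if piece is None:
            # No piece matches at this position: the whole word is unknown.
            return ["[UNK]"]
        pieces.append(piece)
        start = end
    return pieces


def encode(utterance, vocab):
    """Return (tokens, integer ids) with [CLS] first and [SEP] last."""
    tokens = ["[CLS]"]
    for word in utterance.lower().split():
        tokens.extend(wordpiece(word, vocab))
    tokens.append("[SEP]")
    return tokens, [vocab[t] for t in tokens]
```

In the single-utterance model, the embedding at position 0 (the [CLS] token) is the one passed to the classification layer.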


2.5.3 BERT-seq 

We propose a method to incorporate contextual utterances during 
classification by creating a special input representation, without 
augmenting the BERT architecture. This method takes a 
sequence of three utterances as input (the utterance to classify 
plus the previous and subsequent utterances), which are used to 
train two separate BERT models, each including either the 
previous or subsequent utterance in the BERT input (see Figure 
2B). To add a pair of adjacent utterances to the input, we first 
processed each utterance individually using WordPiece 
tokenization as described above. The special [CLS] token was 
then added to the beginning of this sequence, and a [SEP] token 
was added to the end of both the first and second utterances. To 
classify the utterance, the embedding of the corresponding [SEP] 
token was used as input to a fully connected layer, which output 
predictions for the 7 CPS skills. Finally, the predicted 
probabilities of the previous and subsequent utterance models 
were averaged. This method of representing a sequence of 
utterances enables the self-attention layers of BERT to leverage 
contextual information from the previous and subsequent 
utterances, while still utilizing the pre-trained BERT weights. 
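
The BERT-seq input representation can be sketched as below, operating on already tokenized utterances. Two details are assumptions made for illustration: that the pairs are ordered (previous, target) and (target, subsequent), and that the "corresponding" [SEP] token is the one closing the target utterance. The function names are hypothetical.

```python
def pair_input(first, second):
    """Build [CLS] first [SEP] second [SEP] and report both [SEP] positions."""
    tokens = ["[CLS]"] + first + ["[SEP]"] + second + ["[SEP]"]
    first_sep = 1 + len(first)   # index of the [SEP] closing the first utterance
    second_sep = len(tokens) - 1  # index of the [SEP] closing the second utterance
    return tokens, first_sep, second_sep


def bert_seq_inputs(prev_utt, target, next_utt):
    """Inputs for the two models, each with the index of the target's [SEP]."""
    # Previous-utterance model: the target is the second segment.
    tokens_a, _, sep_a = pair_input(prev_utt, target)
    # Subsequent-utterance model: the target is the first segment.
    tokens_b, sep_b, _ = pair_input(target, next_utt)
    return (tokens_a, sep_a), (tokens_b, sep_b)


def average_probabilities(probs_a, probs_b):
    """Average the class probabilities predicted by the two models."""
    return [(a + b) / 2 for a, b in zip(probs_a, probs_b)]
```

Each model reads the embedding at its reported [SEP] index and maps it to seven class probabilities; the final prediction is the element-wise average of the two probability vectors.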


For both BERT and BERT-seq we started with the transformers 
[73] library’s implementation of the BertModel with the “bert- 
base-uncased” pre-trained weights, and used the BertTokenizer 
to process our utterances. We then fine-tuned the models for 
three epochs using a batch size of 16. We found that fine-tuning 
beyond three epochs did not substantially improve model 
performance. 


2.5.4 Cross Validation 

We used team-level 10-fold cross-validation to assess the 
accuracy of our classifiers. With our dataset of 44 teams, this 
entailed training a model with utterances from 90% of teams (39 
or 40 teams), then evaluating the model’s predictive accuracy on 
a test set containing utterances from the 10% of teams withheld 
during training (4 or 5 teams). This process was repeated ten 
times, such that every team appeared in the test set once. To 
compute accuracy metrics, predictions from all ten folds were 
aggregated and a single metric was computed on the full dataset. 
Team-level cross validation yields a better assessment of the 
method’s generalizability to new teams because it ensures each 
model is never trained and evaluated on utterances from the 
same speaker. We used identical cross-validation folds for the 
RF, BERT and BERT-seq models as well as the human and ASR 
transcripts to ensure that differences in performance were not an 
artifact of the folds used. This experiment was repeated for 5 
iterations, and different randomized cross-validation folds were 
used for each iteration. 
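
A team-level split of this kind can be sketched as follows. The function names and the round-robin assignment are assumptions for illustration; scikit-learn's GroupKFold provides a comparable grouped split, and the text does not say which utility was used.

```python
import random


def team_level_folds(team_ids, n_folds=10, seed=0):
    """Partition the unique teams into n_folds disjoint test sets."""
    teams = sorted(set(team_ids))
    random.Random(seed).shuffle(teams)
    # Dealing teams round-robin keeps fold sizes within one of each other.
    return [teams[i::n_folds] for i in range(n_folds)]


def split_indices(team_ids, test_teams):
    """Utterance indices for training and testing, given the held-out teams."""
    held_out = set(test_teams)
    train = [i for i, t in enumerate(team_ids) if t not in held_out]
    test = [i for i, t in enumerate(team_ids) if t in held_out]
    return train, test
```

With 44 teams and ten folds, four folds hold out five teams and six folds hold out four. Changing the seed gives the different randomized folds used across iterations, while reusing one seed keeps folds identical across models and transcript types.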


3. RESULTS 
3.1 ASR Accuracy 


We compared WER in the lab and school subsets in order to 
quantify the speech recognition error that could be attributed to 
noisy school environments, as opposed to other factors such as 
difficulty recognizing children’s speech, whispering or 
mumbling, audio quality, or inevitable ASR mistakes. We used 
the Human Transcript Subset as described in Section 2.3 for this 
comparison. The distributions of WER in the lab and school 
environments are shown in Figure 3. We found that WER was 
much lower in the lab environment than in schools (mean WER 
of .54 and .76, median WER of .50 and .91, respectively), 
indicating that significant ASR error is due to noisy school 
environments. We performed a non-parametric Kruskal-Wallis 
test [40] to statistically compare WER in the lab and school 
samples, and found that they differed significantly (χ²(1) = 62.13, 
p < .001). 


As evident in Figure 3, a large proportion (47%) of the school 
utterances had a WER of 1 (compared to 19% for lab data), 
meaning no words were correctly recognized. However, WER 
was also high in the controlled lab environment, suggesting that 
speech recognition error may in part be attributable to factors 
beyond the complications of noisy school environments. 

Figure 3. Gaussian kernel density estimates of the 
distribution of word error rates in the lab and school 
environments. 


We also investigated the correlation between WER and ASR 
confidence to determine whether the confidence values produced 
by the ASR provided a good estimate of transcript accuracy. We 
found that WER and ASR confidence were significantly 
correlated (Spearman rho = -.74, p < .001). 


3.2 Model Comparison 

Next we compared the performance of our three NLP models 
(RF, BERT, BERT-seq). The models output a probability from 0 
to 1 that an utterance is coded with each CPS skill. Accordingly, 
we report the area under the receiver operating characteristic 
curve (AUROC) for each skill, a common accuracy metric for 
model performance [6] which takes into account the true positive 
and false positive tradeoff across classification thresholds. Mean 
AUROC scores (over the five iterations) for the RF, BERT and 
BERT-seq models, using both human and ASR transcripts are 
reported in Table 2. We also report a chance baseline, created by 
randomly shuffling the labels within each CPS session and 
computing accuracy accordingly. Because shuffling is within 
sessions, the AUROCs for the shuffled models will slightly 
deviate from the 0.5 chance baseline. To determine whether the three 
models’ AUROC scores were significantly different for each CPS 
skill, we used a bootstrap method to statistically compare the 
AUROC values. Since five iterations of this experiment were 
conducted, we selected the model corresponding to the median 
AUROC value across the five iterations (for both human and 
ASR transcripts) on each CPS skill for statistical analysis. We 
performed this analysis in R using the pROC package [56] with 
2,000 bootstrap permutations. Finally, we adjusted the resulting 
p-values using a false discovery rate (FDR) correction [5] to 
account for multiple testing across the seven CPS skills. 
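
The per-skill AUROC and the within-session shuffled baseline can be sketched in plain Python as below. This is an illustration, not the evaluation code used for the reported results: the function names are hypothetical, labels are assumed to be 1 when an utterance is coded with the skill in question and 0 otherwise, and the pairwise AUROC computation is quadratic in the number of utterances, which is acceptable only for small examples.

```python
import random


def auroc(labels, scores):
    """Probability that a random positive outranks a random negative (ties count 0.5)."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative example")
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def shuffle_within_sessions(labels, session_ids, seed=0):
    """Permute labels separately inside each session, keeping session base rates."""
    rng = random.Random(seed)
    by_session = {}
    for i, session in enumerate(session_ids):
        by_session.setdefault(session, []).append(i)
    shuffled = list(labels)
    for indices in by_session.values():
        permuted = [labels[i] for i in indices]
        rng.shuffle(permuted)
        for i, y in zip(indices, permuted):
            shuffled[i] = y
    return shuffled
```

Because shuffling preserves each session's label distribution, a model that has learned session-level base rates can still score above 0.5 against the shuffled labels, which is consistent with the shuffled baseline departing slightly from chance.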


Without exception, BERT-seq yielded the numerically highest 
AUROC scores for all seven CPS skills using both human and 
ASR transcripts, indicating that our method of incorporating 
adjacent utterances improves performance over single utterance 
classifiers. On average, BERT outperformed the RF model on 
both human and ASR transcripts, although there were some 
skills for which the RF AUROC scores were higher. From the 
statistical analysis described above, we found that with ASR 
transcripts BERT-seq had a significant advantage over the other 
two models for most skills (four of seven for BERT, five of seven 
for RF). We also found that there was no significant difference 
between BERT and RF for six of seven skills. 


Table 2. Mean AUROC values (across 5 iterations) of the RF N-gram, BERT, and BERT-seq models on ASR and 
Human transcripts for all CPS skills. 

CPS Skill | ASR RF | ASR BERT | ASR BERT-seq | Human RF | Human BERT | Human BERT-seq | Shuffled 
Sharing Information | 0.711 | 0.745[R] | 0.756[?] | 0.837 | 0.866 | 0.877[?] | 0.540 
Establishing Shared Understanding | 0.713 | 0.724 | 0.740[R,B] | 0.872 | 0.894[R] | 0.907[R,B] | 0.509 
Negotiating | 0.721 | 0.719 | 0.741[?] | 0.896 | 0.901 | 0.916[R,B] | 0.510 
Executing | 0.745 | 0.767 | 0.784[?] | 0.897 | 0.914[R] | 0.926[?] | 0.574 
Maintaining Communication | 0.673 | 0.667 | 0.750[R,B] | 0.849 | 0.853 | 0.901[R,B] | 0.557 
Monitoring | 0.632 | 0.594 | 0.677[R,B] | 0.812 | 0.792 | 0.843[R,B] | 0.513 
Planning | 0.700 | 0.692 | 0.718 | 0.861[B] | 0.818 | 0.872[?] | 0.502 
Micro Avg. | 0.773 | 0.782 | 0.799 | 0.887 | 0.895 | 0.914 | 0.607 

[R] and [B] indicate the AUROC score was significantly higher than the RF and/or BERT models, respectively; [?] marks a single significance marker (R or B) that is illegible in the source. Neither RF nor BERT ever 
outperformed BERT-seq. 

We observed a similar pattern on the human transcripts, where 
BERT-seq significantly outperformed BERT on five of seven 
skills and RF on six of seven skills. Interestingly, on human 
transcripts the advantage of BERT over RF increased, with 
BERT having significantly higher scores on three skills, while 
RF was significantly better on only one. This finding suggests 
that with high quality transcripts which accurately capture the 
content of an utterance, BERT was the better model, whereas 
with noisy ASR transcripts there was no clear difference. 


These results indicate that BERT-seq quantitatively 
outperformed both the traditional BERT and the RF n-gram 
approach for all seven CPS skills, using both the human and 
ASR transcripts. However, the statistical analysis revealed that 
for some CPS skills, this advantage was not statistically 
significant. As BERT-seq was the best model across CPS skills, 
we refer to these results in our comparison of human and ASR 
transcripts, and throughout the rest of this paper. 


3.3 ASR vs. Human Transcripts 

We found that using the ASR transcripts as input, our best model 
(BERT-seq) was able to accurately classify the seven CPS skills, 
yielding a micro-average AUROC score of .799. However, when 
the human transcripts were used, this average increased to .914 
(see Table 2). We compared the human and ASR transcript 
results using the bootstrap method described above, and found 
that the human transcript AUROC scores were significantly 
(FDR corrected p < .05) higher than the ASR transcript scores 
for all seven CPS skills, an unsurprising result given the high 
word error rates in the ASR transcripts. However, we note that 
despite significant loss in performance due to speech recognition 
error, our model easily outperformed a shuffled baseline (micro- 
average AUROC of .607), supporting the hypothesis that CPS 
skills can be automatically predicted from ASR transcripts. 
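
The FDR adjustment applied to the per-skill p-values can be illustrated as follows, assuming the Benjamini-Hochberg step-up procedure; the text cites [5] without naming the procedure, so this choice and the function name are assumptions. The original analysis was carried out in R.

```python
def benjamini_hochberg(p_values):
    """Return Benjamini-Hochberg adjusted p-values in the input order."""
    m = len(p_values)
    # Indices of the p-values sorted from smallest to largest.
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down so adjusted values stay monotone.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, p_values[i] * m / rank)
        adjusted[i] = running_min
    return adjusted
```

A skill is then reported as significantly different when its adjusted p-value falls below .05; with seven skills per comparison, m would be 7.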


3.4 Classification Accuracy in Lab and School Environments 

Next we compared classification accuracy in the lab and school 
environments in order to investigate the extent to which higher 
rates of ASR error in the school subset affected model 
performance. We report AUROC scores for the lab and school 
environments in Table 3. We found that on average, 
classification accuracy was substantially lower in the school 
subset compared to the lab subset (micro-average AUROC of 
.783 and .830, respectively). Further, for every individual skill, 
AUROC scores were quantitatively higher in the lab subset than 
in the school subset, with differences in AUROC values for 
individual skills ranging from .031 (Executing) to .102 
(Negotiating). We again used the bootstrap method to 
statistically compare AUROC scores in the lab and school for 
each skill and found that scores were significantly higher in the 
lab subset for five out of seven CPS skills (see Table 3). 


3.5 Classification Accuracy as a Function of ASR Confidence 


Lastly, we examined the relationship between ASR confidence 
and classification accuracy. As discussed in Section 3.1, the ASR 
confidence is a good proxy for word error rate, as the two values 
are significantly correlated. Therefore, we separated our 8,660 
utterances into ten ASR confidence bins ([0.0 - 0.1), etc.) and 
computed the micro-average AUROC score for each bin. The 
distribution of utterances and corresponding AUROC scores for 
each bin are shown in Figure 4A and 4B, respectively. Figure 4B 
also shows the human transcript AUROC score as a benchmark 
of the accuracy that would be expected under conditions of near- 
perfect speech recognition. The shuffled baseline is also shown 
to visualize improvement over chance. 
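
The binning step can be sketched as below. The function names are hypothetical, and placing a confidence of exactly 1.0 in the last bin is an assumption about how the upper boundary was treated.

```python
def confidence_bin(confidence, n_bins=10):
    """Index of the half-open bin [k / n_bins, (k + 1) / n_bins) containing confidence."""
    if not 0.0 <= confidence <= 1.0:
        raise ValueError("confidence must lie in [0, 1]")
    # A confidence of exactly 1.0 is placed in the last bin.
    return min(int(confidence * n_bins), n_bins - 1)


def group_by_bin(confidences, n_bins=10):
    """List of utterance indices for each confidence bin."""
    bins = [[] for _ in range(n_bins)]
    for i, c in enumerate(confidences):
        bins[confidence_bin(c, n_bins)].append(i)
    return bins
```

The micro-average AUROC is then computed separately over the utterances in each bin. Utterances with an empty ASR transcript have a confidence of 0 and therefore fall in the first bin.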


Table 3. Mean AUROC scores (across 5 iterations) 
for each CPS skill in Lab and School environments. Results 
are from the BERT-seq model using ASR transcripts. Values 
marked with * were significantly higher in the Lab vs. 
School. 

CPS Skill | Lab AUC | Lab Base Rate | School AUC | School Base Rate 
Sharing Information | 0.782* | .25 | 0.743 | .27 
Establishing Shared Understanding | 0.786* | .26 | 0.716 | .25 
Negotiating | 0.807* | .18 | 0.705 | .15 
Executing | 0.804 | .15 | 0.773 | .13 
Maintaining Communication | 0.803* | .03 | 0.717 | .08 
Monitoring | 0.701 | .05 | 0.663 | .07 
Planning | 0.760* | .06 | 0.688 | .04 
Micro Avg. | 0.830 | | 0.783 | 


We found that a large proportion of utterances (20%) fall in the 
[0.0 - 0.1) bin, indicating that the ASR had little to no confidence 
in their content. In fact, nearly all (97%) of the utterances in this 
bin have an empty ASR transcript, meaning no words were 
recognized during the utterance’s segmented time window. In 
many cases, this occurred due to the students whispering or 
mumbling, which the ASR was unable to recognize. Apart from 
this substantial zero inflation, the confidence values appeared 
to be approximately normally distributed around the [0.6 - 0.7) bin. 


We observed a strong correlation between ASR confidence bin 
and classification accuracy (Spearman rho = .94, p < .001). 
Unsurprisingly, we found that for low confidence transcripts (< 
0.3) a substantial gap exists between the ASR transcript AUROC 
score and the benchmark human transcript score (see Figure 4B). 
On these low confidence transcripts, model performance is near 
the shuffled chance baseline. Interestingly, despite many (77%) 
of these low confidence transcripts containing no words, the 
model was still able to outperform the chance baseline by 
learning the distribution of skills among empty transcripts in the 
training data. We found that accuracy increases steadily among 
the medium confidence transcripts (0.3 - 0.7). For high 
confidence transcripts (≥ 0.7), AUROC scores are near (though 
still lower than) the benchmark human transcript values. The 
relationship between ASR confidence and classification accuracy 
indicates that it might be viable to filter out utterances with low 
confidence to improve reliability for downstream applications. 


[Figure 4 panels. x-axis: ASR Confidence (ten bins); y-axis of panel B: AUROC Micro Average (95% CI); reference line: Chance Baseline] 

Figure 4. (A) Distribution of ASR confidence on all 8,660 
utterances. (B) Model accuracy as a function of ASR 
confidence. Micro-average AUROC scores across the 7 CPS 
skills (with 95% CI across 5 iterations) are plotted for 
Human and ASR transcripts. 


4. DISCUSSION 


We investigated the feasibility of using automatic speech 
recognition and natural language processing to automatically 
classify student speech with CPS skills using data collected in 
both lab and real-world school environments. We compared 
performance using imperfect ASR transcripts with human 
transcripts, investigated differences between the lab and school 
environments, and explored three NLP approaches including bag- 
of-n-grams and deep transfer learning. In the rest of this section 
we discuss our main findings, applications of our models, as well 
as limitations and future directions of research. 


4.1 Main Findings 

We found that it is feasible to use ASR to transcribe middle and 
high school students’ speech during CPS in both lab and school 
environments. However, we found that significant speech 
recognition error is introduced when speech is recorded in 
schools (mean WER of .76), likely as a result of noisy 
environments and distractions from other students. That said, 
speech recognition error was also high in the lab environment 
(mean WER of .54), suggesting that there may still be 
fundamental limitations associated with using ASR on children’s 
speech in the context of remote CPS. 


Despite imperfect speech recognition, we demonstrated that it is 
possible to automatically predict CPS skills from student speech 
in a real-world school environment. We built team-independent 
models that were able to predict CPS skills with reasonable 
accuracy (micro-average AUROC of .80) using ASR transcripts. 
Importantly, this result outperformed a shuffled baseline (micro- 
average AUROC of .61) by a significant margin. This finding is 
encouraging because it was previously unknown whether ASR 
could yield transcripts of sufficient quality to model CPS skills in 
noisy environments. Further, we demonstrated that by using 
high-fidelity human transcripts, this accuracy could be 
significantly improved (micro-average AUROC of .91). We 
demonstrated that in the absence of ASR error our NLP models 
were highly accurate, suggesting a useful upper bound of what 
can be achieved from spoken content alone. 


We also improved upon NLP approaches previously used in CPS 
literature, demonstrating the advantage of deep transfer learning 
over standard classifiers for modeling CPS language. We found 
that on average, using both ASR and human transcripts, the deep 
transfer learning model (BERT) achieved slightly better accuracy 
than the Random Forest n-gram model (though the two were 
statistically tied for 3/7 CPS skills with human transcripts and 
6/7 skills with ASR transcripts). This finding was unsurprising 
given that pre-trained language models have achieved state-of- 
the-art performance on many NLP benchmark tasks, including 
text classification. 


Importantly, we found that we were able to further improve 
classification accuracy by constructing an input representation 
that enables BERT to capture information from adjacent 
utterances. This method showed significant improvement over 
the single utterance BERT and RF models, providing preliminary 
evidence of its viability. This finding suggests that in CPS, the 
context of an utterance (what was said before and after) may be 
important for accurate identification of particular CPS skills. 


Finally, we examined the relationship between ASR confidence — 
a proxy for transcription quality — and classification accuracy. 
We found that the two were highly correlated, suggesting that 
downstream applications may be able to improve reliability of 
predictions by filtering out low confidence transcripts. 


4.2 Applications 

A key application of this work is the automatic assessment of 
CPS skills from open-ended speech in classrooms and beyond. 
As previously discussed, analyzing verbal communication for 
evidence of CPS skills is a costly and time-intensive process 
when trained human coders are used. Our findings suggest that 
automated methods using ASR and NLP may provide a viable 
alternative to the human-coding process. These automated 
methods hold great potential in improving the assessment and 
training of CPS skills, a priority of modern education [49]. 
However, given the imperfect accuracy of our models, and 
unanswered questions regarding how this approach may 
generalize to students with differing communication styles or 
cultural and linguistic backgrounds, this approach should be 
limited to formative assessment [63] focused on learning and 
improvement, rather than evaluation. 


Our approach could advance this goal in several ways. For 
example, automatically generated reports could be sent to a 
teacher monitoring many groups of students engaged in CPS, 
informing the teacher of the extent to which each group is 
demonstrating CPS skills. Such a system could help the teacher 
identify which groups need support and allocate their limited 
presence toward assisting those groups. Similarly, these reports 
could be used to identify individual students’ strengths and 
weaknesses, and set appropriate goals for improvement. For 
instance, a student who frequently shares information yet seldom 
engages in negotiation or establishing shared understanding 
could be encouraged to listen to the ideas of their teammates and 
work to build on those ideas together. 


In addition to passive assessment and off-line feedback, this 
approach could be leveraged by next-generation intelligent 
systems that actively monitor ongoing CPS and dynamically 
intervene in real time to yield improved CPS outcomes [15], or 
provide personalized on-line feedback to students. For example, 
a group frequently engaging in off-topic conversation could be 
prompted by the system to focus back on the problem-solving 
task, or a particular student within a group who hasn’t shared 
information could be encouraged to share their ideas with the 
team. The specific intervention strategies, including when to 
intervene, how to present the intervention, and who the 
intervention should be targeted at (whole group vs. individual 
student) await design, testing, and refinement. 


Importantly, a technology devised to assist in the training and 
assessment of CPS does little good if it is confined to the lab. 
Thus, the present results take a step towards the development of 
a system that can support CPS in real-world classrooms by 
monitoring open-ended verbal communication for CPS skills. 


4.3 Limitations 

This work has several limitations. First, although we 
used an automated approach for utterance transcription and CPS 
skill prediction, the sessions were segmented into utterances 
beforehand by human coders. This is a limitation because a fully 
automated pipeline would require the ASR to automatically 
detect and segment recorded speech into individual utterances, 
an already difficult task that may be further complicated by noisy 
school environments or the peculiarities of children’s speech. 
Another related limitation is that due to the utterance 
segmentation and ASR transcription process we used, our ASR 
transcripts contain all speech that was recognized during an 
utterance’s segmented time window. This means that some ASR 
transcripts contain words from both speakers, which introduces 
alignment inconsistencies between the ASR transcript and the 
coded CPS skill because utterances were coded at the individual 
student level. In particular, this introduces noise into the ASR 
transcripts when students’ utterances overlap. 


Another limitation of this work is that we considered only 
linguistic features to predict the coded CPS skills. We expect 
that model performance can be improved by modeling not only 
what students say (language), but also how they say it 
(acoustic-prosodic information) and the context of what 
they are doing (task-specific information). We hypothesize that 
the inclusion of these additional modalities may particularly 
improve performance for low confidence ASR transcripts, where 
the language transcribed by the speech recognizer is either 
missing altogether, or is a poor representation of what was 
actually said. Finally, although we demonstrated that our method 
for capturing contextual information from adjacent utterances 
improved accuracy, we did not compare this with other methods 


for incorporating contextual utterances such as conditional 
random fields or recurrent neural networks. 


4.4 Future Work 


The findings and limitations discussed in this section present 
several possibilities for improvement in future research. First, in 
order to develop a fully automated approach for modeling CPS 
skills, we plan to incorporate automatic utterance segmentation 
and speaker diarization into our ASR pipeline. Further, we plan 
to explore methods for incorporating information from other 
modalities in addition to language. For instance, including 
features such as acoustic-prosodic information, task context, 
facial expression, or body movement may enable more accurate 
prediction of CPS skills in cases where ASR fails to capture the 
content of an utterance. 


Another direction of future research involves further exploration 
of how contextual utterances can be used to improve 
classification accuracy. We demonstrated a method for 
incorporating adjacent utterances in our model input, which 
improved performance over single utterance classifiers. In future 
work, we will explore methods for capturing contextual 
information beyond the previous and subsequent utterances (e.g., 
the five previous utterances). We also plan to investigate how the 
approach demonstrated in this paper, which leverages the 
model’s attention mechanism to capture context, compares with 
other approaches (e.g., recurrent neural networks). 
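
As a rough illustration of widening the context window, the sketch below builds one classifier input per utterance from a configurable number of neighbouring utterances. The function name `build_context_inputs`, the `[SEP]` separator string, and the window parameters are assumptions made for this example, not the implementation used in this work.

```python
from typing import List


def build_context_inputs(
    utterances: List[str],
    n_prev: int = 1,
    n_next: int = 1,
    sep: str = " [SEP] ",
) -> List[str]:
    """Return one input string per utterance, including neighbouring context.

    Hypothetical helper: each target utterance is concatenated with up to
    `n_prev` earlier and `n_next` later utterances from the same session.
    """
    inputs = []
    for i in range(len(utterances)):
        start = max(0, i - n_prev)                   # clip at session start
        end = min(len(utterances), i + n_next + 1)   # clip at session end
        inputs.append(sep.join(utterances[start:end]))
    return inputs


# Made-up session; two previous utterances and no following ones.
session = ["move it left", "okay", "now push the ball", "it worked"]
windows = build_context_inputs(session, n_prev=2, n_next=0)
```

A practical model would also need to mark which segment is the target utterance; this sketch only shows how the window size can be varied.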


In addition to exploring methods for improving the accuracy of 
our models, we plan to investigate the utility of our CPS models. 
An open question is how accurate model predictions need to be 
to provide useful and actionable estimates for assessment, 
feedback, or intervention. Specifically, recent work [2, 25] has 
clustered students using the frequency of CPS skills to derive 
theoretically grounded profiles of collaborative problem solvers 
(e.g., active collaborators, social loafers). We plan to investigate 
whether model-derived estimates of CPS skill frequencies will 
yield high agreement with the clustering produced using human 
codes. 
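
One simple way to quantify agreement between two clusterings of the same students is the Rand index, sketched below. This is only an illustrative choice of metric, and the cluster assignments are made up; the planned analysis may well use a different measure.

```python
from itertools import combinations
from typing import Sequence


def rand_index(labels_a: Sequence[int], labels_b: Sequence[int]) -> float:
    """Fraction of student pairs on which two clusterings agree.

    A pair counts as agreement when both clusterings put the two students
    together, or both put them apart. Cluster label names do not matter.
    """
    if len(labels_a) != len(labels_b):
        raise ValueError("both clusterings must cover the same students")
    pairs = list(combinations(range(len(labels_a)), 2))
    if not pairs:
        return 1.0
    agree = sum(
        (labels_a[i] == labels_a[j]) == (labels_b[i] == labels_b[j])
        for i, j in pairs
    )
    return agree / len(pairs)


# Hypothetical cluster assignments for six students.
human_clusters = [0, 0, 1, 1, 2, 2]   # from human-coded skill frequencies
model_clusters = [1, 1, 0, 0, 2, 0]   # from model-predicted skill frequencies
score = rand_index(human_clusters, model_clusters)
```

Because the index compares pairs rather than label values, it is unaffected by how each clustering happens to number its clusters.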


5. CONCLUSION 


We combined automatic speech recognition and natural language 
processing to automatically predict CPS skills from student 
speech during problem solving in both lab and real-world school 
environments. Our findings suggest that despite significant 
speech recognition error in school environments, it is possible to 
predict expert-coded CPS skills using automatically generated 
transcripts. These findings open many possibilities for next- 
generation technologies that can further the goal of improved 
CPS training, assessment, and support in schools. 


6. ACKNOWLEDGMENTS 

This research was supported by the Institute of Education 
Sciences (IES R305A170432), the NSF National AI Institute for 
Student-AI Teaming (iSAT) (DRL 2019805) and NSF DUE 
1745442/1660877. The opinions expressed are those of the 
authors and do not represent views of the funding agencies. 


64 Proceedings of The 14th International Conference on Educational Data Mining (EDM 2021) 


7. REFERENCES 


[1] Andrews-Todd, J. et al. 2019. Collaborative Problem 
Solving Assessment in an Online Mathematics Task. 
ETS Research Report Series. 2019, 1 (2019). 
DOI:https://doi.org/10.1002/ets2.12260. 

[2] Andrews-Todd, J. et al. 2018. Identifying profiles of 
collaborative problem solvers in an online electronics 
environment. Proceedings of the 11th International 
Conference on Educational Data Mining, EDM 2018 
(2018). 

[3] Andrews-Todd, J. and Forsyth, C.M. 2020. Exploring 
social and cognitive dimensions of collaborative 
problem solving in an open online simulation-based 
task. Computers in Human Behavior. 104, (2020). 
DOI:https://doi.org/10.1016/j.chb.2018.10.025. 

[4] Andrews-Todd, J. and Kerr, D. 2019. Application of 
Ontologies for Assessing Collaborative Problem Solving 
Skills. International Journal of Testing. 19, 2 (2019). 
DOI:https://doi.org/10.1080/15305058.2019.1573823. 

[5] Benjamini, Y. and Hochberg, Y. 1995. Controlling the 
False Discovery Rate: A Practical and Powerful 
Approach to Multiple Testing. Journal of the Royal 
Statistical Society: Series B (Methodological). 57, 1 
(1995). 
DOI:https://doi.org/10.1111/j.2517-6161.1995.tb02031.x. 

[6] Bradley, A.P. 1997. The use of the area under the ROC 
curve in the evaluation of machine learning algorithms. 
Pattern Recognition. 30, 7 (1997). 
DOI:https://doi.org/10.1016/S0031-3203(96)00142-2. 

[7] C. Graesser, A. et al. 2018. Challenges of Assessing 
Collaborative Problem Solving. Care, E., Griffin, P., 
Wilson, M. (Eds.), Assessment and teaching of 21st 
century skills: Research and applications. 75-91. 

[8] Care, E. et al. 2016. Assessment of Collaborative 
Problem Solving in Education Environments. Applied 
Measurement in Education. 29, 4 (2016). 
DOI:https://doi.org/10.1080/08957347.2016.1209204. 

[9] Chen, S. et al. 1998. Evaluation metrics for language 
models. Proceedings of the DARPA Broadcast News 
Transcription and Understanding Workshop. (1998). 

[10] Chopade, P. et al. 2019. CPSX: Using AI-Machine 
Learning for Mapping Human-Human Interaction and 
Measurement of CPS Teamwork Skills. 2019 IEEE 
International Symposium on Technologies for Homeland 
Security, HST 2019 (2019). 

[11] Cicchetti, D. V. 1994. Guidelines, Criteria, and Rules of 
Thumb for Evaluating Normed and Standardized 
Assessment Instruments in Psychology. Psychological 
Assessment. 6, 4 (1994). 
DOI:https://doi.org/10.1037/1040-3590.6.4.284. 

[12] Cohan, A. et al. 2020. Pretrained language models for 
sequential sentence classification. EMNLP-IJCNLP 
2019 - 2019 Conference on Empirical Methods in 
Natural Language Processing and 9th International 
Joint Conference on Natural Language Processing, 
Proceedings of the Conference (2020). 

[13] Cukurova, M. et al. 2020. Modelling collaborative 
problem-solving competence with transparent learning 
analytics: Is video data enough? ACM International 
Conference Proceeding Series (2020). 

[14] Cukurova, M. et al. 2018. The NISPI framework: 
Analysing collaborative problem-solving from students’ 
physical interactions. Computers and Education. 116, 
(2018). 
DOI:https://doi.org/10.1016/j.compedu.2017.08.007. 

[15] D’Mello, S. et al. 2019. Towards dynamic intelligent 
support for collaborative problem solving. CEUR 
Workshop Proceedings (2019). 

[16] von Davier, A.A. et al. 2017. Interdisciplinary research 
agenda in support of assessment of collaborative 
problem solving: lessons learned from developing a 
Collaborative Science Assessment Prototype. Computers 
in Human Behavior. 76, (2017). 
DOI:https://doi.org/10.1016/j.chb.2017.04.059. 

[17] Dedoose version 8.0.35 2018. Dedoose: Web application 
for managing, analyzing, and presenting qualitative and 
mixed method research data. SocioCultural Research 
Consultants, LLC. 

[18] Devlin, J. et al. 2019. BERT: Pre-training of deep 
bidirectional transformers for language understanding. 
NAACL HLT 2019 - 2019 Conference of the North 
American Chapter of the Association for Computational 
Linguistics: Human Language Technologies - 
Proceedings of the Conference (2019). 

[19] Dowell, N.M.M. et al. 2020. Exploring the relationship 
between emergent sociocognitive roles, collaborative 
problem-solving skills, and outcomes: A group 
communication analysis. Journal of Learning Analytics. 
7, 1 (2020). DOI:https://doi.org/10.18608/jla.2020.71.4. 

[20] Dowell, N.M.M. et al. 2019. Group communication 
analysis: A computational linguistics approach for 
detecting sociocognitive roles in multiparty interactions. 
Behavior Research Methods. 51, 3 (2019). 
DOI:https://doi.org/10.3758/s13428-018-1102-z. 

[21] Emara, M. et al. 2021. Examining Student Regulation of 
Collaborative, Computational, Problem-Solving 
Processes in Open-Ended Learning Environments. 
Journal of Learning Analytics. 8, 1 (2021). 
DOI:https://doi.org/10.18608/jla.2021.7230. 

[22] Felstead, A. and Henseke, G. 2017. Assessing the 
growth of remote working and its consequences for 
effort, well-being and work-life balance. New 
Technology, Work and Employment. 32, 3 (2017). 
DOI:https://doi.org/10.1111/ntwe.12097. 

[23] Fiore, S.M. et al. 2018. Collaborative problem-solving 
education for the twenty-first-century workforce. Nature 
Human Behaviour. 

[24] Flor, M. et al. 2016. Automated classification of 
collaborative problem solving interactions in simulated 
science tasks. (2016). 

[25] Forsyth, C. et al. 2020. Are You Really A Team Player? 
Profiling of Collaborative Problem Solvers in an Online 
Environment. Proceedings of the 13th International 
Conference on Educational Data Mining, EDM 2020 
(2020). 

[26] Gerosa, M. et al. 2009. A review of ASR technologies 
for children’s speech. Proceedings of the 2nd Workshop 
on Child, Computer and Interaction, WOCCI ’09 
(2009). 

[27] Graesser, A.C. et al. 2018. Advancing the Science of 
Collaborative Problem Solving. Psychological Science 
in the Public Interest. 19, 2 (2018), 59-92. 
DOI:https://doi.org/10.1177/1529100618808244. 

[28] Griffin, P. et al. 2012. The changing role of education 
and schools. Assessment and teaching of 21st century 
skills. 

[29] Hao, J. et al. 2017. CPS-Rater: Automated Sequential 
Annotation for Conversations in Collaborative Problem- 
Solving Activities. ETS Research Report Series. 2017, 1 
(2017). DOI:https://doi.org/10.1002/ets2.12184. 

[30] Hesse, F. et al. 2015. A Framework for Teachable 
Collaborative Problem Solving Skills. Assessment and 
Teaching of 21st Century Skills. 

[31] Howard, J. and Ruder, S. 2018. Universal language 
model fine-tuning for text classification. ACL 2018 - 
56th Annual Meeting of the Association for 
Computational Linguistics, Proceedings of the 
Conference (Long Papers) (2018). 

[32] Huang, K. et al. 2019. Identifying collaborative learning 
states using unsupervised machine learning on eye- 
tracking, physiological and motion sensor data. EDM 
2019 - Proceedings of the 12th International Conference 
on Educational Data Mining (2019). 

[33] IBM Watson: 
https://www.ibm.com/watson/services/speech-to-text/. 
Accessed: 2021-03-02. 

[34] Kerr, D. et al. 2016. The In-Task Assessment 
Framework for Behavioral Data. The Handbook of 
Cognition and Assessment. 

[35] Kim, Y.J. and Shute, V.J. 2015. Opportunities and 
Challenges in Assessing and Supporting Creativity in 
Video Games. Video Games and Creativity. 

[36] Kniffin, K.M. et al. 2020. COVID-19 and the 
Workplace: Implications, Issues, and Insights for Future 
Research and Action. American Psychologist. (2020). 
DOI:https://doi.org/10.1037/amp0000716. 

[37] Koenig, J.A. 2011. Assessing 21st Century Skills: 
Summary of a Workshop. 

[38] Lai, E. et al. 2017. Skills for today: What We Know 
about Teaching and Assessing Collaboration. 

[39] Liu, L. et al. 2015. A tough nut to crack: Measuring 
collaborative problem solving. Handbook of Research 
on Technology Tools for Real-World Skill Development. 

[40] McKight, P.E. and Najab, J. 2010. Kruskal-Wallis Test. 
The Corsini Encyclopedia of Psychology. 

[41] Mislevy, R.J. et al. 2003. Focus Article: On the 
Structure of Educational Assessments. Measurement: 
Interdisciplinary Research & Perspective. 1, 1 (2003). 
DOI:https://doi.org/10.1207/s15366359mea0101_02. 

[42] Miura, G. and Okada, S. 2019. Task-independent 
multimodal prediction of group performance based on 
product dimensions. ICMI 2019 - Proceedings of the 
2019 International Conference on Multimodal 
Interaction (2019). 

[43] Müller, P. et al. 2018. Detecting low rapport during 
natural interactions in small groups from non-verbal 
behaviour. International Conference on Intelligent User 
Interfaces, Proceedings IUI (2018). 

[44] Murray, G. and Oertel, C. 2018. Predicting group 
performance in task-based interaction. ICMI 2018 - 
Proceedings of the 2018 International Conference on 
Multimodal Interaction (2018). 

[45] Narayanan, S. and Potamianos, A. 2002. Creating 
conversational interfaces for children. IEEE 
Transactions on Speech and Audio Processing. 10, 2 
(2002). DOI:https://doi.org/10.1109/89.985544. 

[46] O’Neil, H.F.. C.G.K.W.K.. B.R.S. 1995. Measurement 
of teamwork processes using computer simulation (CSE 
Tech. Rep. No. 399). 

[47] O’Neil, H.F. et al. 2010. Computer-based feedback for 
computer-based collaborative problem solving. 
Computer-Based Diagnostics and Systematic Analysis 
of Knowledge. 

[48] OECD 2013. PISA 2012 Assessment and Analytical 
Framework: Mathematics, reading, science, problem 
solving and financial literacy. 

[49] OECD 2015. PISA 2015 Collaborative Problem Solving 
Framework. (2015). 

[50] Olsen, J.K. et al. 2020. Temporal analysis of multimodal 
data to predict collaborative learning outcomes. British 
Journal of Educational Technology. 51, 5 (2020). 
DOI:https://doi.org/10.1111/bjet.12982. 

[51] Oviatt, S. and Cohen, A. 2013. Written and multimodal 
representations as predictors of expertise and problem- 
solving success in mathematics. ICMI 2013 - 
Proceedings of the 2013 ACM International Conference 
on Multimodal Interaction (2013). 

[52] Pedregosa, F. et al. 2011. Scikit-learn: Machine learning 
in Python. Journal of Machine Learning Research. 12, 
(2011). 

[53] Potamianos, A. and Narayanan, S. 2003. Robust 
Recognition of Children’s Speech. IEEE Transactions 
on Speech and Audio Processing. 11, 6 (2003). 
DOI:https://doi.org/10.1109/TSA.2003.818026. 

[54] Prata, D.N. et al. 2009. Detecting and understanding the 
impact of cognitive and interpersonal conflict in 
computer supported collaborative learning 
environments. EDM ’09 - Educational Data Mining 
2009: 2nd International Conference on Educational 
Data Mining (2009). 

[55] Reilly, J.M. and Schneider, B. 2019. Predicting the 
quality of collaborative problem solving through 
linguistic analysis of discourse. EDM 2019 - 
Proceedings of the 12th International Conference on 
Educational Data Mining (2019). 

[56] Robin, X. et al. 2011. pROC: An open-source package 
for R and S+ to analyze and compare ROC curves. BMC 
Bioinformatics. 12, (2011). 
DOI:https://doi.org/10.1186/1471-2105-12-77. 

[57] Roschelle, J. and Teasley, S.D. 1995. The Construction 
of Shared Knowledge in Collaborative Problem Solving. 
Computer Supported Collaborative Learning. 

[58] Rosé, C. et al. 2008. Analyzing collaborative learning 
processes automatically: Exploiting the advances of 
computational linguistics in computer-supported 
collaborative learning. International Journal of 
Computer-Supported Collaborative Learning. 3, 3 
(2008). DOI:https://doi.org/10.1007/s11412-007-9034-0. 

[59] Samrose, S. et al. 2018. CoCo: Collaboration Coach for 
Understanding Team Dynamics during Video 
Conferencing. Proceedings of the ACM on Interactive, 
Mobile, Wearable and Ubiquitous Technologies. 1, 4 
(2018). DOI:https://doi.org/10.1145/3161186. 

[60] Schulze, J. and Krumm, S. 2017. The “virtual team 
player”: A review and initial model of knowledge, skills, 
abilities, and other characteristics for virtual 
collaboration. Organizational Psychology Review. 7, 1 
(2017). 
DOI:https://doi.org/10.1177/2041386616675522. 

[61] Schuster, M. and Nakajima, K. 2012. Japanese and 
Korean voice search. ICASSP, IEEE International 
Conference on Acoustics, Speech and Signal Processing 
- Proceedings (2012). 

[62] Shute, V.J. et al. 2013. Assessment and learning of 
qualitative physics in Newton’s playground. Journal of 
Educational Research. 106, 6 (2013). 
DOI:https://doi.org/10.1080/00220671.2013.832970. 

[63] Shute, V.J. 2008. Focus on formative feedback. Review 
of Educational Research. 78, 1 (2008). 
DOI:https://doi.org/10.3102/0034654307313795. 

[64] Spada, H. et al. 2005. A new method to assess the 
quality of collaborative process in CSCL. Computer 
Supported Collaborative Learning 2005: The Next 10 
Years - Proceedings of the International Conference on 
Computer Supported Collaborative Learning 2005, 
CSCL 2005 (2005). 

[65] Stewart, A.E.B. et al. 2019. I say, you say, we say: 
Using spoken language to model socio-cognitive 
processes during computer-supported collaborative 
problem solving. Proceedings of the ACM on Human- 
Computer Interaction. 3, CSCW (2019). 
DOI:https://doi.org/10.1145/3359296. 

[66] Stewart, A.E.B. et al. 2021. Multimodal modeling of 
collaborative problem-solving facets in triads. User 
Modeling and User-Adapted Interaction. (2021). 
DOI:https://doi.org/10.1007/s11257-021-09290-y. 

[67] Subburaj, S.K. et al. 2020. Multimodal, Multiparty 
Modeling of Collaborative Problem Solving 
Performance. ICMI 2020 - Proceedings of the 2020 
International Conference on Multimodal Interaction 
(2020). 

[68] Sun, C. et al. 2020. Towards a generalized competency 
model of collaborative problem solving. Computers and 
Education. 143, (2020). 
DOI:https://doi.org/10.1016/j.compedu.2019.103672. 

[69] Suresh, A. et al. 2021. Using Transformers to Provide 
Teachers with Personalized Feedback on their 
Classroom Discourse: The TalkMoves Application. 
AAAI Spring Symposium Series 2021. 

[70] Vaessen, N. 2019. Word error rate for automatic speech 
recognition. https://pypi.org/project/jiwer/. 

[71] Vaswani, A. et al. 2017. Attention is all you need. 
Advances in Neural Information Processing Systems 
(2017). 

[72] Vrzakova, H. et al. 2020. Focused or stuck together: 
Multimodal patterns reveal triads’ performance in 
collaborative problem solving. ACM International 
Conference Proceeding Series (2020). 

[73] Wolf, T. et al. 2019. Transformers: State-of-the-art 
natural language processing. arXiv. 




Just a Few Expert Constraints Can Help: Humanizing 
Data-Driven Subgoal Detection for Novice Programming 


Samiha Marwan, Yang Shi, Ian Menezes, Min Chi, Tiffany Barnes, Thomas W. Price 
North Carolina State University, Raleigh, NC, USA 
{samarwan, yshi26, ivmeneze, mchi, tmbarnes, twprice}@ncsu.edu 


ABSTRACT 


Feedback on how students progress through completing sub- 
goals can improve students’ learning and motivation in pro- 
gramming. Detecting subgoal completion is a challenging 
task, and most learning environments do so either with expert- 
authored models or with data-driven models. Both models 
have advantages that are complementary — expert models 
encode domain knowledge and achieve reliable detection but 
require extensive authoring efforts and often cannot capture 
all students’ possible solution strategies, while data-driven 
models can be easily scaled but may be less accurate and 
interpretable. In this paper, we take a step towards achiev- 
ing the best of both worlds — utilizing a data-driven model 
that can intelligently detect subgoals in students’ correct 
solutions, while benefiting from human expertise in edit- 
ing these data-driven subgoal rules to provide more accu- 
rate feedback to students. We compared our hybrid “hu- 
manized” subgoal detectors, built from data-driven subgoals 
modified with expert input, against an existing data-driven 
approach and baseline supervised learning models. Our re- 
sults showed that the hybrid model outperformed all other 
models in terms of overall accuracy and F1-score. Our work 
advances the challenging task of automated subgoal detec- 
tion during programming, while laying the groundwork for 
future hybrid expert-authored/data-driven systems. 


Keywords 
Subgoals, Formative feedback, Data-driven hybrid models 


1. INTRODUCTION 


Formative feedback has been shown to be an effective form 
of automated feedback that can improve students’ learning 
and motivation [54, 38, 8, 20]. In programming, immediate 
formative feedback during problem-solving is important be- 
cause some problems require students to find one of many 
correct solutions [16], and novices may be uncertain about 
when they are on the right track [62]. This uncertainty may 
lead some students to give up [43], and can also negatively 
impact student confidence and motivation in computer science 
(CS) [32]. Prior research has shown that immediate 
feedback can address this, reducing novices’ uncertainty and 
improving their confidence, engagement, motivation, and 
learning [42, 8, 33, 21, 38, 39]. 


One effective form of immediate feedback is subgoal feedback 
[40], which indicates students’ progress on specific sub-steps 
of the problem. Feedback on subgoals offers special advan- 
tages because it demonstrates how a student can break down 
a problem into a set of smaller sub-tasks, which is key to 
simplifying the learning process [37, 38], and can promote 
students’ retention in procedural domains [35]. To generate 
such feedback, learning environments need to be able to do 
subgoal detection, which is the process of detecting when a 
student completes a key objective or sub-part of a program- 
ming task (e.g. receiving and validating user input). How- 
ever, subgoal detection during problem-solving is known to 
be extremely challenging because it requires assessing stu- 
dents’ intended problem solving approach rather than their 
program output. In other words, it is difficult to automati- 
cally evaluate whether a student completed a subgoal in the 
middle of problem-solving due to the many possible strategies 
that students can use to solve a problem, even 
when using test cases or autograders. 


Historically, to provide feedback on subgoals, learning envi- 
ronments have used expert-authored models, where human 
experts encode a set of rules to predict solution strategies 
that students might perform to complete a specific subgoal. 
While expert models can generate accurate feedback with 
interpretable explanations, they also require extensive hu- 
man participation particularly for open-ended programming 
tasks, where it becomes unmanageable to capture every pos- 
sible correct solution [59]. More recently, data-driven (DD) 
models, where the model learns rules from historical data, 
have become more prominent. This is because DD 
models reduce the expert-authoring burden and have the 
potential to be easily scaled to more problems and contexts. 
Moreover, DD models learn from multiple students’ solutions, 
which allows them to capture code situations that human 
experts cannot easily perceive, particularly in open-ended 
programming tasks [49, 45]. However, DD models are dependent 
on the quality of the data and may have lower accuracy 
than expert models [59, 48]; as a result, they may sometimes 
provide misleading feedback in practice [48]. Both 
of these models have strengths and weaknesses, and in this 
paper we propose an approach that takes advantage of both. 


We present a hybrid approach that leverages both a data-driven 
model and expert insights to detect subgoal completion 
during problem solving in block-based programs. Our 
hybrid model is based on three main steps. First, we used 
an unsupervised data-driven model to generate subgoal detectors 
for a programming task, and represented them as a 
set of human-interpretable and human-editable rules. Second, 
this representation allowed us to evaluate the accuracy 
of the subgoal detectors, particularly where they produce 
inaccurate detections. Third, we used human expert insights to 
refine and fix rules that led to inaccurate subgoal detections. 
These three steps resulted in a hybrid data-driven approach 
that generates subgoal detectors with high accuracy, that 
targets expert-authoring effort only where improvements are 
needed, and that can also be easily scaled to various problems 
and contexts. 
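
To make the idea of human-editable rules concrete, the sketch below represents a subgoal detector as a conjunction of clauses over code features and shows an expert relaxing one clause. The rule format, the feature names, and the "draw a square" subgoal are all hypothetical; they are not the representation produced by the model in this paper.

```python
from typing import FrozenSet, List, Set

# Assumed rule format: a rule is a conjunction of clauses, and each clause
# is satisfied when the program contains at least one of its features.
Rule = List[FrozenSet[str]]


def detects(rule: Rule, program_features: Set[str]) -> bool:
    """Return True when every clause of the rule matches the program."""
    return all(clause & program_features for clause in rule)


# Hypothetical data-driven rule for a "draw a square" subgoal.
data_driven_rule: Rule = [
    frozenset({"repeat"}),
    frozenset({"move"}),
    frozenset({"turn_right"}),
]

# Hypothetical expert edit: accept turning in either direction.
expert_rule: Rule = [
    frozenset({"repeat"}),
    frozenset({"move"}),
    frozenset({"turn_right", "turn_left"}),
]

student_program = {"repeat", "move", "turn_left", "pen_down"}
before = detects(data_driven_rule, student_program)  # misses this strategy
after = detects(expert_rule, student_program)        # detected after the edit
```

The point of the sketch is that a single targeted edit to one clause changes the detection for a whole family of student strategies, without retraining anything.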


We evaluated our hybrid data-driven model on a block-based 
programming problem from an introductory CS classroom 
against the same fully data-driven (DD) model without 
expert intervention. We evaluated the accuracy of both 
models by comparing their subgoal detections on a given 
programming task to those of human experts, and we 
hypothesized that our hybrid model would surpass the accuracy 
achieved by the DD model. We found that the expert 
evaluations of subgoal detections achieved significantly higher 
agreement with our hybrid model than that achieved with 
the DD model. We also found that our hybrid model out- 
performs the state-of-the-art supervised models: code2vec 
[53], Support Vector Machine (SVM) [14], and XGBoost [9]. 
In addition, we present case studies of how the hybrid model 
led to differing subgoal detections in student programs com- 
pared to the DD model. We also discuss how we can close 
the loop by applying our hybrid model in block-based pro- 
gramming classrooms to provide students with immediate 
feedback on subgoals. 
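
The comparison against expert labels can be pictured with the small sketch below, which computes accuracy and F1-score for one subgoal from paired expert and model detections. The helper name and the example labels are illustrative assumptions, not data or code from this study.

```python
from typing import Sequence, Tuple


def accuracy_and_f1(
    expert: Sequence[bool], predicted: Sequence[bool]
) -> Tuple[float, float]:
    """Compare detector output with expert labels for a single subgoal."""
    if len(expert) != len(predicted) or not expert:
        raise ValueError("need two non-empty label lists of equal length")
    pairs = list(zip(expert, predicted))
    tp = sum(e and p for e, p in pairs)          # detected and truly complete
    fp = sum((not e) and p for e, p in pairs)    # detected but not complete
    fn = sum(e and (not p) for e, p in pairs)    # complete but missed
    accuracy = sum(e == p for e, p in pairs) / len(pairs)
    denom = 2 * tp + fp + fn
    f1 = 2 * tp / denom if denom else 0.0
    return accuracy, f1


# Made-up labels for five program snapshots.
expert_labels = [True, True, False, True, False]
model_labels = [True, False, False, True, True]
acc, f1 = accuracy_and_f1(expert_labels, model_labels)
```

Reporting F1 alongside accuracy matters when a subgoal is completed in only a small share of snapshots, because accuracy alone can look high for a detector that rarely fires.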


In summary, in this paper we investigate this research ques- 
tion: RQ: How well does a hybrid data-driven model com- 
bined with expert insights perform compared to: 1) a data- 
driven model without expert augmentation and 2) baseline 
supervised learning approaches that leverage expert subgoal 
labels? Our work provides the following contributions: (1) 
we present a hybrid subgoal detection approach which com- 
bines an unsupervised data-driven model with domain ex- 
pertise to achieve the benefits of both data-driven models, 
and expert models, and (2) we demonstrate how our hybrid 
approach advances the state of the art in subgoal detection 
in open-ended programming tasks over supervised and un- 
supervised baselines, with an accuracy range of 0.80 - 0.92. 


2. LITERATURE REVIEW 


In this work, we investigate the challenge of automatically 
detecting subgoals effectively. We propose a method that 
involves a hybrid approach where human experts modify 
data-driven models to build effective subgoal detectors. In 
the following, we first review prior work on immediate 
feedback with a focus on subgoal feedback. Then, we review 
prior work that involves merging machine and human ex- 
pert intelligence to improve performance of machine learned 
models. Finally, we review both state-of-the-art supervised 
learning models and an unsupervised data-driven model that 
we used for subgoal detection in a programming task. 


2.1 Feedback and Subgoal Detection 


Formative feedback is defined as a type of task-level feed- 
back that provides specific, timely information to a student 
in response to a particular problem, based on the student’s 
current ability [54]. In a review of effective formative feed- 
back in education, Shute shows that immediate formative 
feedback is effective because it can improve students’ learning 
[11, 38] and retention [54, 44], particularly in procedural 
skills such as programming [54]. Most intelligent tutoring 
systems provide immediate feedback through identifying er- 
rors (e.g. error detectors [5, 55], anomaly detectors [31], or 
misconception feedback [25, 24]); however far less work has 
been devoted to providing feedback on students’ subgoals. 
Automated assessment systems (i.e. autograders) can pro- 
vide feedback on correct subgoals by showing the passing 
test cases using expert-authored models [7, 28, 26, 38]. For 
example, most autograders use instructor test cases to check 
for correct program output; however they require students 
to submit an almost-complete program to obtain feedback 
[7, 29]. As a result, this feedback is often not available in the 
early stage of programming, when timely feedback on subgoals 
is most needed. To provide timely subgoal feedback, 
prior research has taken two separate approaches: 
expert-authored and data-driven. 


Expert-authored Approaches: Prior work has explored students’ 
completion of subgoals by diagnosing students’ solutions 
against expert models (e.g. constraint-based models 
[42]), even when a student has incomplete submissions, 
to provide feedback on whether they are on track [27], or 
whether they completed key objectives of short program- 
ming tasks [38]. However, these systems often require ex- 
tensive expert effort to create rules. To address this author- 
ing burden, example-tracing tutoring systems infer tutoring 
rules based on examples of potential student behaviors. This 
still requires an author with some domain expertise, but 
it allows rules to be constructed by non-programmers who 
have domain expertise [2]. An expert can create different example 
solutions to capture different solution strategies and 
augment them with hints or feedback. Example-tracing tutors 
have been developed in multiple non-programming domains 
like genetics [12], mathematics [1], and applied machine 
learning, and they have been shown to improve the 
problem-solving process and student learning [2]. Despite 
the accuracy of expert models in providing feedback on test 
cases or correct features, which can be equivalent to sub- 
goals, it is unclear how feasible they are in domains with 
vast solution spaces and open-ended problems, such as in 
programming tasks [59, 39, 25]. 


Data-driven Approaches: Data-driven approaches refer to 
systematically collecting and analyzing various types of ed- 
ucational data, to guide a range of decisions to help improve 
the success of students [15, 50]. Data-driven models largely 
avoid the need for expert authoring altogether by using prior 
students’ correct solutions, instead of expert rules or instruc- 
tor solutions, to learn patterns of correct solutions. This en- 
ables automated assessment feedback on student code [21]. 
Many data-driven models work by executing a comparison 
function that calculates the distance between students’ code 
and all the possible correct solutions, and then compares 
the path of the closest solution with that of the student 
[47, 49, 59, 39]. While most data-driven methods have 
been used to generate fine-grained feedback, such as hints 
in the Hint Factory [58], little work has used these methods 
to generate subgoal feedback. For example, Diana et 
al. developed a model that searches for meaningful code 
chunks in student code to generate data-driven rubric criteria 
to automatically assess students’ code [19]. Diana et al. 
show that a data-driven model can produce assessments that 
agree with those of experts [19]. In the iList tutor, Fossati et al. used 
a data-driven model to provide feedback on correct steps, 
where they assess a student’s code edit as good if it improves 
the student’s probability of reaching the correct solution, or 
uncertain if the student spent more time than that taken by 
prior students at this point [21]. Fossati et al. found that 
this feedback was well perceived by students and improved 
their learning [21]. However, data-driven models are depen- 
dent on the similarity of the current student’s approach to 
prior student submissions, making it difficult to control the 
quality of their feedback [39, 51, 59, 46]. In a case study 
paper, Shabrina et al. discuss the practical implications of 
data-driven feedback on subgoals, showing that the quality 
of feedback, even positive feedback, is important, since 
inaccurate feedback can cause students to spend more time on 
a task even after they were done [51]. Because of such chal- 
lenges and perhaps other reasons such as the inability for 
individual instructors to augment autograder feedback, few 
tools have been built to provide immediate feedback to stu- 
dents on whether they have achieved subgoals during their 
programming tasks [34]. 


2.2 Integrating Expert Knowledge into Models

In recent years, combining machine and human intelligence 
has been extensively explored in a wide range of domains in- 
cluding artificial intelligence and software engineering. For 
example, Chou et al. developed a virtual teaching assistant
(VTA) that uses teachers’ answers as human intelligence,
and machine intelligence that uses those answers to locate
student errors and generate hints [10]. Chou et al. found
that these mechanisms reduce teacher tutoring load and
reduce the complexity of developing machine intelligence [10].
In the software engineering domain, there is an emerging ap- 
proach called collective intelligence that merges the wisdom 
of multiple developers with program synthesis algorithms, 
which has been shown to significantly improve the efficiency 
and accuracy of program synthesis [61]. 


Human-in-the-loop methods are another effective trend that 
use human intelligence to improve the efficiency of Machine 
Learning models, while the model is learning [23, 56, 30, 
63]. For example, Goecks introduces a theoretical founda- 
tion that incorporates human input modalities, like demon- 
stration or evaluation, to leverage the strengths and mitigate 
the weaknesses of reinforcement learning (RL) algorithms 
[23]. Their results show that using human-in-the-loop methods
increases the learning rate of RL models, with greater
sample efficiency, in real time [23]. Our work also uses human
intelligence to improve the accuracy of a machine learning 
model; however, it does so after the model is trained. 


In the educational domain, expert knowledge is widely ap- 
plied to augment data-driven and machine-learned models 
for problem solving and feedback. For example, in a logic tutor
that provides data-driven hints using students’ solutions,
Stamper et al. used an initial small amount of sample data
generated by human experts to enhance the automatic de- 
livery of hints [57]. Moreover, example-tracing tutors allow 
experts to specify moderately-branching solutions for open- 
ended problems, allowing some intelligent tutors originally 
implemented using complex expert systems to be almost 
completely replicated to support practical learning needs [2]. 


2.3 Supervised Learning Models of Code Analysis

Supervised learning algorithms leverage labels created by 
human experts, to guide the model search process. With 
available labels, automated learning algorithms can be ap- 
plied to the subgoal detection tasks for programming data. 
As shown in [6], one baseline is to extract term
frequency-inverse document frequency (TF-IDF) features and use
traditional machine learning algorithms such as support vector
machines (SVM) [14] and XGBoost [9]. However, as word- or
token-based features such as TF-IDF lose important struc- 
tural information from programming data [3], recent work 
uses structural representations from code and a more com- 
plex model structure to learn more complex features. For 
example, Shi et al. applied a code2vec [4] model to detect 
the completion of rubrics on student programming data [53]. 
In this work, we compared our hybrid data-driven model to 
these existing supervised learning baseline models to check 
our improvement on the subgoal detection task. 
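
As a rough illustration of the token-based baseline features, the sketch below computes TF-IDF vectors for tokenized code snapshots in plain Python. The block names used as tokens and the exact weighting (raw term count times log(N/df)) are assumptions made for illustration; the baselines in [6] may use a smoothed or normalized variant.

```python
import math
from collections import Counter

def tfidf_vectors(docs):
    """Return (vocabulary, vectors) for a list of token lists.

    Weighting assumed here: tf(t, d) * log(N / df(t)), with raw counts for tf.
    """
    n_docs = len(docs)
    vocab = sorted({tok for doc in docs for tok in doc})
    # Document frequency: number of snapshots that contain each token.
    df = Counter(tok for doc in docs for tok in set(doc))
    vectors = []
    for doc in docs:
        tf = Counter(doc)
        vectors.append([tf[t] * math.log(n_docs / df[t]) for t in vocab])
    return vocab, vectors

# Hypothetical tokenized snapshots (block names are illustrative only).
snapshots = [
    ["receiveGo", "repeat", "move", "turnLeft"],
    ["receiveGo", "repeat", "move", "turnRight"],
    ["receiveGo", "penDown"],
]
vocab, vectors = tfidf_vectors(snapshots)
```

Because each snapshot is reduced to a bag of tokens, two programs with the same blocks in a different nesting or order receive the same vector, which is the loss of structural information noted above.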


2.4 Data-Driven Subgoal Detection Model 


Among the various data-driven models for detecting sub- 
goals, or rubric items [39, 18, 19], we built our proposed hy- 
brid model on top of an unsupervised data-driven subgoal 
detection (DD) algorithm, presented in [64]. We applied 
this algorithm on a programming task called Squiral (de- 
scribed in Section 3) by running the following steps. First, 
the algorithm extracts prior student solutions in Squiral as 
abstract syntax trees (ASTs) [49, 47]. Then, it applies hier- 
archical code clustering for feature extraction and selection 
by: (1) extracting common code shapes, which are common 
subtrees, in ASTs of correct students’ solutions (Figure 2 
shows examples of code shapes); (2) filtering redundant code 
shapes; (3) identifying decision shapes, which consist of a 
disjunction of code shapes (i.e. $C_1 \lor \dots \lor C_n$) that are
usually mutually-exclusive (e.g. a program uses a loop, or 
a repeated set of commands, but not both), and (4) hierar- 
chically clustering frequently co-occurring code shapes into 
subgoals. In [64], the authors found a meaningful Cohen’s 
Kappa (0.45) in agreement of the algorithm and expert sub- 
goal detection on student data, suggesting that DD subgoals 
could be used to generate feedback. However, since the DD 
subgoals are typically represented in a regular-expression- 
like form, labels are needed to make them meaningful and 
usable for students in programming environments. 
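
To make step (1) more concrete, the following minimal sketch finds subtrees shared across correct solutions. It is a simplification under stated assumptions: ASTs are written as nested tuples, only complete subtrees are counted, and the `min_support` threshold is a hypothetical parameter; the code shapes in [64] may be defined and filtered differently.

```python
from collections import Counter

def subtrees(tree):
    """Yield every complete subtree of an AST given as (label, child, ...)."""
    yield tree
    for child in tree[1:]:
        yield from subtrees(child)

def common_shapes(solutions, min_support=0.5):
    """Return subtrees that occur in at least `min_support` of the solutions."""
    counts = Counter()
    for ast in solutions:
        # Use a set so that each subtree counts once per solution.
        counts.update(set(subtrees(ast)))
    threshold = min_support * len(solutions)
    return {shape for shape, c in counts.items() if c >= threshold}

# Hypothetical, heavily simplified ASTs of two correct solutions.
sol_a = ("script", ("repeat", ("move",), ("turnLeft",)), ("penDown",))
sol_b = ("script", ("penDown",), ("repeat", ("move",), ("turnLeft",)))

shapes = common_shapes([sol_a, sol_b], min_support=1.0)
```

In this toy example the two whole scripts differ in block order, so neither root is shared, while the `repeat` subtree and the individual blocks are.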


3. METHOD 


This work presents and evaluates a hybrid data-driven model 
for generating and detecting subgoals in a block-based pro- 
gramming exercise (explained in Section 3.1). To evaluate 
our hybrid data-driven approach, we applied our model on 
student data collected from an Introduction to Computing 
course, and we asked human experts to evaluate the accuracy
of its subgoal detections (explained in Section 3.2.2).

Figure 1: One possible solution for Squiral with line numbers
on the left, and the script’s output on the right. 


We use this dataset to provide examples of how our ap- 
proach works, but we also discuss how it can be generalized. 
We compared our hybrid model against its underlying data- 
driven (DD) model described above in Section 2.4, as well as 
state-of-the-art supervised learning models (explained below 
in Section 4.1). 


3.1 Dataset 


Our data is collected from a CS0 course for non-majors in
a public university in the United States that uses Snap!, a
block-based programming environment [22]. This programming
environment logs all student actions while programming
(e.g. adding, deleting or running blocks of code) with
the time taken for each step. These student logs (i.e. actions)
can also be replayed as a trace of code snapshots of
all students’ edits, allowing us to detect the time and the
code snapshot where a subgoal is completed during the student
problem-solving process.


In this work, we collected students’ logs from one homework 
exercise named Squiral, derived from the BJC curriculum 
[22]. Squiral requires a visual output, where students are 
asked to write a procedure that takes one input ‘r’ to draw 
a square-like spiral with r rotations. Figure 1 shows a pos- 
sible correct solution of Squiral that requires procedures, 
variables, and loops. We collected a training dataset from 3
semesters: Spring 2016 (S16), Fall 2016 (F16), and Spring
2017 (S17), which includes data from 174 students, with a
total of 29,574 logged student actions.


3.2 Hybrid Data-Driven Subgoal Detection 
Our hybrid data-driven model has two main components.
First, the DD model is used to generate data-driven sub- 
goals. Second, expert-authored constraints are added to 
improve the quality of these subgoals and the accuracy of 
their detection. We implemented our hybrid approach by 
conducting the following 3 high-level steps: 


1. We used the DD model to generate an initial set of sub- 
goals, consisting of clusters of code shapes. We then 
presented these clusters to experts in an interpretable 
form, who combined them to create a concrete set of 
meaningful subgoal labels. 


2. We integrated the DD subgoal detection model into the
students’ programming environment, allowing students
to see when the DD algorithm detected completion of
each subgoal. We then collected student programming
log data along with DD detections, and asked experts
to evaluate the accuracy of the DD detections.


3. We used human expert insights to fix code shapes that
led to false subgoal detections; and then combined
them again to create a modified set of hybrid subgoals
and evaluated its new accuracy.


Figure 2: Three code shapes A, B, C, in both the data-driven
and hybrid models. Each code shape represents a false
detection and its fix by human experts. Red dashed nodes are
removed and green bold nodes are added.


3.2.1 Step 1: Interpreting and Editing Data-Driven 
Subgoals 


The goal of this step is to generate data-driven subgoals 
using the DD model and present them in an interpretable 
and editable form. We applied the DD model (described 
in Section 2.4) on S16, F16, and S17 student datasets to 
generate a number of clustered code shapes. Column 1 in 
Table 1 shows the description of 7 subgoals corresponding 
to code-shape clusters generated from correct solutions (n 
= 52). We evaluated each cluster by displaying its code 
shapes separately and interpreted their code behavior. For 
example, code shape A in Figure 2, on the left, represents a 
decision shape that requires the student to use the ‘ReceiveGo’
block (i.e. the hat block in Figure 1, line 1, which is used to 
start a script) in their main script, or to evaluate a procedure 
with one parameter, which is done by creating and snapping 
a procedure in the main script (as shown in Figure 1, line 3). 
We treated each cluster as a subgoal, and for a subgoal to be 
detected, the DD model requires all of its code shapes, and 
exactly one component of its decision shapes, to be present 
in student code. 
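
The detection rule stated above can be sketched as a small predicate. This is a hedged illustration rather than the DD implementation: the function name, the representation of shapes as strings, and the example shape names are all hypothetical.

```python
def subgoal_detected(present, code_shapes, decision_shapes):
    """Apply the detection rule to the set of shapes found in student code.

    present:         set of shape names detected in the student's code
    code_shapes:     shapes that must all be present
    decision_shapes: list of alternative groups; exactly one member of
                     each group must be present
    """
    if not all(shape in present for shape in code_shapes):
        return False
    return all(
        sum(option in present for option in group) == 1
        for group in decision_shapes
    )

# Hypothetical subgoal: two required shapes plus one decision shape
# (a loop or a repeated set of commands, but not both).
required = ["shape_x", "shape_y"]
decisions = [["loop", "repeated_commands"]]

ok = subgoal_detected({"shape_x", "shape_y", "loop"}, required, decisions)
both = subgoal_detected(
    {"shape_x", "shape_y", "loop", "repeated_commands"}, required, decisions
)
```

Here `ok` is true, while `both` is false because two components of the same decision shape are present.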


While the data-driven clusters can represent appropriate 
subgoals, we combined some of them to create a shorter 
list of higher-level subgoals similar to the programming task 
rubric. Column 2 in Table 1 shows the combined subgoals.
It is worth noting that we also took the insights of two
instructors of the CS0 course on how meaningful these subgoals
are for students to understand. Additionally, because
one of our goals is to use these data-driven subgoals in learning
environments, we asked human experts to label them to
make them easily understandable by students and instructors.
For example, the human label of subgoal 1 is: “Create
a procedure with one parameter, and use it in your code”, as 
shown in Table 1. We then developed a data-driven subgoal 
detector that takes student code as an input, and outputs 
the status of each subgoal. For example, if the input is code 
C and the output is {1, 0, 0, 1}, this means the subgoal 
detector detects the completion of subgoals 1 and 4, and the 
absence of subgoals 2 and 3 in C. 
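
The status encoding can be illustrated with a short sketch. The per-subgoal detectors and feature names below are hypothetical stand-ins, not the code shapes used by the DD model; only the 0/1 output format follows the description above.

```python
def subgoal_status(code_features, detectors):
    """Return a 0/1 status per subgoal, in subgoal order."""
    return [1 if detect(code_features) else 0 for detect in detectors]

def completed_subgoals(status):
    """Translate a status vector into 1-based subgoal numbers."""
    return [i + 1 for i, flag in enumerate(status) if flag == 1]

# Hypothetical detectors keyed on illustrative feature names.
detectors = [
    lambda f: "procedure_used" in f,    # subgoal 1
    lambda f: "repeat_times_4r" in f,   # subgoal 2
    lambda f: "move_by_variable" in f,  # subgoal 3
    lambda f: "change_variable" in f,   # subgoal 4
]

status = subgoal_status({"procedure_used", "change_variable"}, detectors)
done = completed_subgoals(status)
```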


3.2.2 Step 2: Investigating Data-Driven Subgoals 
The goal of this step is to investigate the correctness of the 
DD subgoal detections. In Spring 2020 (S20), we integrated 
the DD subgoal detector into the Snap! programming en- 
vironment for the Squiral exercise to provide students with 
subgoal feedback [39]. In Snap!, students could see the subgoal
labels (shown in Table 1), colored gray to start, green
when the subgoal was detected, or red when it became broken.
We collected 4,480 edit logs from 27 student submissions
with an average of 166 edits per student in S20. For
each student edit, we recorded the DD subgoal detection 
state (e.g. {1, 0, 0, 0} for subgoal 1 being complete). We 
asked human experts to manually replay each student’s trace 
data and evaluate whether the subgoal labels, as shown to 
students, were achieved at the specific timestamps when the 
DD algorithm detected them. Importantly, we asked ex- 
perts to report when the expert-authored labels for each 
subgoal (which students saw) were achieved. Since these 
labels do not precisely match the code shape combinations 
that the DD subgoal detector used, it was very possible for 
the DD model to be “wrong.” In other words, we asked 
experts to determine when each student achieved “Create 
a Squiral procedure with one parameter and use it in your 
code,” and compared that to when the DD detector marked 
this subgoal label as complete. 


We classified each evaluated instance as either Early, OnTime,
or Late. An instance is classified as Early if the DD
detection is before the human expert detection timestamp,
OnTime if they coincide, and otherwise Late. For example, if
for student $S_j$, the human expert detected the completion of
subgoal $i$ ($SG_i$) at time $t = 5$, while the algorithm detected
it at $t = 2$, then we label $SG_i$ for $S_j$ as an Early detection.
Then we sorted students in descending order based on the
percentage of false detections in their log data, and we took
the first 66% of this data (≈ 18 students, as a data sample)
to investigate the reasons for false detections. We did not 
use the full set of false detections, since our primary goal was 
to fix the most common mismatches, without overfitting to 
the dataset. 
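
The classification and sampling described above can be sketched as follows. The function names are hypothetical, and rounding 66% of the students to the nearest whole student is an assumption consistent with the reported sample of about 18 out of 27.

```python
def classify_detection(model_time, expert_time):
    """Label a model detection relative to the expert's timestamp."""
    if model_time < expert_time:
        return "Early"
    if model_time == expert_time:
        return "OnTime"
    return "Late"

def most_error_prone(false_rate_by_student, fraction=0.66):
    """Return the students with the highest false-detection rates."""
    ranked = sorted(
        false_rate_by_student, key=false_rate_by_student.get, reverse=True
    )
    keep = round(fraction * len(ranked))
    return ranked[:keep]

# The example from the text: expert at t = 5, algorithm at t = 2.
label = classify_detection(model_time=2, expert_time=5)

# Hypothetical false-detection rates for 27 students.
rates = {f"s{i}": i / 27 for i in range(27)}
sample = most_error_prone(rates)
```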


We then focus on false detections that occurred due to new,
correct solutions in the S20 dataset that had no matching
code shapes in the training dataset (S16, F16, and S17
datasets). We do not investigate expected false detections
that occurred due to known limitations in the DD algorithm
(e.g. the DD algorithm does not differentiate between variable
names).


We found three reasons for false detections for subgoals 1, 
2, and 4. Inspired by the design of the constraint-based
SqlTutor by Mitrovic et al. [41], we introduce 3 fixes (or
constraints) to resolve them. 


Subgoal 1 false detection. As shown in Table 1, subgoal 1 la- 
bel requires a student to create a procedure with one param- 
eter, and use it (or evaluate it) in the main script. However, 
subgoal 1 code shapes consist of the creation and evaluation 
of a procedure, or the use of a ‘ReceiveGo’ block (the hat 
block used to start a script). This means that whether a 
student created and evaluated a procedure, or added a ‘Re- 
ceiveGo’ block in the main script, the DD model will detect 
the completion of that subgoal, but experts did not inter- 
pret the ‘ReceiveGo’ block as meeting this subgoal, yielding 
a false detection. To fix this false detection, we simply re- 
moved the ‘ReceiveGo’ block as an option for this subgoal. 
Code shape A in Figure 2 shows the code shapes of subgoal 
1 of the DD model (on the left), and how it is fixed in the 
hybrid model (on the right). 


Subgoal 2 false detection. As shown in line 6 in Figure 1, 
subgoal 2 requires a student to use a ‘repeat’ block that iterates
4 times the number of rotations to draw a Squiral with
the correct number of sides. While code shapes of this sub- 
goal satisfy this definition, they also include a code shape of 
adding a ‘pen down’ block, which is necessary to draw, but 
only inside a procedure. Therefore, if a ‘pen down’ block 
is used outside of a procedure, subgoal 2 will not be de- 
tected. To fix this false detection, we added a disjunction 
code shape to detect the presence of ‘pen down’ inside or 
outside a procedure, as shown in code shape B in Figure 2. 


Subgoal 4 false detection. As shown in lines 6-9 in Figure 1, 
subgoal 4 requires the use of ‘move’, ‘turn’, and ‘change -
by -’ blocks (the latter increments the value of a variable), inside
a ‘repeat’ block. We found that code shapes of subgoal 4
only include the ‘turnLeft’ block; however, if the student 
solution contains a ‘turnRight’ block (which does the same 
‘turn’ functionality but from a different direction), the sub- 
goal will not be detected. To fix this false detection, we 
modified all the code shapes that require the use of ‘turn- 
Left’ block to accept either ‘turnRight’ or ‘turnLeft’ 
blocks, as shown in code shape C in Figure 2. 
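
The three fixes amount to either removing an accepted alternative (subgoal 1) or adding one (subgoals 2 and 4). A minimal sketch of that idea, assuming each requirement is modeled as a set of accepted shape names, is shown below; the helper names and shape names are hypothetical and do not reflect the authors' editing tool.

```python
def satisfied(requirements, present):
    """True if every requirement has at least one accepted shape present."""
    return all(bool(alternatives & present) for alternatives in requirements)

def widen(requirements, old, new):
    """Copy the requirements so any one accepting `old` also accepts `new`."""
    return [alts | {new} if old in alts else set(alts) for alts in requirements]

def narrow(requirements, option):
    """Copy the requirements with `option` removed as an alternative."""
    return [alts - {option} for alts in requirements]

# Hypothetical shape names; each requirement is a set of alternatives.
dd_subgoal1 = [{"create_and_evaluate_procedure", "receiveGo"}]
dd_subgoal4 = [{"move_in_repeat"}, {"turnLeft_in_repeat"}, {"change_in_repeat"}]

# Fix for subgoal 1: 'ReceiveGo' no longer counts.
hybrid_subgoal1 = narrow(dd_subgoal1, "receiveGo")
# Fix for subgoal 4: accept either turn direction.
hybrid_subgoal4 = widen(dd_subgoal4, "turnLeft_in_repeat", "turnRight_in_repeat")

student4 = {"move_in_repeat", "turnRight_in_repeat", "change_in_repeat"}
```

The same `widen` call would express the subgoal 2 fix by accepting ‘pen down’ outside a procedure alongside ‘pen down’ inside one.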


These three false detections show that prior solutions in the S16,
F16, and S17 datasets often used ‘ReceiveGo’ and ‘turnLeft’
blocks, and used ‘pen down’ inside a procedure; but
this was not always the case in the S20 data. This shows that
investigating the accuracy of a model, either data-driven or 
expert, is necessary since it is impossible to predict how stu- 
dents will behave in practice or how the data will change 
from one class to another. 


3.2.3 Step 3: Improving the Data-Driven Subgoals
with Human Insights
The goal of this step is to apply the human expert con- 
straints (explained in Step 2) to mitigate the false detections 
of the DD algorithm. To do so, we developed a tool that 
iterates over each code shape of the data-driven subgoals, 
and allows humans to edit code shapes (i.e. add, delete or 
modify) to apply the three constraints (i.e. fixes) explained 
in Step 2. Because this tool modifies the code shapes, we 
then use the original DD algorithm to re-cluster all code 
shapes to ensure that the most similar code shapes remain
clustered together. Figure 2 shows an example of three code
shapes before and after they have been edited by human
experts to fix the false detections that existed for subgoals 1, 2,
and 4, respectively.


Table 1: Data-driven subgoals (composed of code shape clusters) with their corresponding human labels that were used when
presented to students.

Data-Driven Code Shape Clusters | Combined Clusters | Subgoal Human Label
C1: Evaluate a procedure with one parameter on the script area OR Add a ‘ReceiveGo’ block. C2: Create a procedure with one parameter. | C1 + C2 = Subgoal 1 | Create a Squiral procedure with one parameter and use it in your code.
C3: Have a ‘multiply’ block with a variable in a ‘repeat’ block OR two nested ‘repeat’ blocks. C4: Have a ‘pen down’ block inside a procedure. | C3 + C4 = Subgoal 2 | The Squiral procedure rotates the correct number of times.
C5: Add a variable in a ‘move’ block inside a ‘repeat’ block. | C5 = Subgoal 3 | The length of each side of the Squiral is based on a variable.
C6: Have a ‘move’ and ‘turnLeft’ block inside a ‘repeat’ block. C7: ‘Change’ a variable value inside a ‘repeat’ block. | C6 + C7 = Subgoal 4 | The length of the Squiral increases with each side.


In summary, our hybrid model humanizes data-driven sub- 
goal detection models through a series of important steps. 
First, we apply a data-driven model to correct, historical 
student solutions to generate a set of human interpretable 
code shape clusters. Second, a human expert labels the 
subgoals these clusters represent, in a way that is meant 
to align with the original programming assignment. Third, 
we collect data from students solving the same task using
a programming environment augmented with subgoal feedback
(i.e. human labels with colors) based on the DD subgoal
detector. Fourth, experts examine code traces with
the DD subgoal feedback to determine when the displayed
subgoal labels were actually achieved. Fifth, human experts
modify the code shapes that led to discrepancies between
the data-driven and expert detections for the displayed subgoals.
This series of steps leverages the natural cycle of a
frequently-offered CS0 class to bootstrap the creation of DD
subgoal detectors in the programming environment.


4. EVALUATION 

In this experiment, we applied both the hybrid and the DD 
models to detect subgoals in students’ S20 code submissions.
We also asked two human experts to evaluate the presence or 
absence of each subgoal in a subset of students’ code snap- 
shots (sequential states of student code, corresponding to 
their code edits, e.g., the addition or deletion of code blocks) 
using the subgoal labels (shown in Table 1) as rubric items 
(with 1 for the subgoal’s presence and 0 for its absence), 
resulting in an expert (or gold standard) subgoal state. 


Because the S20 data consists of 4,480 code snapshots, it is not
feasible to evaluate the models on every timestamp for two 
reasons. First, students mostly need feedback on a given 
subgoal when they are making edits towards finishing that 
subgoal, not after every single edit they make. Second, stu- 
dents break and recomplete subgoals frequently, even when 
they are not working on a particular subgoal, and therefore 
it is not meaningful to have an expert label at every single
datapoint. As a result, we evaluate the models at the
most meaningful times when a student is close to finishing
a subgoal, including: (1) the first time a student completed 
a subgoal, according to a human expert, (2) up to five code 
edits before that subgoal is completed, and (3) any time 
when either model (i.e. hybrid, or DD) suggests a change 
in a subgoal’s status. While these changes may or may not 
be true, we wanted to have experts evaluate the correctness 
of how the algorithms may have detected subgoal changes 
at these points. 
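
The three selection criteria can be sketched as a function that returns the snapshot indices to label for one student and one subgoal. The function name and the representation of detections as per-snapshot 0/1 lists are assumptions for illustration; each model is also assumed to start in the "not detected" state.

```python
def evaluation_points(expert_first, dd_status, hybrid_status, lookback=5):
    """Select snapshot indices to label for one student and one subgoal.

    expert_first: index of the first snapshot the expert marks complete
                  (None if the subgoal is never completed)
    dd_status, hybrid_status: per-snapshot 0/1 detections from each model
    """
    points = set()
    if expert_first is not None:
        # Criteria (1) and (2): first completion and up to `lookback` edits before.
        start = max(0, expert_first - lookback)
        points.update(range(start, expert_first + 1))
    for status in (dd_status, hybrid_status):
        previous = 0  # assumed initial state: subgoal not detected
        for i, current in enumerate(status):
            # Criterion (3): a model changes the subgoal's status.
            if current != previous:
                points.add(i)
            previous = current
    return sorted(points)

# Hypothetical trace of 10 snapshots.
dd = [0, 0, 1, 1, 1, 1, 1, 1, 1, 1]      # DD detects early, at index 2
hybrid = [0, 0, 0, 0, 0, 0, 0, 0, 1, 1]  # hybrid detects at index 8
points = evaluation_points(expert_first=8, dd_status=dd, hybrid_status=hybrid)
```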


For each subgoal, two human experts evaluated 150, 163, 
178, and 196 student snapshots for subgoals 1, 2, 3, and 4, 
respectively, making a total of 687 labeled snapshot data- 
points. The experts used the subgoals as their rubric; and 
they started the labelling process by evaluating the first time 
a subgoal is detected. To do so, they divided the data (27
students * 4 subgoals = 108 datapoints) into a set of 6 rounds,
where the first round consists of 2 students and each of the
remaining 5 rounds consists of 5 students. The two experts
evaluated the first round together to make sure they
have a clear understanding of the rubric subgoals. Then for 
the next two rounds, they evaluated the logs independently 
and after each, they met to discuss and resolve conflicts. 
For these 3 rounds, the two experts had a moderate to sub- 
stantial agreement with a Cohen’s kappa ranging from 0.5 
to 0.67. The reason why the kappa values are low is that 
we considered any disagreement even if it was a difference of 
one timestamp, but it does highlight how subgoal detection 
can be subjective, which is a challenge for measuring the ac- 
curacy of subgoal detection. As a result, for the remaining 
data, the two experts continued to evaluate it independently 
and then met to discuss and resolve conflicts to produce rel- 
atively objective gold standard expert labels. We used these 
labelled logs as ground truth to compare the accuracy and 
F1-scores of both the DD and the hybrid models. We also 
calculated the agreement between the expert detections and 
those generated by the hybrid and DD models. 
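
For reference, the metrics used in this comparison can be computed from binary labels as in the sketch below. This is a plain-Python illustration with made-up labels, not the evaluation script used in this work; it assumes 1 marks a subgoal as present and 0 as absent.

```python
def accuracy_f1(gold, pred):
    """Return (accuracy, F1) for binary gold-standard and predicted labels."""
    tp = sum(g == 1 and p == 1 for g, p in zip(gold, pred))
    fp = sum(g == 0 and p == 1 for g, p in zip(gold, pred))
    fn = sum(g == 1 and p == 0 for g, p in zip(gold, pred))
    tn = sum(g == 0 and p == 0 for g, p in zip(gold, pred))
    accuracy = (tp + tn) / len(gold)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return accuracy, f1

def cohens_kappa(a, b):
    """Cohen's kappa for two binary raters (undefined if chance agreement is 1)."""
    n = len(a)
    observed = sum(x == y for x, y in zip(a, b)) / n
    p_a, p_b = sum(a) / n, sum(b) / n
    expected = p_a * p_b + (1 - p_a) * (1 - p_b)
    return (observed - expected) / (1 - expected)

# Made-up labels for eight snapshots.
gold = [1, 1, 0, 0, 1, 0, 1, 0]
pred = [1, 0, 0, 0, 1, 1, 1, 0]
acc, f1 = accuracy_f1(gold, pred)
kappa = cohens_kappa(gold, pred)
```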


4.1 Supervised Learning Models 

We also compared our hybrid humanized model with super- 
vised machine learning models as another form of baseline. 
The supervised models were trained and tested on the S20
dataset, using the same 687 expert-labeled snapshots described
above. We trained separate models to detect each of
the subgoal labels (using training/testing splits discussed
below). This allows us to directly compare these super- 
vised methods, which require labeled training data, to the 
DD model, which does not, and to our hybrid approach, 
where the expert uses some of the labeled data to improve 
the model. While we discuss some limitations to this com- 
parison in Section 7, these baselines help contextualize the 
performance of our subgoal detectors. 


The baseline supervised machine learning models we have 
chosen are two shallow models (SVM [14] and XGBoost [9]) 
and one deep learning model (code2vec [4]). We used the
same edits for predictions as the DD and hybrid models, and
extracted term frequency-inverse document frequency
(TF-IDF) features from them, so that each edit is represented
as a vector, which is used for training the two shallow
models. We performed grid search cross validation for both
models. For SVM, we used a parameter space of linear kernel 
and Radial Basis Function (RBF) [60] kernel, and searched 
the regularization parameters from 1 to 10. For XGBoost, 
we searched the subsampling space of 0.1 to 1, with the 
number of estimators from 5 to 100, stepping by 5. Ten- 
fold cross validation is performed to search the parameter 
spaces. The training, validation, and test sets are split by 
students to make sure that no students used for testing will 
have an edit in training, since edits from one student would 
be very similar, and using samples similar to the testing set 
in training would lead to an unfair comparison. 
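
A student-level split of this kind can be sketched as follows. The tuple layout, the `test_fraction` value, and the fixed seed are assumptions for illustration; the actual proportions used for the baselines are not stated here.

```python
import random

def split_by_student(samples, test_fraction=0.2, seed=0):
    """Split (student_id, features, label) samples so that no student
    contributes edits to both the training and the test set."""
    students = sorted({s[0] for s in samples})
    rng = random.Random(seed)
    rng.shuffle(students)
    n_test = max(1, round(test_fraction * len(students)))
    test_students = set(students[:n_test])
    train = [s for s in samples if s[0] not in test_students]
    test = [s for s in samples if s[0] in test_students]
    return train, test

# Hypothetical data: 5 students with 4 labeled edits each.
samples = [(f"s{i % 5}", i, i % 2) for i in range(20)]
train, test = split_by_student(samples)
```

Splitting on student identifiers rather than on individual edits keeps near-duplicate snapshots from the same trace out of the test set.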


We selected one state-of-the-art deep learning model, code2vec
[4], for comparison as well, as the model has recently been
applied in educational code classification tasks [53]. In- 
stead of using a vector of term frequency to represent edits, 
code2vec uses the structural representations from ASTs to 
represent the code, and the representation is learned from 
training a neural network¹. We used early stopping to avoid
overfitting. To ensure the robustness of our results, we ran 
20 times with resampling for all supervised baseline models, 
and reported average metrics (e.g. F1-score, accuracy). 


5. RESULTS 


RQ1a: How well does a hybrid model perform compared to a 
data-driven model without expert augmentation? 


The prediction results for each subgoal from the DD model 
and the hybrid model are shown in Table 2. Our hybrid 
model achieves higher accuracy and F1-scores on all subgoals
than the DD model. In particular, it reaches > 0.8 accuracy 
for all subgoals, and it reaches > 0.9 accuracy for 2 out of 
the 3 subgoals that were modified (i.e. subgoal 1 and 4) 
with expert constraints. It is worth noting that the hybrid 
model achieves higher accuracy in subgoal 3, which was not 
modified with expert constraints. This is possible because 
after we modified code shapes for the other subgoals, we 
reclustered the code shapes (as described in Section 3.2.3), 
and the new code shapes for subgoal 3 were changed. This 
is likely because, after reclustering, some code shapes moved 
to subgoal 3, resulting in higher recall. 


We also measured the agreement between human experts,
DD, and hybrid model subgoal detections. For the four
subgoals, we find low to moderate agreement over all student
logs between DD and human expert detections, with
Cohen’s kappa values ranging between 0.25-0.581. However,
we find substantial or better agreement between the hybrid
model and human expert detections, with Cohen’s kappa
values ranging between 0.6-0.84. It is worth noting that this
agreement is higher than that achieved between the two
human experts (described in Section 4). These results show
that the addition of just three human constraints to a
data-driven model succeeded in improving its accuracy, making
the hybrid model agree more with the gold standard (that
the experts co-constructed) than the experts originally agreed
with one another.


¹ We applied code2vec using the process described in [53].


Figure 3: The number of code edits (y-axis) that occurred
between gold standard expert subgoal detections, and
detections by the DD and Hybrid models, for first-time subgoal
detections for each student trace (x-axis).


We also determined the number of false detections (i.e. Early 
and Late detections, as described in Section 3.2.2) for both 
the hybrid and DD models. We found the DD model de- 
tected 40.66%, 10.66%, 5.62% and 5.10% Early detections, 
and 2%, 13.5%, 12.36%, and 15.31% Late detections for sub- 
goal 1, 2, 3, and 4, respectively. However, our hybrid model 
detected 1.33%, 13.5%, 10.67% and 4.6% Early detections, 
and 8%, 5.52%, 1.7%, and 2.81% Late detections for subgoal 
1, 2, 3, and 4, respectively. Figure 3 visualizes these false
detections by presenting
the distance (i.e. how many edits) between the Early and 
Late detections by both the DD and hybrid models, and the 
gold standard human expert detections of the first time a 
subgoal is completed. The x-axis presents the students (n = 
27), and the y-axis presents the number of edits a student 
makes until they complete a subgoal. We used a negative 
number to indicate how much earlier the models were than 
the gold standard detection, 0 to show when models agree 
with the gold standard, and a positive number to show how 
much later the models were. We also used empty circles to 
indicate instances where a subgoal is never detected by the 
models but it was detected by human experts. While Fig- 
ure 3 shows only how early/late the model is in detecting 
when a subgoal is first completed, this is likely the most 
important detection. Our results suggest a high agreement
between the hybrid model’s detections and the gold standard,
and a strong improvement over the DD model.


Table 2: Precision, Recall, F1-score and Accuracy observed with Supervised, Data-Driven (DD) and our Hybrid Models.

Subgoal              Metric     SVM   XGBoost  code2vec  DD    Hybrid
Subgoal 1 (n = 150)  Precision  0.71  0.68     0.89      0.44  0.95
                     Recall     0.79  0.75     0.91      0.94  0.76
                     F1-score   0.71  0.66     0.89      0.60  0.85
                     Accuracy   0.82  0.76     0.90      0.57  0.91
Subgoal 2 (n = 163)  Precision  0.58  0.61     0.57      0.70  0.69
                     Recall     0.61  0.77     0.79      0.63  0.85
                     F1-score   0.55  0.66     0.64      0.66  0.76
                     Accuracy   0.74  0.75     0.75      0.77  0.81
Subgoal 3 (n = 178)  Precision  0.60  0.62     0.69      0.80  0.75
                     Recall     0.54  0.61     0.76      0.64  0.95
                     F1-score   0.52  0.59     0.70      0.71  0.84
                     Accuracy   0.70  0.72     0.79      0.82  0.88
Subgoal 4 (n = 196)  Precision  0.62  0.64     0.64      0.79  0.87
                     Recall     0.64  0.75     0.84      0.55  0.93
                     F1-score   0.59  0.66     0.69      0.65  0.90
                     Accuracy   0.76  0.74     0.75      0.80  0.93


RQ1b: How does a hybrid model perform compared to super- 
vised learning approaches that leverage expert subgoal labels? 


We show our comparison of supervised learning models and 
the hybrid model in Table 2. On all subgoals, except one, 
the hybrid model has higher accuracy and F1-score than 
code2vec, SVM, and XGBoost models, outperforming them 
by 0.10, 0.06 and 0.15 in F1-score, respectively. In
subgoal 1, we found that code2vec achieved a higher F1- 
score than all the other models, and a relatively similar ac- 
curacy to the hybrid model (0.903, 0.906). One possible 
explanation for this is that subgoal 1 is the simplest sub- 
goal, requiring only that the student has defined and used 
a procedure, regardless of its content, and this simple code 
pattern may have been easier for the supervised approaches 
to learn. These results show that a hybrid model iteratively 
constructed through cycles of student data collection, ma- 
chine learning, along with human labeling and correction 
can be used to create accurate automatic subgoal detections 
on a novice programming task. Furthermore, these super- 
vised learning models, that were mostly outperformed by 
our hybrid model, were learned using labels from snapshots 
that were strategically chosen to reflect important decision 
points for the model, suggesting that the supervised models’ 
performance may suffer if a random selection of snapshots 
were used to create an expert-labeled training set instead. 


5.1 Case Studies 


In this section, we present case studies to highlight ways the 
hybrid model improved upon the original DD model, as well 
as the hybrid model’s limitations. These case studies come 
from the 33% of students who were not investigated when the 
expert identified false detections in S20 from the original DD
model, as discussed in our methods (Section 3.2.2). These 
students also used the original DD system, but their data did 
not inform our hybrid model. These case studies, therefore, 
help us understand the ways our hybrid model might help 
new students, as well as limitations of the model. Though 
our prior work suggests the DD subgoal detections over- 
all were often useful to students [39], our post-hoc analysis 
here shows that the false detections may have negatively im- 
pacted student programming behavior, suggesting the need 
for our hybrid model’s improvements. 


5.1.1 Case Study 1 (Em): Inaccurate Data-Driven 
Subgoal Feedback 
We present here a case study of the student Em² when they 
received an inaccurate subgoal detection based on the DD 
model, and how the hybrid model could have mitigated this 
false detection. 


Em started solving Squiral by snapping the ‘when green flag 
clicked’ block (i.e. the ‘ReceiveGo’ block) onto the main script as 
shown in Figure 4A, and the system falsely detected sub- 
goal 1. Em then proceeded to work on subgoal 2, without 
creating the required procedure, and created a loop using 
the ‘repeat’ block nested with ‘move’ and ‘turnLeft’ blocks 
(shown in Figure 4B). This time, the system was correct 
in not detecting subgoal 2 because the loop was not in a 
procedure and did not iterate on the ‘rotations’ parameter. 
Afterwards, Em correctly created a procedure with one 
parameter, as shown in Figure 4C; however, the system showed 
no change in its feedback, since it had already falsely 
detected subgoal 1. Em then deleted the procedure and never 
recreated it. Em kept working for the rest of 
the time on creating a number of redundant loops, similar 
to the one in Figure 4C, with constant values to manually 
draw the Squiral shape (rather than using a variable to vary 
its length). 


Em spent a total of 55 minutes to draw Squiral in an iterative 
manner. While the DD system accurately detected subgoals 
2-4 as incomplete, this case study highlights potential harm 
that may have arisen from the false detection of subgoal 1. 
When subgoal 1 was detected early, Em skipped over creating 
a procedure. Later, when they did create the procedure 
correctly, they got no additional feedback (since the subgoal 
was already detected) and promptly deleted it. Preventing 
these unneeded deletions is a primary role of correct, pos- 
itive subgoal feedback. However, had Em been using the 
hybrid model, subgoal 1 would not have been detected early 
because the expert edited the faulty code shape. We argue 
that this might have allowed Em to keep working on creat- 
ing a procedure (as shown in snapshot C in Figure 4), which 
would have been detected as complete by the hybrid system 
only at this time. It is also possible that receiving inaccu- 
rate feedback at the very beginning may have led to Em’s 
mistrust in the system, since prior work shows that incorrect 
feedback can reduce students’ willingness to use it [48]. 


Note that we do not believe this incorrect DD detection 
means the full data-driven system should not be used, since 
our prior work shows the system can be helpful to students 
[39]. However, we need to explore how to present subgoal 
feedback in a way that prompts students to question the 
feedback, since any such program will inevitably fail to 
recognize some correct variants of a subgoal solution. 


Figure 4: Code snapshots A, B, and C, implemented by student 
Em in Case Study 1. 


² We provide anonymous names for students. 


5.1.2 Case Study 2 (Jo): Cases when Hybrid Subgoal 
Detections could be Incorrect 

We present here a case study where the hybrid model would 
not have provided accurate subgoal feedback. As evidenced 
by the model’s high overall accuracy (Table 2), these in- 
stances were rare, but understanding them highlights the 
affordances and limitations of our approach. Specifically, 
we investigated subgoal 1, where the hybrid model had the 
lowest F1-score. 


Student Jo correctly created a procedure with a parameter; 
however, since Jo had not yet used the procedure in their 
main script, subgoal 1 was not detected (accurate detection). 
Jo continued programming and completed subgoals 2 and 3, 
which were accurately detected by the system. Afterwards, 
Jo added a procedure call to the main script, and subgoal 1 
was detected as complete (accurate detection). However, Jo 
then added a ‘pen up’ block underneath the procedure call, 
where unexpectedly, the hybrid model changed subgoal 1’s 
status back to incomplete (incorrect detection). 


This false detection in the hybrid model was due to an 
overly-specific code shape. Specifically, the code shape re- 
quired the procedure call to be the last block in the main 
script (which was true for 94% of students, but not Jo), lead- 
ing to the false detection. This case confirms the importance 
of iteratively investigating and refining data-driven subgoal 
detections to keep improving their accuracy, which is a com- 
mon process in expert-authored models as well. While this 
false detection has a straightforward fix, similar to the ones 
presented in Section 3.2.2, it shows one limitation of the hy- 
brid model: creating these fixes requires the expert to find 
and address the false detections in the first place, which is 
dependent on finding the bugs in the data inspected. This is 
also one reason why the hybrid model performance of sub- 
goal 1 has a lower F1-score than the code2vec model (as 
shown in Table 2). 
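The effect of an overly specific code shape can be illustrated with a minimal sketch. It assumes, purely for illustration, that a main script is represented as a flat list of block names; the block names `callProcedure` and `penUp`, both rule functions, and the relaxed rule itself are hypothetical and are not the system's actual code shapes or the fix the expert would apply.

```python
def strict_shape(main_script):
    """Overly specific rule: the procedure call must be the LAST block."""
    return bool(main_script) and main_script[-1] == "callProcedure"


def relaxed_shape(main_script):
    """One possible relaxation: the call may appear anywhere in the script."""
    return "callProcedure" in main_script


# Jo's main script before and after adding a 'pen up' block under the call.
before_pen_up = ["ReceiveGo", "callProcedure"]
after_pen_up = ["ReceiveGo", "callProcedure", "penUp"]

strict_before = strict_shape(before_pen_up)   # True: subgoal 1 detected
strict_after = strict_shape(after_pen_up)     # False: status flips back
relaxed_after = relaxed_shape(after_pen_up)   # True: status is preserved
```

Under these assumptions the strict rule reproduces the flip from complete back to incomplete described above, while the relaxed rule keeps the subgoal detected. Whether a relaxation of this kind introduces new false positives would still have to be checked against student data.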


6. DISCUSSION 


6.1 Automated Subgoal Detection 

The key contribution of this paper is tackling the critical 
challenge of automated subgoal detection during programming 
tasks. Our results show that a hybrid data-driven model 
meaningfully addresses this goal, with high accuracy and F1 
score when detecting subgoals at key moments during students’ 
work. Our results also show that this is a challenging task: 
even a state-of-the-art supervised learning approach with 
access to labeled data struggled to identify some 
subgoals (F1 score as low as 0.64). This agrees with prior 
work using expert-authored [13, 38] and supervised learning 
models [53], showing that immediate feedback on subgoals 
is a hard problem. 


While automatically detecting subgoals is challenging, re- 
search suggests that the ability to provide automated, im- 
mediate feedback on subgoals can significantly improve stu- 
dents’ motivation and learning. Providing subgoals for novices 
can improve student learning by breaking down the pro- 
gramming task into smaller subtasks, which is a challeng- 
ing task for novices [36, 38]. In human tutoring dialogues 
for programming, tutors provide a combination of corrective 
and positive feedback, increasing students’ motivation and 
confidence in programming [8, 33, 17]. Automatic subgoal 
detection could be used to provide similar corrective and 
positive feedback during programming. We know of only 
three systems that afford such immediate feedback during 
programming without relying on unit tests and that have 
been shown to promote learning, confidence, and persistence: 
for linked lists [21], database queries [42], and block-based 
programming [38]. It is perhaps uncommon to make such 
systems due to the difficulty in anticipating all student ap- 
proaches, paired with the high potential for inaccuracies and 
student reactions to them. Our accuracy results suggest that 
our hybrid, humanized approach can be used to build similar 
automatic subgoal detection systems that could be deployed 
and more easily scaled across problems in real classrooms. 


6.2 Affordances of Data-driven, Hybrid and Expert Models 

Our results suggest that the hybrid model has good poten- 
tial for solving the problem of automated subgoal detection. 
Here we discuss the advantages and trade-offs of the hybrid 
approach, compared to data-driven and expert models. 


6.2.1 RQ1.a: Hybrid versus Data-Driven Models 
Our results show that a hybrid, iterative model that lever- 
ages data-driven subgoal extraction, human labeling, and 
expert refinement based on labeled student data, can greatly 
improve model performance compared to a purely data-driven 
(DD) model. The expert constraints improved the F1-score of 
the data-driven model by 0.14-0.25 points, as shown in Table 
2. Based on our analysis, the hybrid model reduced the 
number of Early and Late subgoal detections and increased 
the onTime detections, when compared to the original DD 
model, as shown in Figure 3. This is a critical improvement, 
since prior work shows that the quality of feedback affects 
not only novices’ programming behavior [48, 51], but also 
their self-perceptions and trust in the learning environment [48]. 
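The Early, Late, and onTime categories can be illustrated with a short sketch. It assumes, hypothetically, that each student attempt is summarized by two snapshot indices: the snapshot at which the expert marks the subgoal complete and the snapshot at which the detector first marks it complete. These definitions, the function name, the additional `Missed` label, and the toy data are assumptions for illustration only.

```python
from collections import Counter


def detection_timing(gold_index, detected_index):
    """Classify one detection relative to the expert-labeled completion.

    gold_index: snapshot index where the expert marks the subgoal complete.
    detected_index: snapshot index where the detector first fires,
                    or None if it never fires (hypothetical 'Missed' case).
    """
    if detected_index is None:
        return "Missed"
    if detected_index < gold_index:
        return "Early"
    if detected_index > gold_index:
        return "Late"
    return "onTime"


# Invented (gold_index, detected_index) pairs for five attempts.
pairs = [(12, 3), (12, 12), (20, 25), (8, 8), (15, None)]
tally = Counter(detection_timing(g, d) for g, d in pairs)
```

In this framing, a false early detection such as the one in Case Study 1 corresponds to a detector index far below the expert index, and a model improvement shows up as mass moving from the Early and Late counts into onTime.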


The hybrid model creation does require additional labelling 
effort needed to evaluate the models; however, this effort 
seems well worth it, and is needed to evaluate the accuracy 
of any data-driven model before deployment. Compared to 
prior work, our iterative hybrid model shares similar benefits 
of “human-in-the-loop” methods in machine learning [56, 30] 
and also represents data-driven rules in an interpretable and 
editable form that simplified the process of merging human 
insights into the model. In prior work, Diana et al. found, 
qualitatively, that their generated data-driven rubrics are 
considerably similar to human-generated rubrics [19]. Sim- 
ilarly, in our work, not only did instructors agree with the 
hybrid model subgoal detections, we also found these detec- 
tions have substantial or better agreement with the human 
expert gold standard than the DD model. From these re- 
sults, we conclude that a hybrid model can be used to itera- 
tively improve and humanize data-driven subgoal detection. 


6.2.2 RQ1.b: Hybrid versus Supervised Learning Models 
that Leverage Expert Subgoal Labels 

Our results show that a hybrid model can surpass the per- 
formance of supervised learning models. The steps we used 
to create our hybrid model were to: (1) apply a data-driven 
model, (2) add expert constraints, and (3) determine inter- 
esting datapoints consisting of times when subgoals might 
be achieved. The steps we used to create supervised learn- 
ing models leveraged the interesting datapoints from step 3 
of our hybrid model, hand-labeled them, and used them to 
build supervised learning models. We highlight this to point 
out that it is hard to determine what labeled data to use in a 
supervised learning model for subgoal detection, since there 
are hundreds of potential snapshots from each student. As 
a result, it is unclear whether a supervised model learned 
from a labeled dataset selected at random or at regular 
timestamps would have performed as well. 
Our results show that, even after using carefully selected 
labelled data to train the model, the SVM and XGBoost 
baselines did not achieve the level of accuracy of the hybrid 
model for all subgoals. Even code2vec, the state-of-the-art 
supervised learning model [53], has lower performance for all 
subgoals, except subgoal 1, than the hybrid model. Perhaps 
code2vec performed worse due to the small size of the labelled 
dataset (687 datapoints), though recent results suggest the 
model is still effective with small datasets [52]. 


6.2.3 Hybrid versus Expert Models 

Our hybrid approach offers distinct advantages and trade-offs 
compared to expert-authored models for subgoal detection. 
Traditional expert-authored models (e.g. constraint-based 
tutors [42]) have the advantage of high accuracy and high 
confidence, but the trade-offs of considerable domain expert 
time for creation and the potential failure to anticipate some 
student strategies and misconceptions. Our hybrid model 
has the advantage of incorporating actual student strategies 
and misconceptions, and primarily requires human effort to 
label data and identify errors, tasks which can potentially be 
done by non-experts and distributed across multiple people. 
A domain expert is only needed to edit the automatically ex- 
tracted rules and is afforded the chance to do so with actual 
student data available. A significant trade-off of the hybrid 
model is its reliance on data: the quality of the dataset 
will directly impact the quality of the subgoal detectors. 
Furthermore, both models are likely to need refinement as 
students use them, and this process is already built into the 
hybrid model creation and refinement cycle. 


7. LIMITATIONS & CONCLUSION 
This work has five main limitations. First, while the DD 
model can capture small differences in solution approaches 
in Squiral (like having either a ‘turnLeft’ or a ‘turnRight’ 
block), we have not tested it in programming tasks with a 
larger space of solution approaches. Therefore, it is not 
known how well the accuracy results will generalize to other 
types of programming tasks or languages. However, the it- 
erative process of data collection, DD subgoal extraction, 
labeling, and collection of data from students using the sub- 
goal labels and detectors, could be applied for other pro- 
gramming problems, of the same level as Squiral, and re- 
peated until the models achieved high accuracy. Second, 
some of the S20 data that was used for the models’ evaluation was 
also used to inform expert constraints in the hybrid model. 
However, this was only 66% of the data, and we discussed 
above how the added constraints are generalizable, which 
should have helped in any semester (see Section 5.1). Ad- 
ditionally, our case studies in Section 5.1 show examples of 
how the hybrid detector performed on unseen data, though 
there was insufficient data for a quantitative evaluation. 


Third, we used only the labelled S20 data to train the supervised 
baseline models, but we also used 3 other semesters 
of unlabeled data to train the unsupervised DD and hybrid 
models. However, we argue that this ability to leverage a 
larger unlabeled dataset is an advantage of the unsupervised 
methods, rather than a limitation of our analysis. Fourth, 
some of the datapoints that were labeled for the evaluation of 
all the models were selected in part by using the hybrid and 
DD model detections, as discussed above, and this might 
have biased the results. However, all the models were evalu- 
ated on these same datapoints that were strategically chosen 
for their importance, and there are instances where some of 
the supervised models outperformed the original DD model. 
It is not clear how a different data selection strategy would 
have affected the results, and we argue that training and 
testing the supervised models on a dataset of the same size 
with randomly selected snapshots would likely decrease the 
performance of supervised models. Finally, we did not com- 
pare the hybrid model to a purely expert-authored model, 
and we did not measure time taken by experts to modify the 
data-driven rules. We argue that these comparisons require 
hiring experts to author rules and performing time analysis, 
which is beyond the scope of this paper. 


In summary, this work proposes a new paradigm for ‘hu- 
manizing’ data-driven subgoal detection for novice program- 
ming. Specifically, we proposed to humanize data-driven 
subgoals in an iterative refinement process. We (1) extract 
data-driven subgoals from student work, (2) give them human 
labels, (3) collect more data from students programming 
with the labels and subgoal detectors, and (4) present 
experts with the labels and interpretable detectors, along with 
student behavior data so they can add expert constraints. 
This process ensures that humans are involved in every step 
of the creation of automatic subgoals, offering the advan- 
tages of reflecting real student behaviors, and limiting and 
focusing expert authoring time. Our results show that this 
hybrid humanized model outperforms fully data-driven mod- 
els and state-of-the-art supervised learning models. This 
proposed paradigm can be used to create humanized automatic 
subgoal detection for tasks that are important for student 
learning, motivation, and retention, but for which it may be 
too expensive to create full expert models. 


Proceedings of The 14th International Conference on Educational Data Mining (EDM 2021) 77 


8. REFERENCES 

[1] V. Aleven, B. M. McLaren, and J. Sewall. Scaling up 
programming by demonstration for intelligent tutoring 
systems development: An open-access web site for 
middle school mathematics learning. IEEE 
transactions on learning technologies, 2(2):64-78, 2009. 

[2] V. Aleven, B. M. McLaren, J. Sewall, M. Van Velsen, 
O. Popescu, S. Demi, M. Ringenberg, and K. R. 
Koedinger. Example-tracing tutors: Intelligent tutor 
development for non-programmers. International 
Journal of Artificial Intelligence in Education, 
26(1):224-269, 2016. 

[3] U. Alon, M. Zilberstein, O. Levy, and E. Yahav. A 
general path-based representation for predicting 
program properties. PLDI’18, 2018. 

[4] U. Alon, M. Zilberstein, O. Levy, and E. Yahav. 
code2vec: Learning distributed representations of 
code. POPL’19, 2019. 

[5] J. R. Anderson, A. T. Corbett, K. R. Koedinger, and 
R. Pelletier. Cognitive tutors: Lessons learned. The 
journal of the learning sciences, 4(2):167-207, 1995. 

[6] D. Azcona, P. Arora, I.-H. Hsiao, and A. Smeaton. 
user2code2vec: Embeddings for profiling students 
based on distributional representations of source code. 
In LAK’19, 2019. 

[7] M. Ball. Lambda: An Autograder for snap. Technical 
report, Electrical Engineering and Computer Sciences 
University of California at Berkeley, 2018. 

[8] K. E. Boyer, R. Phillips, M. D. Wallis, M. A. Vouk, 
and J. C. Lester. Learner characteristics and feedback 
in tutorial dialogue. In Proceedings of the Third 
Workshop on Innovative Use of NLP for Building 
Educational Applications, pages 53-61. Association for 
Computational Linguistics, 2008. 

[9] T. Chen, T. He, M. Benesty, V. Khotilovich, Y. Tang, 
H. Cho, et al. Xgboost: extreme gradient boosting. R 
package version 0.4-2, 1(4), 2015. 

[10] C.-Y. Chou, B.-H. Huang, and C.-J. Lin. 
Complementary machine intelligence and human 
intelligence in virtual teaching assistant for tutoring 
program tracing. Computers & Education, 
57(4):2303-2312, 2011. 

[11] A. Corbett and J. R. Anderson. Locus of Feedback 
Control in Computer-Based Tutoring: Impact on 
Learning Rate, Achievement and Attitudes. In 
Proceedings of the SIGCHI Conference on Human 
Computer Interaction, pages 245-252, 2001. 

[12] A. Corbett, L. Kauffman, B. Maclaren, A. Wagner, 
and E. Jones. A cognitive tutor for genetics problem 
solving: Learning gains and student modeling. Journal 
of Educational Computing Research, 42(2):219-239, 
2010. 

[13] A. T. Corbett and J. R. Anderson. Knowledge 
decomposition and subgoal reification in the act 
programming tutor. 1995. 

[14] C. Cortes and V. Vapnik. Support-vector networks. 
Machine learning, 20(3):273-297, 1995. 

[15] S. Custer, E. M. King, T. M. Atinc, L. Read, and 
T. Sethi. Toward data-driven education systems: 
Insights into using information to measure results and 
manage change. Center for Universal Education at 
The Brookings Institution, 2018. 

[16] J. Denner and L. Werner. Computer programming in 
middle school: How pairs respond to challenges. 
Journal of Educational Computing Research, 
37(2):131-150, 2007. 

[17] B. Di Eugenio, D. Fossati, S. Ohlsson, and D. Cosejo. 
Towards explaining effective tutorial dialogues. In 
Annual Meeting of the Cognitive Science Society, 
pages 1430-1435, 2009. 

[18] N. Diana, M. Eagle, J. Stamper, S. Grover, 
M. Bienkowski, and S. Basu. Data-driven generation 
of rubric parameters from an educational 
programming environment. In International 
Conference on Artificial Intelligence in Education, 
pages 490-493. Springer, 2017. 

[19] N. Diana, M. Eagle, J. Stamper, S. Grover, 
M. Bienkowski, and S. Basu. Data-driven generation 
of rubric criteria from an educational programming 
environment. In Proceedings of the 8th International 
Conference on Learning Analytics and Knowledge, 
pages 16-20, 2018. 

[20] M. L. Epstein, A. D. Lazarus, T. B. Calvano, K. A. 
Matthews, R. A. Hendel, B. B. Epstein, and G. M. 
Brosvic. Immediate feedback assessment technique 
promotes learning and corrects inaccurate first 
responses. The Psychological Record, 52(2):187-201, 
2002. 

[21] D. Fossati, B. Di Eugenio, S. Ohlsson, C. Brown, and 
L. Chen. Data driven automatic feedback generation 
in the ilist intelligent tutoring system. Technology, 
Instruction, Cognition and Learning, 10(1):5-26, 2015. 

[22] D. Garcia, B. Harvey, and T. Barnes. The beauty and 
joy of computing. ACM Inroads, 6(4):71-79, 2015. 

[23] V. G. Goecks. Human-in-the-loop methods for 
data-driven and reinforcement learning systems. arXiv 
preprint arXiv:2008.13221, 2020. 

[24] L. Gusukuma, A. C. Bart, D. Kafura, and J. Ernst. 
Misconception-driven feedback: Results from an 
experimental study. In Proceedings of the 2018 ACM 
Conference on International Computing Education 
Research, pages 160-168, 2018. 

[25] L. Gusukuma, D. Kafura, and A. C. Bart. Authoring 
feedback for novice programmers in a block-based 
language. In 2017 IEEE Blocks and Beyond Workshop 
(B&B), pages 37-40. IEEE, 2017. 

[26] P. Ihantola, T. Ahoniemi, V. Karavirta, and 
O. Seppälä. Review of recent systems for automatic 
assessment of programming assignments. In 
Proceedings of the 10th Koli calling international 
conference on computing education research, pages 
86-93, New York, NY, 2010. ACM. 

[27] J. Jeuring, L. T. van Binsbergen, A. Gerdes, and 
B. Heeren. Model solutions and properties for 
diagnosing student programs in ask-elle. In 
Proceedings of the Computer Science Education 
Research Conference, pages 31-40, 2014. 

[28] D. E. Johnson. Itch: Individual testing of computer 
homework for scratch assignments. In Proceedings of 
the 47th ACM Technical Symposium on Computing 
Science Education, pages 223-227. ACM, 2016. 

[29] D. E. Johnson. Itch: Individual testing of computer 
homework for scratch assignments. In Proceedings of 
the 47th ACM Technical Symposium on Computing 
Science Education, pages 223-227, New York, NY, 
2016. ACM. 

[30] B. Kim. Interactive and interpretable machine learning 
models for human machine collaboration. PhD thesis, 
Massachusetts Institute of Technology, 2015. 

[31] N. Körber, K. Geldreich, A. Stahlbauer, and 
G. Fraser. Finding anomalies in scratch assignments. 
arXiv preprint arXiv:2102.07446, 2021. 

[32] E. Lahtinen, K. Ala-Mutka, and H.-M. Järvinen. A 
study of the difficulties of novice programmers. Acm 
Sigcse Bulletin, 37(3):14-18, 2005. 

[33] M. R. Lepper, M. Woolverton, D. L. Mumme, and 
J. Gurtner. Motivational techniques of expert human 
tutors: Lessons for the design of computer-based 
tutors. Computers as cognitive tools, 1993:75-105, 
1993. 

[34] A. Luxton-Reilly, I. Albluwi, B. A. Becker, 
M. Giannakos, A. N. Kumar, L. Ott, J. Paterson, 
M. J. Scott, J. Sheard, and C. Szabo. Introductory 
programming: a systematic literature review. In 
Proceedings Companion of the 23rd Annual ACM 
Conference on Innovation and Technology in 
Computer Science Education, pages 55-106, 2018. 

[35] L. E. Margulieux and R. Catrambone. Finding the 
best types of guidance for constructing 
self-explanations of subgoals in programming. Journal 
of the Learning Sciences, 28(1):108-151, 2019. 

[36] L. E. Margulieux, R. Catrambone, and M. Guzdial. 
Employing subgoals in computer programming 
education. Computer Science Education, 26(1):44-67, 
2016. 

[37] L. E. Margulieux, M. Guzdial, and R. Catrambone. 
Subgoal-labeled instructional material improves 
performance and transfer in learning to develop 
mobile applications. In Proceedings of the ninth annual 
international conference on International computing 
education research, pages 71-78, 2012. 

[38] S. Marwan, G. Gao, S. Fisk, T. W. Price, and 
T. Barnes. Adaptive immediate feedback can improve 
novice programming engagement and intention to 
persist in computer science. In Proceedings of the 2020 
ACM Conference on International Computing 
Education Research, pages 194-203, 2020. 

[39] S. Marwan, T. W. Price, M. Chi, and T. Barnes. 
Immediate data-driven positive feedback increases 
engagement on programming homework for novices. In 
Educational Data Mining in Computer Science 
Education (CSEDM) Workshop @ EDM’20, 2020. 

[40] J. McKendree. Effective feedback content for tutoring 
complex skills. Human-computer interaction, 
5(4):381-413, 1990. 

[41] A. Mitrovic and S. Ohlsson. Evaluation of a 
constraint-based tutor for a database language. 1999. 

[42] A. Mitrovic, S. Ohlsson, and D. K. Barrow. The effect 
of positive feedback in a constraint-based intelligent 
tutoring system. Computers & Education, 
60(1):264-272, 2013. 

[43] D. Palmer. A motivational view of 
constructivist-informed teaching. International 
Journal of Science Education, 27(15):1853-1881, 2005. 

[44] G. D. Phye and T. Andre. Delayed retention effect: 
attention, perseveration, or both? Contemporary 
Educational Psychology, 14(2):173-185, 1989. 

[45] T. W. Price, Y. Dong, and D. Lipovac. iSnap: 
Towards Intelligent Tutoring in Novice Programming 
Environments. In Proceedings of the ACM Technical 
Symposium on Computer Science Education, 2017. 

[46] T. W. Price, Z. Liu, V. Catete, and T. Barnes. Factors 
Influencing Students’ Help-Seeking Behavior while 
Programming with Human and Computer Tutors. In 
Proceedings of the International Computing Education 
Research Conference, 2017. 

[47] T. W. Price, R. Zhi, and T. Barnes. Evaluation of a 
Data-driven Feedback Algorithm for Open-ended 
Programming. In Proceedings of the International 
Conference on Educational Data Mining, 2017. 

[48] T. W. Price, R. Zhi, and T. Barnes. Hint Generation 
Under Uncertainty: The Effect of Hint Quality on 
Help-Seeking Behavior. In Proceedings of the 
International Conference on Artificial Intelligence in 
Education, 2017. 

[49] K. Rivers and K. R. Koedinger. Data-Driven Hint 
Generation in Vast Solution Spaces: a Self-Improving 
Python Programming Tutor. International Journal of 
Artificial Intelligence in Education, 27(1):37-64, 2017. 

[50] C. Romero and S. Ventura. Educational data mining 
and learning analytics: An updated survey. Wiley 
Interdisciplinary Reviews: Data Mining and 
Knowledge Discovery, 10(3):e1355, 2020. 

[51] P. Shabrina, S. Marwan, T. W. Price, M. Chi, and 
T. Barnes. The impact of data-driven positive 
programming feedback: When it helps, what happens 
when it goes wrong, and how students respond. In 
Educational Data Mining in Computer Science 
Education (CSEDM) Workshop @ EDM’20, 2020. 

[52] Y. Shi, Y. Mao, T. Barnes, M. Chi, and T. W. Price. 
More with less: Exploring how to use deep learning 
effectively through semi-supervised learning for 
automatic bug detection in student code. EDM, 2021. 

[53] Y. Shi, K. Shah, W. Wang, S. Marwan, P. Penmetsa, 
and T. Price. Toward semi-automatic misconception 
discovery using code embeddings. In The 11th 
International Conference on Learning Analytics 
Knowledge (LAK 21), 2021. 

[54] V. J. Shute. Focus on formative feedback. Review of 
educational research, 78(1):153-189, 2008. 

[55] D. Sleeman, A. E. Kelly, R. Martinak, R. D. Ward, 
and J. L. Moore. Studies of diagnosis and remediation 
with high school algebra students. Cognitive Science, 
13(4):551-568, 1989. 

[56] R. Souza, L. Neves, L. Azevedo, R. Luiz, E. Tady, 
P. R. Cavalin, and M. Mattoso. Towards a 
human-in-the-loop library for tracking hyperparameter 
tuning in deep learning development. In LADaS@ 
VLDB, pages 84-87, 2018. 

[57] J. Stamper, T. Barnes, and M. Croy. Enhancing the 
automatic generation of hints with expert seeding. 
International Journal of Artificial Intelligence in 
Education, 21(1-2):153-167, 2011. 

[58] J. Stamper, T. Barnes, L. Lehmann, and M. Croy. 
The hint factory: Automatic generation of 
contextualized help for existing computer aided 
instruction. In Proceedings of the 9th International 
Conference on Intelligent Tutoring Systems Young 
Researchers Track, pages 71-78, 2008. 

[59] D. Toll, A. Wingkvist, and M. Ericsson. Current state 
and next steps on automated hints for students 
learning to code. In 2020 IEEE Frontiers in Education 
Conference (FIE), pages 1-5. IEEE, 2020. 

[60] J.-P. Vert, K. Tsuda, and B. Schölkopf. A primer on 
kernel methods. Kernel methods in computational 
biology, 47:35-70, 2004. 

[61] D. Wang, W. Dong, and Y. Zhang. Collective 
intelligence for smarter neural program synthesis. In 
2020 35th IEEE/ACM International Conference on 
Automated Software Engineering Workshops (ASEW), 
pages 98-104. IEEE, 2020. 

[62] L. E. Winslow. Programming pedagogy—a 
psychological overview. ACM Sigcse Bulletin, 
28(3):17-22, 1996. 

[63] D. Xin, L. Ma, J. Liu, S. Macke, S. Song, and 
A. Parameswaran. Accelerating human-in-the-loop 
machine learning: challenges and opportunities. In 
Proceedings of the Second Workshop on Data 
Management for End-To-End Machine Learning, 
pages 1-4, 2018. 

[64] R. Zhi, T. W. Price, N. Lytle, and T. Barnes. 
Reducing the State Space of Programming Problems 
through Data-Driven Feature Detection. In 
Proceedings of the Educational Data Mining in 
Computer Science Education Workshop at the 
International Conference on Educational Data Mining, 
2018. 




Automatically classifying student help requests: a 
multi-year analysis 


Zhikai Gao 
North Carolina State 
University 


zgao9@ncsu.edu 


Sarah Heckman 
North Carolina State 
University 


sarah_heckman@ncsu.edu 


ABSTRACT 


As Computer Science has increased in popularity, so too 
have class sizes and demands on faculty to provide support. 
It is therefore more important than ever for us to 
identify new ways to triage student questions, identify com- 
mon problems, target students who need the most help, and 
better manage instructors’ time. By analyzing interaction 
data from office hours we can identify common patterns, 
and help to guide future help-seeking. My Digital Hand 
(MDH) is an online ticketing system that allows students 
to post help requests, and for instructors to prioritize sup- 
port and track common issues. In this research, we have 
collected and analyzed a corpus of student questions from 
across six semesters of a CS2 course with a focus on 
object-oriented programming [17]. As part of this work, we 
grouped the interactions into five categories, analyzed the 
distribution of help requests, balanced the categories using 
the Synthetic Minority Oversampling Technique (SMOTE), and 
trained a LightGBM-based classifier to automatically classify 
student requests. We found that over 69% of the questions were 
unclear or barely specified. We demonstrated the stability of 
the model across semesters through leave-one-out 
cross-validation, and the target model achieves an accuracy 
of 91.8%. Finally, we find that online office hours can 
provide more help for more students. 


Keywords 
Office Hour, Computer Science Education Research, Text 
Analysis, help-seeking request 


1. INTRODUCTION 


Over the past decade the popularity of CS majors has 
increased and enrollments have skyrocketed [2]. This has 
created challenges for instructors with increasing demands for 
individual support, collaborative learning, and automated 
guidance [12, 2, 16, 15]. As the sizes of courses and cohorts 
have increased, the demand for office hours has begun to 
exceed the time that instructors and staff have available [12, 
13]. To address these needs instructors have adopted a wide 
range of innovative support models including virtual office 
hours [10], peer support [7], and ticketing systems for help- 
seeking interactions [13]. The last approach is exemplified 
by My Digital Hand (MDH) [21], an online support sys- 
tem for office hours which allows students to queue for office 
hours, post questions in advance, and record the outcome 
of interactions. MDH assists students in structuring their 
help-seeking interactions with teaching staff. It also assists 
instructors and teaching assistants (TA) in managing their 
courses, by allowing them to triage student questions and 
target their effort during office hours to be efficient and meet 
group and individual needs. MDH also tracks help-seeking 
and interaction data throughout the whole semester. Using 
this data, we can identify patterns in students’ help requests 
and automatically classify the questions. One common
challenge for help-seeking interactions in large classes
arises when many students ask the same or similar questions
but must be dealt with separately, which consumes limited
instructor time. One approach to address this is to develop
automated Q&A systems which can leverage common problems. In
order for this to work, however, students must provide
sufficient information about their problems so that they can
receive targeted support.


Our goal is to develop analytical methods to understand 
what kinds of help students seek during office hours, how 
they frame their questions to the instructors, and whether 
or not we can automatically classify questions to support 
guidance and time management. By analyzing students’ 
help requests across course offerings we can better under- 
stand what kinds of challenges the students are facing, and 
how the teaching staff can better anticipate students’ needs 
and target their limited support. Moreover, by automat- 
ically classifying help requests we can help teaching staff 
to efficiently triage student questions and identify common 
problems that may be solved with group support or peer 
assistance. Over the long term we will develop summary 
statistics which can be used to support instructors in course
management, and we will augment our existing ticketing sys- 
tem with support for automatic categorization. 


In this paper we will address four specific research questions 
in the context of a CS2, object-oriented-focused course: 


• RQ1: What types of questions do students ask on
MDH during office hours and how do they formulate
their descriptions?


• RQ2: How can we automatically classify student help
requests and tickets?


• RQ3: How robust is our classification model across
different offerings?


• RQ4: Compared to regular office hours, do online
office hours provide more benefits?


In order to address RQ1 we analyzed our dataset to iden- 
tify common patterns of student questions and to classify 
them into five categories. We then addressed RQ2 and RQ3 by
training an automated classifier for student questions with
the goal of evaluating its stability across semesters. Due
to the COVID-19 pandemic, all courses operated online in
Fall 2020, which gave us an opportunity to study the
advantages and disadvantages of hosting office hours online.
Therefore, for RQ4 we analyzed the data patterns in Fall
2020 (F20) and compared them with the regular semesters.


1.1 Background 

Prior researchers have analyzed student help requests with 
the goal of understanding student behaviors. Xu and Lynch,
for example, applied deep learning approaches to classify
student question topics in MOOC discussion forums [24]. In
that work Xu and Lynch collected student posts from two
offerings of a MOOC on Big Data in Education. The authors
classified student questions into one of three types (Course
Content Question, Technique Question, and Course Logic
Question) and developed an automatic classifier using
Recurrent Neural Networks to divide questions into those
three categories. While the models were successful within a
single offering, Xu and Lynch found that they did not
generalize across offerings. Thus, the system suffered from
a cold start problem in each semester.


Vellukunnel et al. in turn collected Piazza posts from CS2 
courses offered at two institutions and analyzed the type 
and distribution of the questions students asked [22]. As 
part of this work they manually partitioned the questions 
into five categories and then analyzed the impact of stu- 
dents’ question types on their final grades. They concluded 
that asking constructive questions can help students to de- 
velop a better understanding of the course materials and in 
turn receive better grades. This analysis has informed our 
own work. However the Piazza platform, unlike MDH, is 
designed to support interactive discussion and online peer 
support through the use of threads and replies. By contrast 
the MDH system is focused on initial help seeking and not 
on collaborative dialogue. Therefore it is unclear whether 
our results will align with theirs. 


Prior researchers have also studied how instructors man- 
age office hours and how to make face to face support time 
more efficient and effective. Guzdial, for example, argued 
that office hours should incorporate diverse teaching tech- 
niques including pair programming, peer instruction, and 
backward design. These approaches, he argued, would po- 
tentially work to reduce wait times and support enhanced 
learning outcomes [8]. In order to provide more convenience 
for students, Harvard University introduced virtual office 
hours to an introductory programming course CS50 so that 
students can interact with teaching staff online [14]. How- 
ever, they found that those virtual sessions were often ineffi- 
cient and took more time to address the students’ problems. 
This research is complicated by the fact that students fre- 
quently avoid seeking help from teaching staff when they 
need it [1]. Some of the factors behind this help-avoidance 
include a lack of trust in the tutor’s abilities, inaccessibil- 
ity of office hours due to timing or other constraints, and a 
desire for independence in learning [18]. While this prior
research provides some guidance on the design of office
hours and the need to reach out to students, the impact of
how students frame their help requests has not yet been
analyzed extensively. One notable exception is the work of
Ren, Krishnamurthi, and Fisler, who designed a survey-based method
to help track the students’ help-seeking interactions during 
office hours in programming-based CS courses [20]. While 
informative, their approach is difficult to generalize as it 
depends on requiring the teaching staff to complete a de- 
tailed form after every interaction. In MDH, by contrast, 
we collect much of the data upfront as an integral part of 
the process. 


2. METHODS 


2.1 MDH system 

My Digital Hand (MDH) [21] is a ticketing system for office
hours that was developed to facilitate large CS courses.
Students using MDH request help during office hours by
raising a "virtual hand", that is, creating a ticket which
lists the topic they need help on, describes the issue they
are facing, and lists the steps they have taken to address
it. Once the ticket is
created it is visible to the teaching staff who can then use it 
to prioritize interactions or even group students together for 
help. Once the interaction is complete the teaching staff can 
close the ticket and describe how the interaction played out. 
Students are also given the opportunity to evaluate the help 
received. These feedback questions are configurable and set 
by instructors at the start of the semester. 


This data allows instructors to identify common issues facing 
students and to track the time it takes for students to re- 
ceive support from the teaching staff as well as the duration 
of each help session. In a prior analysis of MDH data, Smith
et al. found that 5% of students in a course accounted for
50% of office hour time, and that long individual interaction 
times, representing students who needed long and detailed 
guidance, served to delay many other short questions [21]. 
They concluded that a small but critical group of students 
are reliant on individual tutoring via office hours, while other 
students who need intermittent help are often unable to ob- 
tain support. These findings have motivated our own focus 
on developing analytical tools which can be used to analyze, 
prioritize, and manage help requests so that high-demand 
students do not shut out their peers. 
2.2 Data Collection


We analyzed data from seven semesters of a typical
second-semester Object-Oriented programming course [17] at a
research-intensive public university in the south-eastern
United States.
Basic descriptive statistics for the dataset are shown in Ta- 
ble 1. Students produced an average of 1477 tickets per 
semester with higher volumes in the fall semesters due to 
larger class sizes. The number of tickets also increased year 
over year due in part to larger class sizes, greater emphasis 
on tool use by the course instructor, and higher per-capita 
demand for office hours. 


The course is structured as a single lecture section with 12 
small-group lab sessions which are held weekly. Over the 
course of the semester students complete weekly lab assign- 
ments, 2-3 individual or team Projects (C-Projects), and 3 
separate Guided Projects (G-Projects). The G-Projects are 
designed to provide a review of prerequisite materials and 
introduce students to new concepts. The C-Projects are 
generally structured as two sub-assignments, one focused on 
design and system testing, and the other on implementa- 
tion and unit testing. Students manage their code via the 
GitHub platform with integrated support for the Jenkins au- 
tomation testing server. When students commit code they 
receive automated testing results from instructor-authored 
test cases as well as test cases that they supplied. The
students use feedback from test failures to guide their work
and their help-seeking [6].


In Fall 2020, this course was moved fully online due to the
pandemic. All office hours were hosted through Zoom meetings
where students could share their screen with the teaching
staff to show their code or any problems.


Table 1: Number of tickets and students for each semester
(F = Fall, S = Spring)


            F17    S18    F18    S19    F19    S20    F20
tickets    1146    609   1224    860   1650   1401   3452
students    208    157    259    174    256    191    308


The interaction data is the most important for our current 
analysis. The format of the interaction records, along with 
selected examples is shown in Table 2. For this analysis each 
ticket consists of three major parts: the participants, time 
and duration of interaction, and the context. 


2.3 RQ1: Categorization of Questions 

Our primary focus in RQ1 is to identify the types of ques- 
tions the students are asking and to understand how they 
describe their work. MDH allows students to frame their 
question topic or description in any way that they wish. 
The platform does not provide a list of suggested topics or 
mandate content beyond the basic text. This, in turn, led
students to vary widely in the descriptions and content that
they provide. We therefore studied two features of the
questions with the goal of supporting classification: the
students' topic, as contained in the "I'm working on" field,
and the longer problem description, as stated in the "my
problem is" field.


2.3.1 Classified by Topic 


Table 2: Attributes of the interaction data


Attribute                 Content explanation                               Example
interaction id            Id for each ticket                                30072
student id                Id for the student who raised the ticket          1950
teacher id                Id for the teacher who dealt with the ticket      20810
time raised hand          Timestamp for when the ticket was raised          2019-03-08 19:34:09
time interaction began    Timestamp for when the interaction began          2019-03-08 19:38:39
time interaction ended    Timestamp for when the interaction ended          2019-03-08 20:01:02
I'm working on            Topic for the question of each ticket             Program1Part1
my problem is             Detailed statement of the question of each        Null Pointer on TS test
                          ticket
I've tried                The solution the student tried before they        Debugging
                          raised the ticket


Rapidly identifying, or even anticipating, students’ question 
topics would allow teaching staff to anticipate the kinds of is- 
sues they should be prepared for and may also allow them to 
set up mini-groups within office hours to deal with problems 
assignment by assignment, or to separate code questions 
from conceptual ones. We therefore performed a manual 
analysis of the topics in our study dataset over all semesters 
with the goal of determining how students label their topics, 
and whether it is possible to either anticipate or sort their 
posts as they come in. 


Our preliminary analysis showed that in most cases the stu- 
dents simply entered the name of their current assignment 
or an abbreviation of it and provided no other details. More- 
over, due to the structure of the course deadlines almost ev- 
ery help request in a given session was focused on the same 
assignment. In the newest version of MDH, the question is 
now a check box and the instructor can set the assignments. 
As a consequence we decided to omit this from our classifi- 
cation task and focus on the types of help being sought. 


2.3.2 Classified by Description 

In the description section (“my problem is”), the students 
can provide a rich summary of their problem including a 
text description, bug reports, or even code snippets. If it is 
possible to automatically classify student posts then we can 
use that approach to triage student questions as they come 
in, perhaps separating long questions from short. We there- 
fore performed a manual analysis of the description content 
as well with the goal of identifying useful categories of posts. 
We also sought to examine how complex the problem
descriptions were. In our prior discussions with the teaching
staff they reported that many students provide too little
information in the description (e.g. a single word such as
"Errors"), provide too much (e.g. a full execution dump and
error log), or simply type gibberish with the sole goal of
securing a place in line. All of these strategies are
problematic, either because they provide too little
information to effectively triage posts, or because the dump
is too complex and likely out of date before the student
reaches the head of the line. We therefore analyzed both the
length and structure of the students' submissions with the
goal of understanding whether we can provide automatic
scaffolding for useful posts and automatic triage of the
submissions.
Figure 1: Distribution of the number of words in the
questions' description


Figure 2: Top 20 popular words in question description


We examined the length, content, and complexity of the
students' problem descriptions, that is, their responses to
the "my problem is" prompt. We also grouped common words to
find keywords that are associated with specific types of
submissions. We then used this information to inform our
understanding of the students' posting behavior and to
inform the design of the posting categories. This
preliminary analysis was consistent with the experience
described by the teaching staff. Figure 1 shows that the
average description was less than five words long. When
reviewing those short posts we noted that many students
preferred to use keywords to indicate their question topics
and problems rather than spelling their issues out. For
example, when encountering an error in implementing an add()
function, they often put "add()" as the problem description,
assuming that the method name, together with the assignment
information, provided enough context for the help request.


This led us to focus on the specific terms that students use
in their problem descriptions. In this analysis, we grouped
words by stemming and ignored stopwords to focus on the
primary content information. The top 20 words are shown in
Figure 2, top among them being "test". This is consistent
with the design of the assignments, where students were
provided with tests and required to develop their own. It is
also consistent with the teaching staff's observation that
many students focus on the tests as a guide for their
progress and for where they need help. Upon closer
examination of questions using this word we found that most
interactions were focused on failed test cases; a typical
description for a question of this type was "2 test case
fail". It would be difficult for instructors to interpret
these without reviewing the code and the test results in
more detail, but such a review takes time. Another closely
related word that was common in this dataset is "error",
which was used primarily when students encountered bugs or
other failures. In these cases in particular, the teaching
staff noted that some students would simply paste the crash
report into the question with little other context. This
kind of behavior is rare in the data but was useful for
instructors; we therefore used it as an additional factor.
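The keyword count described above can be sketched as follows. This is a minimal illustration only: the stop-word list, the `crude_stem` suffix stripper (a stand-in for a real stemmer), and the sample descriptions are assumptions introduced here, not the tooling or data used in the study.

```python
import re
from collections import Counter

# Hypothetical stop-word list; a real analysis would use a fuller one.
STOP_WORDS = {"i", "my", "the", "a", "an", "is", "are", "in", "on",
              "to", "and", "of", "for", "it"}


def crude_stem(word):
    """Very rough suffix stripping, standing in for a real stemmer."""
    for suffix in ("ing", "ed", "es", "s"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def top_terms(descriptions, n=20):
    """Count stemmed, non-stop-word terms and return the n most common."""
    counts = Counter()
    for text in descriptions:
        for token in re.findall(r"[a-z]+", text.lower()):
            if token not in STOP_WORDS:
                counts[crude_stem(token)] += 1
    return counts.most_common(n)


# Made-up descriptions in the style of the examples quoted in the text.
sample = [
    "2 test case fail",
    "failing tests in add()",
    "Null Pointer error on TS test",
    "errors",
]
print(top_terms(sample, n=3))
```

Grouping by stem lets "tests" and "test", or "failing" and "fail", count as the same term, which is why a single word such as "test" can dominate the ranking.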


Based upon this preliminary analysis we defined five cate- 
gories of help requests based upon the problem descriptions. 
These categories are shown in Table 3. We then labeled 
all interactions related to the problem manually. For each 
question, we also ranked the clarity or comprehensibility of 
student questions based upon the description provided. As 
we discuss below, most of the questions provided insufficient 
information to diagnose the problem. However as Figure 1 
shows, some students did elaborate on their problem thor- 
oughly as measured by the number of words in the problem 
description. 


2.4 Labeling Process 
2.4.1 Code book 


To investigate the distribution of the above five categories 
in our data, we first need to set up a standard to categorize 
our data and apply it. All seven semesters’ data was labeled 
by one researcher according to the following rules:


• Check if there is any text that is clearly an error
message copied from the compiler or a test failure. If so,
label it as Copied Error. Note that if the student
describes the error message in their own words, then
it should also be classified as Sufficient.


• Check if there is any text indicating that this is a test
problem, regardless of whether the description gives the
details of the test error. If you are sure that it
is a test problem, label it as Test. If it also qualifies as
Copied Error, classify it as Copied Error.


• If the text does not provide any information about
their question and you cannot understand or deduce
anything related to their question, classify it as
Useless.
Table 3: Explanation and example for all five categories we developed


Category       Explanation                                 Example description ("my problem is")
Useless        The description contains nothing related    I would like to check of it
               to the question.
Insufficient   The description contains partial            TicketManager getInstanceOf
               information about the student question,
               but not enough for instructors to
               understand the details.
Sufficient     Contains enough detail on the question      CourseRecordIOTest, I seem to be failing
               for instructors to understand. Usually a    reading the files, at the moments it is
               very clear sentence.                        testing the size of the ArrayLists, but
                                                           I am passing writing the files
Copied Error   Contains a copied error from the            I got this error: TypeError: barh() got
               compiler.                                   multiple values for argument 'width'
Test           Test case fail related problem.             2 test cases failed


• If the text does state what or where the problem is,
but not enough for you to fully understand or identify
what their question is, mark it as Insufficient.


• If the text only contains one or more words of the
method name associated with a problem, it can help to
localize the problem but provides no additional details.
It should be classified as Insufficient.


• If the text is in the form of "I don't understand xxxx"
without further explanation of which part they do not
understand or other details, classify it as Insufficient.


• If the text is in the form of "I don't understand xxxx"
with some further explanation, classify it as Sufficient.


• If the text tells you what their question is or describes
how they encountered the problem, classify it as Sufficient.


2.4.2 Inter-rater reliability

After all data was labeled, we randomly generated a subset
of 150 unique questions (30 for each category) and sent it
to another researcher to rate. In this subset, we revealed
the labels of 10 questions for each category as example
data, and the rater classified the remaining questions based
on those examples and the code book. We then compared the
result with the original labels and calculated Kappa [4] to
represent the inter-rater reliability. Kappa is widely
applied for measuring the agreement between two coders while
accounting for chance agreement. Generally a score higher
than 0.8 is considered acceptable. In our case, the final
unweighted Kappa value is 0.815, which is acceptable.
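As an illustration of how an unweighted Kappa corrects raw agreement for chance, the sketch below computes Cohen's kappa for two raters. We assume here that the two-rater Cohen statistic is the variant meant; the function name `cohen_kappa` and the six toy labels are hypothetical and are not drawn from the study data.

```python
from collections import Counter


def cohen_kappa(labels_a, labels_b):
    """Unweighted Cohen's kappa for two raters labeling the same items."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("both raters must label the same non-empty item set")
    n = len(labels_a)
    # Observed agreement: fraction of items given identical labels.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Chance agreement: product of each rater's marginal label frequencies.
    freq_a = Counter(labels_a)
    freq_b = Counter(labels_b)
    expected = sum(freq_a[c] * freq_b[c] for c in freq_a) / (n * n)
    if expected == 1.0:
        return 1.0  # both raters used one identical label throughout
    return (observed - expected) / (1 - expected)


# Toy example (made-up labels): the raters disagree on one of six tickets.
rater_a = ["Test", "Test", "Useless", "Sufficient", "Insufficient", "Insufficient"]
rater_b = ["Test", "Test", "Useless", "Insufficient", "Insufficient", "Insufficient"]
print(round(cohen_kappa(rater_a, rater_b), 2))  # 0.76
```

In the toy example the raw agreement is 5/6, but because "Insufficient" and "Test" are frequent for both raters, some of that agreement is expected by chance, and the statistic drops to 0.76.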


2.5 RQ2: Modeling 


In addressing RQ2 we drew on the basic categorization
developed in RQ1 to train automatic classifiers that can
triage posts by topic and content. We used the data from the
first six semesters as the training set and the last
semester (F20) as the testing set to evaluate our model. To
train our classification model, we first extracted training
features from the problem descriptions across our dataset.
The features included content features such as the keywords
described above as well as meta-text features such as
length, the number of stop-words (as a general proxy for
specificity), the punctuation, and the character case. These
meta-text features have the advantage that they are easy to
extract automatically and can therefore be used for
automated triage. Length, for example, is a suitable proxy
for completeness and coherence while punctuation and case
shifting are common in error messages. The full list of
these features is shown in Table 4.
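A minimal sketch of how such meta-text features could be extracted is given below. The stop-word list, the reading of "uppercase words" as fully upper-case tokens, and the function name `meta_features` are assumptions made for illustration; the study's actual extraction code is not described in the text.

```python
import string

# Hypothetical stop-word list; the list used in the study is not specified.
STOP_WORDS = {"i", "my", "the", "a", "an", "is", "are", "in", "on",
              "to", "and", "of", "for", "it", "that"}


def meta_features(description):
    """Compute the five meta-text features for one problem description."""
    words = description.split()
    return {
        # Number of whitespace-separated words.
        "length": len(words),
        # Number of characters, including spaces.
        "character": len(description),
        # Words that are stop words once surrounding punctuation is removed.
        "stop_words": sum(
            w.strip(string.punctuation).lower() in STOP_WORDS for w in words
        ),
        # ASCII punctuation characters.
        "punctuation": sum(ch in string.punctuation for ch in description),
        # Assumption: a word counts as uppercase if it is fully upper-case.
        "uppercase": sum(w.isupper() for w in words),
    }


# The "my problem is" example shown with the interaction attributes.
print(meta_features("Null Pointer on TS test"))
```

Because every feature is a simple count, the whole vector can be computed at the moment a ticket is submitted, which is what makes these features attractive for triage.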


We represented the text features as a tf-idf [19] matrix and
a basic word count matrix over the content. The word count
matrix is simply a 2D array which describes how many times
each term appears in each question text. The tf-idf matrix
is the product of two statistics, term frequency and inverse
document frequency. The term frequency uses the raw count of
a term in a text. The inverse document frequency is a
measure of how much information the word provides. Some
common words like "is" or "that" do not provide much
information but they do usually have a high term frequency.
Those words should have a lower inverse document frequency
(idf). We can calculate the value as:


$$\mathrm{idf}(t) = \ln\left(\frac{\text{Total number of documents}}{\text{Number of documents with term } t \text{ in it}}\right) \qquad (1)$$
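The following sketch works through the idf definition above on a toy corpus, and combines it with a raw-count term frequency as described in the text. The four tokenized "documents" are made up for illustration, and the function names are hypothetical.

```python
import math
from collections import Counter


def idf(term, documents):
    """Natural log of (total documents / documents containing the term)."""
    containing = sum(1 for doc in documents if term in doc)
    if containing == 0:
        raise ValueError(f"term {term!r} does not occur in any document")
    return math.log(len(documents) / containing)


def tf_idf(term, doc, documents):
    """Raw term count in one document multiplied by the term's idf."""
    return Counter(doc)[term] * idf(term, documents)


# Toy corpus of tokenized problem descriptions (made up).
docs = [
    ["test", "is", "failing"],
    ["error", "is", "here"],
    ["add", "is", "broken"],
    ["test", "case", "fail"],
]

for term in ("is", "test", "error"):
    print(term, round(idf(term, docs), 3))
```

Here "is" occurs in three of four documents and receives the lowest weight, ln(4/3), while "error" occurs in only one and receives the highest, ln(4), matching the intuition that common words carry less information.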


In our preliminary analysis we found that the matrices
performed poorly in classification due to the fact that both
were extremely sparse. We therefore opted to compress them
so that they would be compatible with the dense feature
approaches. To that end we built a Naive Bayes model [9]
using the tf-idf sparse features and then used its
predictions as features. From this model we generated five
shallow prediction features which correspond to the
probability that the question belongs to each category. We
followed this same approach with the word count vector and
used the resulting probabilities as features. The final list
of extracted features is shown in Table 5.
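The compression step can be illustrated with a small multinomial Naive Bayes model whose per-class posterior probabilities become dense features. This is a sketch under stated assumptions: it is a from-scratch implementation on raw word counts with Laplace smoothing (the word-count variant, not the tf-idf one), the training texts and the three categories shown are made up, and all function names are hypothetical.

```python
import math
from collections import Counter, defaultdict


def train_nb(docs, labels, alpha=1.0):
    """Fit a multinomial Naive Bayes model on tokenized documents."""
    class_sizes = Counter(labels)
    token_counts = defaultdict(Counter)
    for doc, label in zip(docs, labels):
        token_counts[label].update(doc)
    return {
        "alpha": alpha,
        "vocab": {tok for doc in docs for tok in doc},
        "classes": sorted(class_sizes),
        "log_prior": {c: math.log(k / len(labels)) for c, k in class_sizes.items()},
        "token_counts": token_counts,
        "totals": {c: sum(token_counts[c].values()) for c in class_sizes},
    }


def class_probabilities(model, doc):
    """Posterior probability of each class for one tokenized document."""
    vocab_size = len(model["vocab"])
    scores = {}
    for c in model["classes"]:
        score = model["log_prior"][c]
        for tok in doc:
            if tok in model["vocab"]:  # ignore words unseen in training
                num = model["token_counts"][c][tok] + model["alpha"]
                den = model["totals"][c] + model["alpha"] * vocab_size
                score += math.log(num / den)
        scores[c] = score
    # Normalize in log space for numerical stability.
    top = max(scores.values())
    exp = {c: math.exp(s - top) for c, s in scores.items()}
    z = sum(exp.values())
    return {c: e / z for c, e in exp.items()}


# Made-up training tickets; only three categories are shown for brevity.
train_docs = [["test", "case", "fail"], ["test", "fail"], ["add"],
              ["getinstanceof"], ["asdf"]]
train_labels = ["test", "test", "insufficient", "insufficient", "useless"]

model = train_nb(train_docs, train_labels)
features = class_probabilities(model, ["test", "fail"])
print({c: round(p, 3) for c, p in features.items()})
```

Each ticket is thereby reduced from a sparse row with one column per vocabulary term to one probability per category, which can sit alongside the meta-text features in a single dense feature vector.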


2.5.1 Model Training 

We trained our classification models using LightGBM [11], 
a Gradient Boosting Decision Tree (GBDT) algorithm pro- 
vided by Microsoft. GBDT is an ensemble model of de- 
cision trees trained in sequence. In each iteration, GBDT 
learns the decision trees by fitting the negative gradients 
Table 4: Text meta features list


Feature name   Explanation                                              Value example
length         Number of words in the problem description               12
character      Number of characters in the problem description          12
stop words     Number of stop words in the problem description          2
punctuation    Number of punctuation marks in the problem description   0
uppercase      Number of uppercase words in the problem description     0


Table 5: Text content features list


Feature name       Explanation                                                                          Value example
prob-tf-useless    Probability that the ticket belongs to category "useless", using tf-idf              0.75
prob-tf-ins        Probability that the ticket belongs to category "insufficient", using tf-idf         0.82
prob-tf-suf        Probability that the ticket belongs to category "sufficient", using tf-idf           0.77
prob-tf-error      Probability that the ticket belongs to category "copied error", using tf-idf         0.99
prob-tf-test       Probability that the ticket belongs to category "test", using tf-idf                 0.65
prob-cnt-useless   Probability that the ticket belongs to category "useless", using word count          0.75
prob-cnt-ins       Probability that the ticket belongs to category "insufficient", using word count     0.82
prob-cnt-suf       Probability that the ticket belongs to category "sufficient", using word count       0.77
prob-cnt-error     Probability that the ticket belongs to category "copied error", using word count     0.99
prob-cnt-test      Probability that the ticket belongs to category "test", using word count             0.65


(also known as residual errors). To reduce the complexity of
GBDT, LightGBM utilizes two novel techniques: Gradient-based
One-Side Sampling and Exclusive Feature Bundling. It also
uses a leaf-wise tree growth algorithm to optimize the
accuracy of the model, and it applies a maximum tree depth
to limit the over-fitting that leaf-wise growth might cause.
Further, it speeds up training by calculating the gain for
each split and using histogram subtraction. LightGBM is
known for its outstanding performance and relatively good
speed. Thus, many researchers have applied this method to
machine learning tasks.


The implementation code for LightGBM was provided by
Microsoft [11], and we used its Python library for the
modeling process. We trained a LightGBM model on the
features and labels of the training data and applied that
model to the testing data to predict the category of each
question. By calculating the accuracy of the predictions, we
can evaluate the performance of this model. We ran a series
of 20 preliminary experiments to explore the space of
parameters before we settled on the values listed in Table
6. Table 6 also lists the crucial parameters that generally
need to be tuned to improve classification model
performance. Since our goal was to achieve better accuracy,
in each experiment we tuned toward a larger max_bin, a
smaller learning_rate with larger num_iterations, a larger
num_leaves, and a larger max_depth until the accuracy
stopped improving.
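For concreteness, the final values reported in Table 6 can be written as a LightGBM parameter dictionary, as sketched below. The `objective` entry is an assumption (the text does not name the objective; "multiclass" is the natural choice for five categories), and the variable name is hypothetical.

```python
# Parameter values as reported in Table 6.
lightgbm_params = {
    # Assumption: not stated in the text; a five-way classifier would
    # normally use LightGBM's multiclass objective.
    "objective": "multiclass",
    "num_class": 5,          # five help-request categories
    "num_leaves": 1000,      # leaves per tree
    "max_depth": 10,         # cap on tree depth to limit over-fitting
    "max_bin": 150,          # bins used to bucket feature values
    "learning_rate": 0.1,
    "num_iterations": 32,    # boosting rounds
}

# A tree of depth d has at most 2**d leaves, so num_leaves is only
# meaningful if it stays within that bound.
print(lightgbm_params["num_leaves"] <= 2 ** lightgbm_params["max_depth"])
```

With the LightGBM Python package installed, a dictionary like this would typically be passed as the first argument to `lightgbm.train` together with a `lightgbm.Dataset` built from the feature matrix and labels; that call is omitted here so the sketch stays free of external dependencies.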


2.5.2 SMOTE 

During the modeling process, another issue we faced is that
the categories are highly imbalanced. Over half of the
questions are in the Insufficient category and the Copied
Error category contained fewer than one percent of
questions. To address the problem, we applied the SMOTE
method, which over-samples examples in the minority class.
SMOTE [5] first selects one minority class instance at
random, then chooses one of its k nearest neighbors at
random and connects those two instances to form a line
segment in the feature space; the synthetic instance is
created at a point along that segment. We applied this
method with k=5 and oversampled the data to generate the
training datasets and testing datasets for further model
training.
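The interpolation step at the core of SMOTE can be sketched in a few lines. This is a simplified illustration of generating a single synthetic point, not the library implementation used in the study; the function name `smote_sample` and the toy minority points are hypothetical.

```python
import math
import random


def smote_sample(minority, k=5, rng=None):
    """Create one synthetic point between a minority instance and a neighbor."""
    rng = rng or random.Random()
    # Pick a minority instance at random.
    base = rng.choice(minority)
    # Find its k nearest minority neighbors by Euclidean distance.
    others = [p for p in minority if p is not base]
    neighbours = sorted(others, key=lambda p: math.dist(base, p))[:k]
    neighbour = rng.choice(neighbours)
    # Place the synthetic point at a random position on the segment.
    gap = rng.random()
    return tuple(b + gap * (n - b) for b, n in zip(base, neighbour))


# Toy minority class: the four corners of the unit square.
minority_points = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0), (1.0, 1.0)]
synthetic = smote_sample(minority_points, k=2, rng=random.Random(0))
print(synthetic)
```

With k=2 on the unit square, each corner's nearest neighbors are its two adjacent corners, so every synthetic point lands on an edge of the square. This shows that SMOTE only fills in the region already spanned by the minority class rather than inventing points elsewhere.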


2.6 RQ3: Model stability over semesters 


For a trained model to be useful, however, it must be stable
across semesters or else we suffer from a cold-start problem
[3]. In order to assess the model stability we ran a series 
of experiments where we assessed the relative utility of the 
models by applying a leave-one-out validation strategy on 
a semester-by-semester basis. Showing that all models per- 
form at a comparable level provides a strong indication that 
the models themselves are consistent and useful, even early 
in the semester. 
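The semester-by-semester leave-one-out strategy amounts to the splitting scheme sketched below: each semester in turn is held out for evaluation while the model is trained on the others. The semester labels come from Table 1; the generator name is hypothetical, and the training and scoring steps are omitted because their details are not given in the text.

```python
SEMESTERS = ["F17", "S18", "F18", "S19", "F19", "S20", "F20"]


def leave_one_semester_out(semesters):
    """Yield (training semesters, held-out semester) pairs."""
    for held_out in semesters:
        train = [s for s in semesters if s != held_out]
        yield train, held_out


folds = list(leave_one_semester_out(SEMESTERS))
for train, held_out in folds:
    # In a full pipeline, a model would be fit on tickets from `train`
    # and scored on tickets from `held_out`.
    print(held_out, "<-", train)
```

Comparing the scores across the seven folds is what indicates whether a model trained on past offerings can be used from the first day of a new one.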


2.7 RQ4: Online Office hour analysis 

In Fall 2020, all the office hours were held online, which pro- 
vided valuable data about online office hours interactions. 
We wanted to determine whether students' behavior changed
with the move to online office hours and whether we should
keep some online office hours sessions once we resume
in-person instruction.


We first analyzed whether the online session attracted more 
students to seek help during office hours. For an in-person 
session, students need to physically find the teaching staff 
in the office and physically stay in line. For online sessions, 
students only need to click the link to join the meeting with 
teaching staff. With online office hours, the friction of phys- 
ically going to a campus location has been removed. How- 
ever, online office hours have additional overhead in creating 
a connection between parties and transitioning between stu- 
dents. To better understand online office hour help-seeking, 
we calculated the average number of tickets per student
and the percentage of students who used office hours in each
semester, and compared the earlier semesters with in-person
office hours to the Fall 2020 semester with online office hours.
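As a rough illustration of the first of these measures, the sketch below divides the ticket counts in Table 1 by the student counts in the same table. This is arithmetic on the published table only, not the authors' reported analysis, and it assumes the student counts in Table 1 are the appropriate denominator; the percentage of students who used office hours cannot be derived from Table 1 and is not shown.

```python
# (tickets, students) per semester, copied from Table 1.
TABLE_1 = {
    "F17": (1146, 208),
    "S18": (609, 157),
    "F18": (1224, 259),
    "S19": (860, 174),
    "F19": (1650, 256),
    "S20": (1401, 191),
    "F20": (3452, 308),
}

tickets_per_student = {
    semester: tickets / students
    for semester, (tickets, students) in TABLE_1.items()
}

for semester, ratio in tickets_per_student.items():
    print(f"{semester}: {ratio:.1f} tickets per student")
```

On these figures the online semester (F20) has the highest ratio, at roughly 11 tickets per student.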




Table 6: LightGBM common training parameters and the final optimal value after tuning our model


Parameter        Meaning                                                        Final optimal value
num_leaves       Number of leaves in one tree                                   1000
max_depth        Maximum depth to which a tree will grow                        10
max_bin          Maximum number of bins used to bucket the feature values       150
learning_rate    Learning rate of gradient boosting                             0.1
num_iterations   Number of boosting iterations to be performed                  32
num_class        Number of classes (used only for multi-class classification)   5


Table 7: Distribution of labeled questions across the five
categories in all seven semesters


Useless   Insufficient   Sufficient   Copied Error   Test
3.01%     69.02%         12.04%       0.10%          15.83%


However, online office hours could have an impact on the effi- 
ciency of communication. When teaching staff and students 
meet online, it creates more challenges for teaching staff to 
indicate problems to the students and to help explain why 
their code is failing. Screensharing allows the staff to view
the students' work, but physical interactions like pointing
to a portion of the screen to indicate which button to click
are lost. The teaching staff member needs to verbally
describe the debugging process and ask students to follow
it. Therefore, we calculated the interaction time and the
wait time for each ticket. Then we compared the
distributions of interaction time and wait time for F20
tickets with those of the remaining tickets. Additional
overhead is also incurred when connecting to a meeting:
there is a lag when a student joins a meeting while their
audio is set up before the conversation can start.
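Both quantities follow directly from the three timestamps recorded for each ticket (Table 2). The sketch below computes them for the example ticket shown in that table; the function name and the assumed timestamp format string are illustrative.

```python
from datetime import datetime

# Assumed timestamp layout, matching the examples in Table 2.
TIME_FORMAT = "%Y-%m-%d %H:%M:%S"


def ticket_times(raised, began, ended):
    """Return (wait time, interaction time) in minutes for one ticket."""
    raised_at, began_at, ended_at = (
        datetime.strptime(value, TIME_FORMAT) for value in (raised, began, ended)
    )
    wait = (began_at - raised_at).total_seconds() / 60
    interaction = (ended_at - began_at).total_seconds() / 60
    return wait, interaction


# Example ticket from Table 2.
wait, interaction = ticket_times(
    "2019-03-08 19:34:09",  # time raised hand
    "2019-03-08 19:38:39",  # time interaction began
    "2019-03-08 20:01:02",  # time interaction ended
)
print(f"wait: {wait:.1f} min, interaction: {interaction:.1f} min")
```

For this ticket the student waited 4.5 minutes and the interaction lasted about 22.4 minutes.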


3. RESULTS 
3.1 RQ1: Categorization Results 


Table 7 shows the distribution of question categories across 
our dataset. As the table shows, the most common category is
Insufficient, which accounts for over 69 percent of the
questions. The Test category comes next at nearly 16
percent. Approximately 12 percent of the questions belong to
the Sufficient category while 3 percent were rated as
Useless. Surprisingly, despite comments from the teaching
staff, the least common category was Copied Error with at
most 10-15 questions per semester falling into this group. As our
results show, the students tended to use the system primar- 
ily as a way of getting in line and typically provided little 
useful information for the teaching staff. These results also 
highlight the significance of testing tasks for the assignments 
and for students’ help-seeking given the high proportion of 
help tickets that are triggered by them. 


To assess the stability of these results we also examined the
frequencies within each semester. Figure 3 shows this
breakdown. We found that the relative distribution is
generally similar across semesters while the absolute
percentages vary. In more recent semesters the students have
authored more Sufficient tickets than in prior years,
suggesting that there has been greater effort by the
instructional staff to encourage good communication. Yet the
persistence of the other ticket types suggests that
automatic classification and triage remain an important
feature.


Table 8: Average interaction time (in minutes) and standard
deviation for each category


       Useless   Insufficient   Sufficient   Copied Error   Test
AVG    19.7      21.9           18.3         11.5           24.8
STD    125.3     237.0          103.6        67.5           208.2


Table 8 shows the average and standard deviation of
interaction time (the time between when the interaction
began and when the ticket was closed) for each category
across the semesters. For this calculation we excluded
tickets with an interaction time shorter than 10 seconds or
longer than one hour. Our discussion with teaching staff
and the instructors showed that the former were cases that 
were never seen as the student set a placeholder but fixed 
their problem before their turn came up or changed their 
mind, while the latter represents cases where the teaching 
staff offered help but did not close the ticket, often until 
well after the tutoring session was over. The Useless, In- 
sufficient and Sufficient categories averaged around twenty 
minutes in length with no meaningful difference in their in- 
teraction times. The Copied Error category was slightly 
shorter on average which may reflect the specificity of the 
students’ problems while the Test category had a slightly 
longer average interaction time. This may indicate that this 
kind of question is more complex or more substantive rel- 
ative to the others. Overall these results indicate that the 
amount of information provided does not necessarily affect 
the speed with which the issue can be addressed. 
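
The filtering and per-category summary described above can be sketched as follows. The input format, the function name, and the use of the population standard deviation are assumptions made for the example; the text does not state which estimator was used.

```python
from collections import defaultdict
from statistics import mean, pstdev

MIN_SECONDS = 10       # shorter tickets were treated as never seen
MAX_SECONDS = 60 * 60  # longer tickets were treated as left open by staff


def summarize_interaction_times(records):
    """Return {category: (mean_minutes, std_minutes)} after filtering.

    `records` is an iterable of (category, interaction_seconds) pairs,
    a hypothetical input format chosen for this sketch.
    """
    minutes_by_category = defaultdict(list)
    for category, seconds in records:
        if MIN_SECONDS <= seconds <= MAX_SECONDS:
            minutes_by_category[category].append(seconds / 60.0)
    return {
        category: (mean(values), pstdev(values))
        for category, values in minutes_by_category.items()
    }
```

For example, a category with interaction times of 5, 600, 1800, and 7200 seconds would keep only the 600 and 1800 second tickets, giving a mean of 20 minutes.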


3.2 RQ2: Modeling Results 


To evaluate the performance of our model, we trained the
model using the first six semesters’ data and tested it on
Fall 2020 data. We applied the SMOTE method to the training
dataset to oversample the minority categories, which
resulted in each category having the same number (5037) of
questions in the training dataset. The model achieved an
overall accuracy of 91.8%. We then conducted a more detailed
analysis of precision, recall, and F-score for each question
type. The results are in Table 9. As our results show, the
model is relatively balanced across the categories with the
exception of the Test category, which had substantially
higher precision and lower recall. This indicates that Test
tickets were more likely to be erroneously classified as
other categories than other tickets were to be classified as
Test. One comment that triggered this error was: “Jenkins
errors”.


Figure 3: Frequency of each category for each semester
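
To make the oversampling step concrete, the sketch below illustrates the core idea of SMOTE: each synthetic minority example is an interpolation between an existing example and one of its nearest same-class neighbours. It is a simplified, pure-Python illustration operating on numeric feature vectors, not the reference SMOTE implementation and not the code used in our experiments; the function name and parameters are hypothetical.

```python
import random
from collections import defaultdict


def smote_like_oversample(X, y, k=3, seed=0):
    """Balance classes by interpolating between same-class neighbours.

    Simplified illustration of the SMOTE idea; every class is grown to
    the size of the largest class.
    """
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for features, label in zip(X, y):
        by_class[label].append(list(features))
    target = max(len(rows) for rows in by_class.values())

    X_out, y_out = [list(row) for row in X], list(y)
    for label, rows in by_class.items():
        for _ in range(target - len(rows)):
            base = rng.choice(rows)
            # Candidate neighbours: other examples of the same class,
            # sorted by squared Euclidean distance to the base example.
            others = [r for r in rows if r is not base]
            if others:
                others.sort(key=lambda r: sum((a - b) ** 2 for a, b in zip(base, r)))
                neighbour = rng.choice(others[:k])
            else:
                neighbour = base  # single-example class: duplicate it
            gap = rng.random()
            synthetic = [a + gap * (b - a) for a, b in zip(base, neighbour)]
            X_out.append(synthetic)
            y_out.append(label)
    return X_out, y_out
```

Oversampling of this kind is applied to the training data only, so that the held-out semester keeps its natural category distribution.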


Table 9: Precision, Recall and F-measure for each category


            Useless   Insufficient   Sufficient   Test    AVG
Precision   0.902     0.915          0.913        0.967   0.922
Recall      0.901     0.903          0.905        0.877   0.901
F-measure   0.896     0.894          0.901        0.914   0.901


3.3 RQ3: Leave One Out Results


Having shown that the model is relatively balanced across
categories we then analyzed the relative accuracy of the
model on each semester. The results are shown in Table
10. With the exception of Fall 2019, the model achieved
an accuracy of at least 0.907 in every semester. Fall 2019
was the largest and busiest semester in our training
dataset, which may indicate that students were more diverse
in their habits or posting behavior, but even so we achieved
an accuracy of 0.899. In light of these results we believe
that our modeling method is sufficiently stable to assist in
processing unseen semesters without a cold-start problem.


Table 10: Leave one out accuracy result


Left-out semester   F17     S18     F18     S19     F19     S20
Accuracy            0.913   0.915   0.908   0.907   0.899   0.912


3.4 RQ4: Online Office Hour Analysis Results


Table 11 shows the various summary measures associated
with office hours interactions for the semesters studied. The
Fall 2020 semester had the highest number of average tickets
per student and the largest percentage of students utilizing
office hours. Additionally, the average tickets per student
in Fall 2020 is nearly twice the average tickets per student
in Fall 2019. This suggests that online office hours support
increased student participation in office hours interactions.
However, this increment could be explained by other
factors. The data shows that in general more students take
advantage of the help each year. This is consistent with the
increasing class sizes, but it is important to note that
there are other patterns as well, such as regular dips in
each spring. Overall it serves to highlight the need for
better course management. Secondly, since the course
lectures were also held online in F20, the increase in
office hour usage supports a general expectation that
students are facing additional challenges with online
classes. However, we still believe that online office hours
are a practical means to minimize the cost of help-seeking
and thus encourage more students to seek help.


The distribution of interaction times shown in Figure 4
indicates that the teaching staff generally took slightly
longer to support students in Fall 2020 than in other
semesters. The median interaction time in Fall 2020 is 8.78
minutes, while in other semesters it is 8.17 minutes. We
believe that this is caused by the inconvenience of remote
instruction and debugging. Additionally, the high
percentages of interactions within one minute in all
semesters are usually caused by teaching staff forgetting to
open the tickets when the interaction begins. In Fall 2020,
the percentage of short tickets was much higher, suggesting
the teaching staff were more likely to make such mistakes
because they were working with both MDH and the online
interaction tool. After notifying the student, the teaching
staff member has to wait in Zoom until the student actually
joins the Zoom meeting to start the interaction. It is very
common for the teaching staff to simply start the
interaction in Zoom without opening the ticket in MDH.


The wait time in Fall 2020 is much longer than in regular
semesters, as shown in Figure 5. While we had 37 hours of
regular office hour time each week for 17 TAs and 2
instructors, the increased demand in office hour
help-seeking did have an impact on student wait time and
throughput. We additionally observed an increase in the
number of canceled tickets. While online office hours lower
barriers for student attendance, additional resources to
support demand are needed to ensure timely support. As we
transition back to in-person instruction, the course
instructor will continue to offer some online office hours
to support access to help-seeking.


Figure 4: Histogram of interaction time (time difference between open time and close time) for tickets in (a) regular semesters and (b) Fall 2020


Figure 5: Histogram of wait time (time difference between open time and raised time) for tickets in (a) regular semesters and (b) Fall 2020


Table 11: Statistical analysis for each semester


                                            F17     S18     F18     S19     F19     S20     F20
Total tickets                               1146    609     1224    860     1650    1401    3452
Total students                              208     157     259     174     256     191     303
Students using office hours                 104     63      141     84      158     108     209
Average tickets per student                 5.51    3.88    4.73    4.94    6.44    7.34    11.40
Percentage of students using office hours   50.0%   40.1%   54.4%   48.3%   61.7%   56.5%   69.0%
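
As a rough check on the derived rows of Table 11, the sketch below recomputes them from the raw counts. It assumes that "average tickets per student" is total tickets divided by total enrolled students, which appears to reproduce the reported values to within rounding; the variable and function names are illustrative.

```python
# Raw counts copied from Table 11.
SEMESTERS = ["F17", "S18", "F18", "S19", "F19", "S20", "F20"]
TOTAL_TICKETS = [1146, 609, 1224, 860, 1650, 1401, 3452]
TOTAL_STUDENTS = [208, 157, 259, 174, 256, 191, 303]
OFFICE_HOUR_STUDENTS = [104, 63, 141, 84, 158, 108, 209]


def derived_measures():
    """Recompute the two derived rows of Table 11 for every semester."""
    rows = {}
    for name, tickets, students, users in zip(
        SEMESTERS, TOTAL_TICKETS, TOTAL_STUDENTS, OFFICE_HOUR_STUDENTS
    ):
        rows[name] = {
            # Assumed denominator: all enrolled students, not only ticket authors.
            "tickets_per_student": tickets / students,
            "pct_using_office_hours": 100.0 * users / students,
        }
    return rows


def fall_growth_ratio(rows):
    """Ratio of Fall 2020 to Fall 2019 average tickets per student."""
    return rows["F20"]["tickets_per_student"] / rows["F19"]["tickets_per_student"]
```

Under this assumption the Fall 2020 to Fall 2019 ratio comes out between 1.7 and 1.8, in line with the "nearly twice" comparison made in the text.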


4. CONCLUSIONS 


The results of our analysis in RQ1 show that students’ use
of the MDH platform varies substantially from the
developers’ intentions. Far from using it to write complex
help requests, many use it merely to reserve a place in
line. Of the five categories, our labeling showed that most
of the help tickets submitted lack sufficient detail to make
clear what the student is asking about, while those that do
provide detail are most commonly focused on test cases,
which constitute a major feature of the class. Contrary to
our initial expectations, the students rarely use the system
to enter specific error messages, even if they have them.
Thus the teaching staff have relatively little to go on when
triaging questions. The proportion of useful information in
the tickets increased in more recent years of the course,
but insufficient detail remains the most common feature.
Despite this, our results also show that there are clear
categories of use that we can build upon to assist teaching
staff. Our results also suggest that it may be possible to
extend the system with minimal automated interventions, such
as detectors for word counts or grammar, that can be used to
scaffold, or simply enforce, good posting behavior.


Informed by our analysis, we were able to address our
modeling questions, RQ2 and RQ3, by developing accurate and
robust classification models that achieved an overall
accuracy of 92.6% and leave-one-out accuracies of 0.899 to
0.915. Moreover, the results are robust on a per-category
basis. While these results are not perfect, they show that
we have the potential to use models of this type for
effective triage of student questions as well as to provide
scaffolding and immediate guidance for students as they
author help tickets. While such guidance has not been
evaluated for its educational impact, prior work on
self-explanation (e.g. [23]) leads us to conclude that it
may help students to diagnose their own challenges.


For RQ4, the comparison between an online semester and
regular semesters shows both the strengths and weaknesses
of online office hours. The advantage of hosting office
hours online is that it can encourage students to utilize
the help-seeking resources. However, the large number of
help-seeking requests can overload the teaching staff, and
remote debugging through screen sharing is clearly less
efficient than face-to-face interaction.


5. LIMITATIONS 


There are several limitations to our work that must be ac- 
knowledged. While our results span semesters, they are still 
taken from a single course with a single instructor. As a 
consequence our results are necessarily dependent on the 
training that students have received and it is not yet clear 
whether this stability will be apparent in models created 
from interaction data for other courses, particularly those 
that are not as large, do not use the same assignment struc- 
ture, or rely so heavily on tests. 


Additionally, in our analysis of online office hours, we
did not consider that teaching lectures online could itself
raise more challenges for students and thus increase the
usage of office hours. Our conclusion that online office
hours encourage students to seek help is based on the
assumption that there is no significant difference in
academic difficulty between F20 and other semesters.


6. FUTURE WORK 


This research can support future instructors in course
management and in automatic categorization for the MDH
system.
We therefore plan to address these limitations, expand our 
dataset, and build upon the models that we have obtained. 
First, we plan to conduct a more robust process of tagging 
and classifying our tickets with the goal of assessing the sta- 
bility of our categories with other evaluators and of identify- 
ing other important ways of grouping the tickets themselves. 


Second, we will extend My Digital Hand to take advantage 
of these trained models in supporting both the students and 
instructors. We will support instructors by providing auto- 
matic triage approaches that can help to guide their plan- 
ning. And we will use automated guidance to prompt stu- 
dents to produce better tickets in the first place. 


Third, we also plan to investigate other aspects of the office 
hours that are captured in the MDH data. These include
whether students in the same office hours post similar
tickets, thus highlighting the potential of peer feedback,
and the presence or absence of serial ticketers, that is,
students who keep multiple follow-up tickets going to
monopolize support.
We plan to build models for these features with the goal of 
understanding how help time is being used and by extension 
how to better coordinate limited support. 


Finally, we plan to apply our models to provide automated
scaffolding for students when they provide insufficient
comments or errors. Specifically, we will integrate this
model into the MDH system, and every time a student raises a
hand, we will use our model to predict the question
category. If their description is insufficient or useless,
then we can immediately notify them to revise it. This will
help students to better frame their questions, and it will
help the teaching staff to be better prepared to answer the
students’ questions. This initial filter can be followed by
additional models to suggest debugging steps or common
answers based upon their revised question.
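
A minimal sketch of this planned filter is given below, assuming a classifier that maps ticket text to one of our category names. The function names, the prompt wording, and the word-count stand-in classifier are hypothetical and serve only to make the example runnable; the trained model would take the classifier's place.

```python
from typing import Callable, Optional

# Categories for which the student is asked to revise the ticket.
NEEDS_REVISION = {"Useless", "Insufficient"}


def review_ticket(text: str, classify: Callable[[str], str]) -> Optional[str]:
    """Return a revision prompt, or None if the ticket can join the queue."""
    category = classify(text)
    if category in NEEDS_REVISION:
        return (
            "Your description was classified as " + category
            + ". Please describe what you tried and what went wrong."
        )
    return None


def word_count_stub(text: str) -> str:
    """Stand-in classifier based on length only (hypothetical)."""
    words = text.split()
    if not words:
        return "Useless"
    return "Sufficient" if len(words) >= 8 else "Insufficient"
```

Calling review_ticket when a hand is raised would return a prompt for a ticket such as "help" and None for a ticket that describes the problem in a full sentence.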


ABSTRACT 


Recent years have seen significant interest in multimodal 
frameworks for modeling learner engagement in educational 
settings. Multimodal frameworks hold particular promise for 
predicting visitor engagement in interactive science museum 
exhibits. Multimodal models often utilize video data to capture 
learner behavior, but video cameras are not always feasible, or even 
desirable, to use in museums. To address this issue while still 
harnessing the predictive capacities of multimodal models, we 
investigate adversarial discriminative domain adaptation for 
generating modality-invariant representations of both unimodal and 
multimodal data captured from museum visitors as they engage 
with interactive science museum exhibits. This approach enables 
the use of pre-trained multimodal visitor engagement models in 
circumstances where multimodal instrumentation is not available. 
We evaluate the visitor engagement models in terms of early 
prediction performance using exhibit interaction and facial 
expression data captured during visitor interactions with a science 
museum exhibit for environmental sustainability. Through the use 
of modality-invariant data representations generated by the 
adversarial discriminative domain adaptation framework, we find 
that pre-trained multimodal models achieve competitive predictive 
performance on interaction-only data compared to models 
evaluated using complete multimodal data. The multimodal 
framework outperforms unimodal and non-adapted baseline 
approaches during early intervals of exhibit interactions as well as 
entire interaction sequences. 


Keywords 


Museum learning, visitor engagement, adversarial domain 
adaptation, early prediction, multimodal learning analytics. 


1. INTRODUCTION 


Visitor engagement is critical in museum learning [21]. 
Engagement defines how visitors experience museums, including 
how they move between exhibits, form and express interests, and 
acquire knowledge and understanding. Developing computational 
models of museum visitor engagement holds significant promise 
for identifying salient patterns of visitor behavior as well as 
providing insight into how specific exhibits can be designed to 
enhance engagement. For example, visitor analytics show potential 
for enabling adaptive learning experiences tailored to the 
preferences and tendencies of the visitors, leading to highly 
engaged interactions with the exhibit. Visitor interactions with 
museum exhibits are inherently multimodal. Visitor engagement 
manifests through a variety of behaviors such as facial expression, 
touch, eye gaze, and body posture. As such, multimodal learning 
analytics can model museum visitor engagement by capturing and 
analyzing visitor behavior from several different perspectives [2, 
16]. Multimodal models of learner engagement have been shown to 
be effective in a range of environments, including laboratory [8, 22] 
and classroom settings [1, 6, 7]. More recently, multimodal 
learning analytics have been the subject of growing attention in 
informal education settings, such as museums [16, 20], but this line 
of investigation is still in its nascent stages. 


Given the multimodal nature of visitor interactions in museums, the 
use of multichannel data provides important benefits for modeling 
visitor engagement. In particular, multimodal models can be used 
to predict visitor engagement early during a visitor’s interaction 
with an exhibit [16]. This shows promise for enabling visitor- 
adaptive technologies that provide adaptive support for fostering 
engaged learning experiences with an exhibit or for notifying 
museum educators to inform decisions about staffing the museum 
floor. In predictive modeling, it is important that the multimodal 
visitor engagement models be evaluated in terms of both predictive 
accuracy and the minimum amount of time that the models require 
to achieve robust predictive performance. 


Multimodal modeling of visitor engagement in museums also poses 
significant challenges. Interactions with exhibits are highly variable 
due to the free-choice nature of museum learning [12, 25, 28]. 
Additionally, multimodal frameworks often utilize physical sensors 
(e.g., video cameras, motion sensors, eye trackers), which introduce 
questions about scalability, privacy, and mistracking. Intrusiveness 
is also a concern, as suites of multimodal sensors may be 
impractical in some settings, or they may adversely affect the 
natural behavior of visitors [32]. 


Transfer learning presents itself as a natural solution to this issue, 
as the various modalities in a multimodal modeling framework 
share a common predictive task. In particular, recent years have 
seen an increased emphasis on domain adaptation, a type of transfer 
learning that investigates the predictive capacity of models that are 
pre-trained on one domain (source domain) and are subsequently 
reweighted to perform similarly on another domain with a different 
distribution (target domain) across a single common task [39]. A 
primary objective of domain adaptation is to obtain a domain- 
invariant representation of the salient features extracted from the 
two distinct data sources, where the shared feature space allows for 
improved predictive performance on data points from the target 
domain while still maintaining strong performance on data from the 
source domain. Examples of recent domain adaptation research 
include adapting across images extracted from different domains 
[34, 42] or across modalities captured from different data channels 
such as RGB-to-depth image translation [33, 42]. 


In this work, we investigate the use of domain adaptation as a 
method of translating unimodal, interaction-based data to a 
domain-invariant representation that can be used with predictive 
models previously trained on multimodal data. We demonstrate the 
effectiveness of a multimodal domain adaptation framework for 
making early predictions of visitor dwell times at an interactive 
museum exhibit. Our multimodal analytics framework is designed 
to operate in museum settings where sensor-based data capture may 
be restricted or otherwise impractical. We adopt an adversarial 
approach to generating domain-invariant representations of 
multimodal data (exhibit interactions and facial expression serving 
as the source domain) and unimodal data (exhibit interactions 
serving as the target domain) that are encoded using stacked 
denoising autoencoders. Empirical results of evaluations of the 
framework suggest that the use of adversarial discriminative 
domain adaptation allows for a unimodal target encoder to be 
trained to share a latent feature space with a multimodal source 
encoder [42]. The framework achieves higher performance than an 
interaction-only baseline model in terms of early prediction and 
visitor-level prediction of dwell time, a proxy indicator of visitor 
behavioral engagement with an exhibit. Dwell time has been 
frequently used to quantify visitor engagement in museum settings 
[5, 23]. The framework offers the potential to accurately predict 
visitor dwell time in museums, while also allowing for operation 
with reduced availability of physical sensor data, or even when no 
physical sensor data is available. 


2. RELATED WORK 


Visitor engagement is a critical aspect of learning in informal 
learning environments, such as science centers and museums [21]. 
Engagement shapes how visitors proceed throughout a museum, 
and interact with various exhibits [16]. There has been substantial 
work on modeling engagement in formal learning environments 
such as classrooms [19] and laboratories [8], and this focus has 
expanded in recent years to informal learning environments. This 
includes research efforts focused on analyzing engagement in 
groups of visitors around interactive tabletop exhibits [5], 
investigating the effectiveness of interventions for enhancing group 
engagement at different diorama exhibits [23], and predicting 
visitor dwell time [16]. However, devising computational models 
of museum visitor engagement remains a relatively unexplored area 
and presents distinctive challenges due to the free-choice nature of 
visitor learning in museums, creating a need for data-rich 
engagement modeling techniques. 


Multimodal engagement modeling has shown significant promise 
as an engagement modeling approach due to its capacity to provide 
a data-rich multi-dimensional perspective on learner behavior [2]. 
In many cases, multimodal models lead to improved performance 
compared to models that utilize a single modality [19, 22, 32, 49]. 
Multimodal models have often utilized several diverse data 
channels when deployed in formal learning environments, 
including facial expression, posture, eye gaze, dialogue, and 
interaction trace data [40]. Facial expression data is commonly used 
in multimodal learner models of student affect [7] and performance 
[44]. Posture data has also been used for affect detection [22] as 
well as predicting learners’ levels of engagement with Massive 
Open Online Courses (MOOCs) [9]. Eye gaze data has been 
combined with facial expression and head pose data to train models 
for continuous emotion prediction [48], while dialogue data has 
been utilized to predict dropout in online K-12 courses [26]. 
Finally, interaction trace logs and keystroke data have been used in 
conjunction with facial expression data to detect confusion in 
students engaging with an introductory computer science education 
learning environment to provide adaptive feedback and support 
dynamic adjustment of exercise difficulty levels [6]. While recent 
work has investigated multimodal approaches to modeling visitor 
engagement in museums [16], multimodal approaches to museum 
visitor modeling pose significant challenges, as these frameworks 
often necessitate physical, sensor-based data capture. This 
introduces various ethical and logistical concerns and may be 
impractical or prohibitive in certain informal learning 
environments. 


Computational methods such as transfer learning, and particularly 
domain adaptation, provide a way to harness the predictive 
capacities of multimodal learning analytics while allowing visitor 
modeling frameworks to operationalize a reduced number of more 
intrusive modalities. Domain adaptation and transfer learning have 
shown significant potential in a variety of implementations, and 
have been utilized within educational contexts for tasks such as 
confusion detection in online forums for different online courses 
[50] and automated essay scoring across different prompts [35]. 
Additionally, domain adaptation has been investigated within 
multimodal contexts such as RGB and depth images [42], as well 
as video and audio modalities [36]. To our knowledge, adversarial 
domain adaptation has not been applied to unimodal and 
multimodal data to model learner engagement in museums. 


Recent domain adaptation work has focused primarily on an 
unsupervised or semi-supervised variation of this problem, where 
deep learning models trained on a labeled source dataset are 
transferred to share latent representations alongside a target domain 
that may contain little or no previously labeled data. The issue of 
missing labels for the target domain data is addressed by obtaining 
a domain-invariant representation through minimizing the distance 
between the learned data representations between the two domains 
[17, 41, 42]. While prior efforts accomplish this task through 
statistical measures such as the Maximum Mean Discrepancy 
(MMD) [43] or the deep Correlation Alignment (CORAL) [39], 
other work has taken an adversarial approach, with the 
simultaneous goals of learning a data representation that is 
predictive of the source domain labels while also being 
indistinguishable to a domain discrimination model [27, 42]. One 
approach involves reversing the gradients of a domain 
discrimination model to maximize the model’s loss and guide the 
learning to explore a domain-invariant representation [17]. Other 
approaches train a source encoder to reduce the source domain data 
to a latent representation and use a domain discriminator to 
adversarially train a target encoder to produce a latent 
representation of the target domain data that is indistinguishable to 
the discriminator [42]. The trained target encoder is subsequently 
used to process unlabeled data from the target domain to be 
classified by a model pre-trained on source data. Another approach 
is the Co-GAN approach, which involves two Generative 
Adversarial Networks (GANs) that generate source and target data, 
respectively [27]. The high layer parameters of the two GAN 
models are tied together, allowing the generators of the models to 
co-learn a shared latent representation while possibly sharing a 
common input noise vector. 
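
To illustrate the statistical alignment criteria mentioned above, the following sketch computes a biased estimate of the squared Maximum Mean Discrepancy between two samples of latent representations using a Gaussian kernel. It is an illustration of the distance measure only, not of the adversarial approach adopted in this work; the function names and the gamma bandwidth parameter are assumptions made for the example.

```python
import math


def rbf_kernel(x, y, gamma=1.0):
    """Gaussian (RBF) kernel between two feature vectors."""
    sq_dist = sum((a - b) ** 2 for a, b in zip(x, y))
    return math.exp(-gamma * sq_dist)


def mean_kernel(A, B, gamma):
    """Average kernel value over all pairs drawn from samples A and B."""
    total = sum(rbf_kernel(a, b, gamma) for a in A for b in B)
    return total / (len(A) * len(B))


def mmd_squared(source, target, gamma=1.0):
    """Biased estimate of squared MMD between two sets of representations."""
    return (
        mean_kernel(source, source, gamma)
        + mean_kernel(target, target, gamma)
        - 2.0 * mean_kernel(source, target, gamma)
    )
```

The estimate is zero when the two samples coincide and grows as the source and target representations move apart, which is why it can serve as a loss term that encourages domain-invariant features.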


Early prediction is an important component of visitor modeling 
because it can drive run-time adaptive support to enhance visitor 
interest and engagement with interactive exhibits. A critical 
objective in early prediction is to reach a certain accuracy threshold 
in a timely manner. Early prediction has been investigated in the 
context of formal learning environments, such as predicting 
middle-grade learner engagement with a game-based learning 
environment [47], evolving learning goals throughout students’ 
interaction trajectories [31], and student success in novice 
programming tasks [29]. Early prediction has also been the subject 
of prior work on museum learning, such as automatic detection of 
visitors’ social behavioral patterns [13, 24] and multimodal 
regression-based modeling of visitor engagement in science 
museums [16]. 


The primary contributions of this work are as follows: (1) we 
demonstrate improved predictive performance of multimodal 
models of museum visitor dwell time using facial expression and 
interaction data compared to interaction-only baselines, (2) we 
evaluate the effectiveness of adversarial discriminative domain 
adaptation as a means of enabling the use of previously-trained 
multimodal models with unimodal data, and (3) we investigate the 
performance of each visitor engagement model using convergence- 
based early prediction metrics and standard predictive performance 
measures. Domain adaptation has been relatively underexplored 
with educational data, and this is especially true of data from 
informal learning environments such as museums. Furthermore, 
there has been limited work investigating domain adaptation in the 
context of early prediction of learner engagement. Our work shows 
that domain adaptation is effective at enhancing prediction of 
visitor dwell time by harnessing the capacities of multimodal 
visitor modeling, which leads to higher predictive accuracy when 
compared to unimodal models. 


3. FUTURE WORLDS EXHIBIT 


To investigate multimodal predictive models of museum visitor 
engagement, we use data collected from visitor interactions with a 
game-based museum exhibit, FUTURE WORLDS, which is designed 
to introduce visitors to concepts about environmental sustainability 
(Figure 1). FUTURE WORLDS runs on a multi-touch display, 
enabling visitors to interact with the virtual environment through 
touch and gestures on the screen. Visitors are faced with the 
challenge of improving the conditions of the virtual environment’s 
biosphere through a series of changes to elements such as farming
practices and energy sources within the game. FUTURE WORLDS and its integrated 
educational content are targeted towards learners ages 10-11. 


Visitors can tap or swipe on the screen to perform certain actions 
such as reading about a particular aspect of the virtual environment 
and its impact on sustainability or modifying an in-game element 
and observing the broader consequences of this decision on the 
environment. Upon making a change to the virtual environment, the 
visitor is given immediate feedback regarding the positive or 
negative impact of the gameplay action. A visitor can “win” by 
making the correct changes to certain in-game elements that 
maximize the environmental sustainability of the virtual 
environment. Afterwards, the visitor is presented with the option to 
restart the game or continue interacting with the virtual 
environment in its completed state. Additionally, a visitor is able to 
leave the FUTURE WORLDS exhibit without having completed the game. 
Prior work with FUTURE WORLDS found that visitors 
improved their understanding of environmental sustainability 
concepts, while also demonstrating high levels of engagement 
throughout their interactions with the exhibit [37]. 


Figure 1. Gameplay of the FUTURE WORLDS interactive exhibit, including (A) 3D virtual environment, (B) selecting an element to 
modify, (C) viewing information about the selected element, and (D) correctly solving the in-game problem. 


4. MULTIMODAL DATA COLLECTION 


To track visitor engagement and behavior with FUTURE WORLDS, 
the exhibit was instrumented with several sensors to collect the 
real-time behavior of visitors’ interactions with the exhibit, as 
shown in Figure 2. We first describe the visitor population for study 
participants and then introduce the two modalities used for the 
domain adaptation approach (facial expression, exhibit interaction 
trace logs), and the features extracted from each input data channel. 


Figure 2. Visitor interacting with FUTURE WORLDS (panel labels: Facial Expressions, Exhibit Interaction Logs).


4.1 Study Participants and Procedure 

We conducted a study of visitor interactions with the FUTURE 
WORLDS exhibit at the North Carolina Museum of Natural Sciences 
in Raleigh, North Carolina. The data were collected over a series of 
three sessions with different school groups of visitors aged 10-11 
(M=10.4, SD=0.57). The school groups came from different socio- 
cultural backgrounds (e.g., race/ethnicity), and each school served 
student populations where 70% of the students are from low- 
income families. In total, 116 visitors interacted with FUTURE 
WORLDS. There were 47 female and 55 male participants, with 14 
participants who did not provide data on their gender. The visitors 
were 32.4% Hispanic or Latino, 21.6% African American, 11.8% 
American Indian, 8% Asian, 7.5% mixed race, 3% Caucasian, and 
15.7% preferred not to respond. Before interacting with the exhibit, 
visitors were asked to complete a series of surveys and 
questionnaires, including a demographic survey, sustainability 
content knowledge assessment, and the Fascination in Science 
scale [11]. Afterward, visitors interacted with the exhibit until they 
wanted to stop or after approximately 12 minutes had elapsed 
(M=5.8, SD=2.4, Min=1.8, Max=11.8). Visitor dwell times were 
captured by the game’s internal logging functionalities. Once 
visitors finished their interaction with the exhibit, they were asked 
to complete a sustainability content knowledge assessment, 
engagement survey, and a short debrief interview. Several visitors 
were missing one or multiple data channels (e.g., facial 
mistracking), requiring the removal of their data from the final 
dataset for analysis. The final dataset that was used for the 
predictive models in this paper consisted of multimodal data from 
79 visitors. 


During data collection, the visitors’ body movement, eye gaze,
facial expression, and interaction data from the exhibit were
captured. For this study, we focus exclusively on the exhibit
interaction data and the facial expression data. We selected the
exhibit interaction data due to its unobtrusive nature and its relative
ease of data capture, as the trace data is captured in the background
by the exhibit software and does not require any physical sensors
or calibration. We selected facial expression data because of its
predictive utility in previous work on unimodal and multimodal
models of learner engagement [14, 15].


4.1.1 Facial Expression 

Facial expression is an important indicator of learner emotion, and 
it has been widely used in previous studies on modeling learner 
engagement [46]. In this work, visitor facial expression was 
captured using video data from an externally mounted Logitech 
C920 USB webcam. The video data was processed in real time by
OpenFace, an open-source facial behavior analysis toolkit, to detect
facial landmarks, estimate head pose, recognize facial action units
(AUs), and estimate eye gaze [3]. The OpenFace software
automatically detects and analyzes 17 distinct AUs for each 
visitor’s face captured within the camera’s field of view. 


4.1.2 Interaction Trace Logs 

FUTURE WORLDS includes built-in logging functionalities to 
capture fine-grained logs of visitor interactions with the exhibit. 
The interaction trace logs consist of sequential records (at the 
millisecond level) of physical interactions with the multi-touch 
display (e.g., taps, swipes, and gestures), as well as specific in-
game learning events (e.g., altering the virtual environment and
accessing embedded informational resources). The interaction
trace logs are used to investigate how visitors interacted with the 
exhibit and progressed through the game. 


4.2 Multimodal Features 


Using both visitors’ facial expression and exhibit interaction 
behavior, we distilled two sets of features to serve as predictors of 
visitor dwell time. Many of the extracted features for each modality 
were chosen based on their predictive performance in prior work 
on multimodal learning analytics [16]. 


4.2.1 Facial Expression 

Using the processed AU data from OpenFace, we calculated the 
duration that each AU was exhibited throughout the visitor’s 
interaction with FUTURE WORLDS. We first standardized each
visitor’s observed AU intensity values and then calculated the 
duration of each AU during time intervals where consecutive AU 
intensity values were at least one standard deviation greater than 
the mean of that particular visitor-specific AU feature. This 
filtering process ensured that only spikes relative to the specific 
visitor’s AU values contributed towards the calculation of the total 
duration. To further filter the AU durations, we only recorded the 
duration if the AU was present for longer than 0.5 consecutive 
seconds. This avoided possible micro-expressions that could add 
noise to the overall data channel [38]. We performed this filtering 
process for all 17 AUs tracked by OpenFace. In addition, we 
generated the standard deviation and maximum AU values across 
the visitor’s interactions up to the current timestamp. In total, we 
extracted and distilled 51 facial expression-related features. 
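
The AU duration filter described above can be sketched as follows. This is a minimal illustration rather than the authors' implementation: the function name `au_spike_duration`, the `frame_rate_hz` parameter, and the use of the population standard deviation are assumptions, since the text does not state the sampling rate or the exact estimator.

```python
from statistics import mean, pstdev


def au_spike_duration(intensities, frame_rate_hz, min_seconds=0.5):
    """Total seconds one AU spends at least 1 SD above the visitor's own mean.

    Only runs of consecutive frames lasting longer than `min_seconds`
    are counted, which drops brief micro-expressions.
    `frame_rate_hz` is an assumed, caller-supplied sampling rate.
    """
    mu = mean(intensities)
    sigma = pstdev(intensities)  # assumption: population SD
    if sigma == 0:
        return 0.0  # constant signal: no spikes relative to this visitor
    threshold = mu + sigma  # equivalent to a standardized value >= 1
    min_frames = min_seconds * frame_rate_hz

    total_frames = 0
    run = 0  # length of the current run of above-threshold frames
    for value in intensities:
        if value >= threshold:
            run += 1
        else:
            if run > min_frames:
                total_frames += run
            run = 0
    if run > min_frames:  # close a run that reaches the final frame
        total_frames += run
    return total_frames / frame_rate_hz


# Hypothetical 10 Hz signal: one 0.8 s spike (kept) and one 0.3 s spike (dropped).
signal = [0] * 20 + [5] * 8 + [0] * 20 + [5] * 3 + [0] * 20
duration = au_spike_duration(signal, frame_rate_hz=10)
```

Applying the same function to each of the 17 AUs would yield the 17 duration features; the standard deviation and maximum features are computed separately.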


4.2.2 Exhibit Trace Logs

We distilled eight features from the exhibit interaction data: (1) the 
total number of times a visitor tapped the FUTURE WORLDS multi- 
touch display, (2) the total number of times a visitor tapped 
informational tiles about environmental sustainability concepts, (3) 
the total duration of time an informational tile was open, (4) the 
total duration spent with labeled sustainability images displayed 
onscreen, (5) the total duration of time that a visitor spent directly 
interacting with the 3D simulated environment in FUTURE WORLDS, 
(6) the total number of times a visitor swiped the interface to
explore alternative options for modifying the simulated
environment, (7) the total number of times the simulated
environment was modified, and (8) a binary feature that indicated
whether a visitor had successfully solved the current environmental
problem scenario in FUTURE WORLDS.


5. DOMAIN ADAPTATION 


In this work, we present an unsupervised, adversarial 
discriminative domain adaptation approach that enables the use of 
multimodal visitor engagement models in settings where only 
unimodal data streams are available. In unsupervised domain
adaptation, two datasets are extracted from two separate domains:
(1) a source domain ($s$), from which data samples $X_s$ and associated
labels $Y_s$ are drawn, and (2) a target domain ($t$), which contains
unlabeled data samples $X_t$. It is also assumed that there exists a
classifier $C_s$ that has been previously trained on the source data $X_s$
and source labels $Y_s$ by learning a latent mapping $M_s$. The primary
objective of the unsupervised domain adaptation approach is to
learn a latent mapping $M_t$ so that $M_t(X_t)$ can be correctly classified
by $C_s$, despite the absence of any associated labels for $X_t$.


The purpose of adversarial training within the domain adaptation 
framework is to learn a domain-invariant data representation that 
minimizes the distance between $M_s(X_s)$ and $M_t(X_t)$. This is
accomplished through a separate binary discriminator, $D$, that is
trained to distinguish between latent representations of the source
domain and the target domain. The discriminator is optimized
according to a standard cross-entropy loss function (Equation 1):


$$\mathcal{L}_{DISC}(X_s, X_t, M_s, M_t) = -\mathbb{E}_{x_s \sim X_s}\left[\log D(M_s(x_s))\right] - \mathbb{E}_{x_t \sim X_t}\left[\log\left(1 - D(M_t(x_t))\right)\right] \qquad (1)$$


Adversarial domain adaptation focuses on two primary objectives 
implemented within a minimax framework: the discriminator
attempts to accurately classify a latent data representation as either
from the source domain or the target domain, while a target encoder
attempts to learn a mapping $M_t(X_t)$ that deceives the discriminator,
thus finding a latent representation that is domain-invariant but
retains enough salient characteristics to provide predictive value to
a source classifier $C_s$. To implement an adversarial loss function
within the framework, a common practice is to simply invert the
loss term when training the target encoder. This essentially reverses
the gradients for the target encoder but can consequently lead to
premature convergence and vanishing gradients [17]. A more stable
training method is to invert the labels used to train the target
encoder. This creates two distinct convergence objectives for the
two elements of the adversarial framework [42]. The discriminator
loss term remains the same as stated in Equation 1 above, while the
loss term for the target encoder becomes:


$$\mathcal{L}_{TAR}(X_s, X_t, D) = -\mathbb{E}_{x_t \sim X_t}\left[\log D(M_t(x_t))\right] \qquad (2)$$


This process is analogous to the process utilized by generative 
adversarial networks (GANs) [18]. A GAN attempts to emulate a 
fixed data distribution by adversarially training a discriminator to 
distinguish between “fake” data, which was produced by a 
generator that aims to generate data that is synthetic but realistic 
looking using a random noise vector, and “real” data that is 
extracted from the prior fixed data distribution. While GANs have 
been utilized in domain adaptation tasks [27], they are typically 
effective when the source and target domains are relatively similar. 
GANs have shown convergence issues in scenarios involving a 
high degree of domain shift [42]. As our work involves a domain 
shift from a multimodal source domain to a unimodal target 
domain, we opt to utilize a non-generative approach for this work 
and focus exclusively on discriminative adversarial methods. It is
assumed that a pre-existing distribution of multimodal data (i.e.,
interaction trace logs + facial expression) is available to train the 
source encoder and the source classifier, while the target 
distribution consists of unlabeled unimodal data (i.e., interaction 
trace logs). This is intended to simulate scenarios where visitor 
engagement models have been previously trained on multimodal 
data but are deployed in situations where only interaction trace log 
data is available to generate new predictions of visitor engagement. 


While much prior work in adversarial domain adaptation involves 
source and target domains of similar or identical dimensionality 
(e.g., image-to-image translation), the multimodal aspect of this 
work presents a distinct challenge, as the multimodal data in the 
source domain inherently contains more features than the unimodal 
target domain. To enable the pre-trained multimodal classifier to 
handle unimodal data as input, stacked denoising autoencoders [45] 
are used to reduce the multimodal and unimodal feature vectors to 
the same dimensionality. An autoencoder is an unsupervised 
method of using feedforward neural networks to reduce an input 
vector X to a latent data representation using an encoder that 
contains a mapping function M. The autoencoder then uses a
decoder with mapping function N to reconstruct the original input
from M(X). The encoder and decoder components of the
autoencoder are both optimized by minimizing the reconstruction
loss between X and N(M(X)). A stacked autoencoder is a variation
in which the encoder and decoder each contain multiple hidden
layers. A denoising autoencoder builds on the same concept
but corrupts the input vectors using random noise injection, which 
allows effective model regularization [45]. In this work, we use a 
corruption level of 0.25 on each feature in each input vector, where 
a value is set to 0 when the input feature is corrupted. After input 
vector X undergoes random noise injection to produce X’, the 
denoising autoencoder attempts to reconstruct X from N(M(X’)). 
This allows the autoencoder to become more robust against random 
noise within the input features while also preventing the 
autoencoder from overfitting or simply learning the identity 
function. Following the optimization of the autoencoder, the 
decoder component is discarded while the encoder component is 
retained for dimensionality reduction within our data processing 
pipeline. A denoising autoencoder is shown in Figure 3.


Figure 3. A denoising autoencoder.
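
The noise-injection step of the denoising autoencoder can be illustrated with a short sketch. It assumes that a "corruption level of 0.25" means each feature is independently set to 0 with probability 0.25; the function name `corrupt` and the seeded generator are illustrative choices, not details taken from the paper.

```python
import random


def corrupt(features, corruption_level=0.25, rng=None):
    """Return a copy of `features` in which each value is zeroed
    independently with probability `corruption_level` (masking noise)."""
    rng = rng or random.Random()
    return [0.0 if rng.random() < corruption_level else x for x in features]


# Hypothetical input vector X and its corrupted counterpart X'.
rng = random.Random(0)
x = [1.0] * 10000
x_corrupted = corrupt(x, corruption_level=0.25, rng=rng)
zero_fraction = x_corrupted.count(0.0) / len(x_corrupted)  # close to 0.25
```

During training, the encoder would receive the corrupted vector while the reconstruction loss is computed against the original, uncorrupted vector.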


Our adversarial domain adaptation process is shown in Figure 4. 
Figure 4A illustrates the initial training of the classifier and the 
source encoder. The features from the facial expression and 
interaction modalities are concatenated together and then used to 
train a stacked denoising autoencoder. Following this process, the 
trained source encoder is then used to reduce the multimodal input 
data to a latent representation that is then used to train a classifier. 
The classifier receives the latent data as input and is trained to 
predict the target variable, visitor dwell time. To enable the 
adversarial training of the target encoder and discriminator (Figure 
4B), the weights of the pre-trained source encoder are fixed, and 
the target encoder weights are initialized using a pre-trained 
autoencoder optimized on the unlabeled, interaction-only data.


Figure 4. Domain adaptation process, including (A) the
classifier and source encoder training, (B) adversarial training
of the target encoder and discriminator, and (C) evaluation of
the adapted target encoder on the classifier. Dashed lines
indicate fixed model weights.


An alternative approach is to initialize the target encoder weights from
the source encoder. However, this can only be accomplished if the 
feature vectors extracted from the source domain are the same 
dimensions as the target domain. In our work, the multimodal
feature vectors from the source domain have a higher
dimensionality than the unimodal feature vectors from the target
domain, since we remove the facial expression modality from the
training data for the target encoder. The source and target encoders
are used to produce latent representations of the multimodal and
unimodal features, respectively. These representations are assigned
a label of 1 if the sample originated from the source domain
and 0 if the sample originated from the target domain. The
data/label pairs are then used to train a feedforward network serving 
as the discriminator model. The discriminator is trained to 
distinguish between latent data from the source domain and from 
the target domain, while the target encoder is simultaneously 
trained to produce latent data from the target domain that 
consistently deceives the discriminator. To evaluate the target 
encoder (Figure 4C), unimodal data is passed through the encoder, 
and the resulting encoded data is then forward propagated through 
the trained classifier shown in Figure 4A. This procedure provides 
a way to evaluate the predictive performance of a multimodal 
classifier on unimodal data. It is important to note that some amount 
of multimodal data must be present prior to deploying our 
adversarial approach in order to train the multimodal classifier as 
well as the multimodal autoencoder. 
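
To make the two adversarial objectives concrete, the sketch below evaluates the discriminator loss and the inverted-label target encoder loss on batches of discriminator outputs. It assumes the discriminator outputs the probability that a latent sample comes from the source domain (label 1), and it approximates each expectation with a batch mean. The function names and the clipping constant are illustrative.

```python
import math

EPS = 1e-12  # assumed clipping constant to avoid log(0)


def _clip(p):
    return min(max(p, EPS), 1.0 - EPS)


def discriminator_loss(d_source, d_target):
    """Cross-entropy with source labeled 1 and target labeled 0.

    d_source: D(M_s(x_s)) for a batch of encoded source samples.
    d_target: D(M_t(x_t)) for a batch of encoded target samples.
    """
    source_term = -sum(math.log(_clip(p)) for p in d_source) / len(d_source)
    target_term = -sum(math.log(1.0 - _clip(p)) for p in d_target) / len(d_target)
    return source_term + target_term


def target_encoder_loss(d_target):
    """Inverted-label loss: target samples are scored as if they were source."""
    return -sum(math.log(_clip(p)) for p in d_target) / len(d_target)


# A confident discriminator has a low loss, while the target encoder's is high.
confident_d = discriminator_loss([0.9, 0.9], [0.1, 0.1])
fooled_d = discriminator_loss([0.5, 0.5], [0.5, 0.5])
encoder_when_caught = target_encoder_loss([0.1, 0.1])
encoder_when_fooling = target_encoder_loss([0.9, 0.9])
```

The target encoder's loss falls as the discriminator assigns its outputs a higher probability of being source data, which is the sense in which the encoder learns to deceive the discriminator.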


6. METHODOLOGY 


In multimodal models of learner engagement, some modalities that 
are highly predictive of engagement can also be impractical or 
undesirable in certain educational settings, such as sensors that 
require a cumbersome calibration process or expensive specialized 
equipment. Modalities that involve the capture of video data can 
raise concerns about privacy. However, eliminating physical 
sensors and exclusive reliance on sensor-free modalities may result 
in decreased performance on some tasks and settings. We propose 
a solution to this issue that (1) allows the predictive capacities of 
multimodal models to be retained, and (2) allows for the reduction 
in use of physical sensors. This work operates under the assumption 
that multimodal data is available in at least some capacity to
facilitate the training of multimodal models prior to adversarial
domain adaptation. As a result, the ideal setting for the proposed 
framework is after an initial multimodal data collection has taken 
place, enabling pre-trained multimodal models to be deployed. 
Below we describe the methods used to preprocess the multimodal 
and unimodal data, the feature selection process utilized to select 
the data used in the prediction and adversarial tasks, and the 
approach to training and validation of the visitor engagement 
models. Finally, we present the early prediction convergence 
metrics used to evaluate the final classification models and the 
domain encoders. 


6.1 Data Preprocessing 


6.1.1 Temporal Feature Engineering 

To facilitate early prediction of visitor engagement, sequential 
representations were produced from the features engineered from 
the two modalities as described in Section 4.2. To accomplish this, 
feature vectors were engineered for every subsequent 10-second 
interval in a single visitor’s interaction session with the exhibit. For 
each feature, the average or sum of all values from 0 to 10n 
seconds was calculated, where n is the number of 10-second 
intervals that have elapsed for that feature vector. For example, if a 
visitor engaged with the exhibit for one minute, then n=6, and the 
feature vectors are generated across time intervals of 10, 20, 30, 40, 
50, and 60 seconds from the beginning of their session. This allows 
each feature vector to be a representation of a visitor’s behavior 
over their entire interaction with an exhibit up to that point. 
Additionally, this approach solves the issue of the temporal 
alignment of the separate data channels caused by differing 
sampling rates of the facial expression modality and the interaction- 
based modality. As a result, the early prediction models are able to 
make predictions at a consistent frequency across every visitor’s 
exhibit interaction trajectory (i.e., every 10 seconds). To ensure that
the additive nature of the features does not contribute to artificially
inflated model performance, each feature is scaled by the elapsed
time up to the current timestamp. This process generated
2,279 data samples for the 79 visitors.
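
The cumulative windowing can be sketched for count-type interaction features as follows. The event representation (a list of timestamped feature names) and the function name are assumptions made for illustration; the exhibit's actual log schema is not described here.

```python
def cumulative_feature_vectors(events, session_seconds, interval=10):
    """Build one feature vector per elapsed 10-second interval.

    events: list of (timestamp_seconds, feature_name) tuples (assumed format).
    Each vector counts all events from 0 up to 10n seconds and divides by
    the elapsed time, so values are rates rather than raw totals.
    Returns a list of (elapsed_seconds, {feature_name: rate}) tuples.
    """
    names = sorted({name for _, name in events})
    n_intervals = int(session_seconds // interval)
    vectors = []
    for n in range(1, n_intervals + 1):
        elapsed = n * interval
        counts = {name: 0 for name in names}
        for timestamp, name in events:
            if timestamp <= elapsed:
                counts[name] += 1
        scaled = {name: counts[name] / elapsed for name in names}
        vectors.append((elapsed, scaled))
    return vectors


# Hypothetical 30-second session with three taps and one swipe.
events = [(2, "tap"), (5, "tap"), (15, "swipe"), (25, "tap")]
vectors = cumulative_feature_vectors(events, session_seconds=30)
```

Under these assumptions, a one-minute session yields six vectors, at 10, 20, 30, 40, 50, and 60 seconds.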


6.1.2 Visitor Dwell Time 

The beginning of a visitor’s dwell time takes place after a 
calibration process with the eye gaze sensor is completed, and prior 
to when they are presented with an on-screen information dialogue 
box explaining the problem to be solved. The visitor’s session can 
end one of three ways: (1) the visitor opts to end their session prior 
to completing the problem-solving task in FUTURE WORLDS, (2) the 
visitor solves the problem and chooses to end their session, or (3) 
the visitor solves the problem, opts to continue interacting with the 
virtual environment, and later chooses to end their session. Each 
visitor’s dwell time was captured in total seconds (M=268.83, 
SD=137.48, Min=77.11, Max=657.48) and was recorded by the 
FUTURE WORLDS exhibit’s built-in logging functionalities. For the 
purpose of this work, the dwell time prediction task was converted 
to a classification problem by splitting dwell time into three tertile 
groups and assigning approximately one-third of the visitors to 
each group. We use this classification approach instead of 
regression analysis due to the relatively low number of visitors in 
the dataset and to accommodate the use of early prediction 
convergence metrics. The visitors in the dataset were assigned to
one of three possible groups according to their dwell time d: low (d
<= 193.54, N=26), medium (193.54 < d <= 318.82, N=27), and high (d
> 318.82, N=26). We take this approach as a way to prevent a
significant class imbalance while still retaining a higher level of
granularity than a median split. The distribution of visitor dwell
times, including the ternary groups, is shown in Figure 5.


Figure 5. Distribution of visitor dwell times and
ternary groups.
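
A tertile split of this kind can be sketched with the standard library. The quantile interpolation method is an assumption (the paper reports only the resulting cut points), and the function name and sample values are illustrative.

```python
from statistics import quantiles


def tertile_labels(dwell_times):
    """Assign each dwell time to a low, medium, or high tertile group.

    Returns the labels and the two cut points. The "inclusive"
    interpolation method is an assumed choice.
    """
    low_cut, high_cut = quantiles(dwell_times, n=3, method="inclusive")
    labels = []
    for d in dwell_times:
        if d <= low_cut:
            labels.append("low")
        elif d <= high_cut:
            labels.append("medium")
        else:
            labels.append("high")
    return labels, (low_cut, high_cut)


# Hypothetical dwell times in seconds.
dwell = [1, 2, 3, 4, 5, 6, 7, 8, 9]
labels, cuts = tertile_labels(dwell)
```

Because the cut points come from the data, each group receives roughly one-third of the visitors, which keeps the classes close to balanced.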


6.2 Feature Selection 

Because of the large number of features in the multimodal data (51 
facial expression features and 8 exhibit interaction features), we 
implemented forward feature selection to eliminate features with 
little or no predictive value and to reduce potential noise. Forward 
feature selection iterates through a list of features in a greedy 
manner, training a model on a single feature and continuing to add 
features if their inclusion increases the performance of the model 
on the target variable. This process continues until a predetermined 
number of features are selected or until all available features have 
been evaluated. This process has a few shortcomings. Due to the 
greedy nature of the algorithm, the features that are evaluated first 
have a higher chance of being selected. For example, the first 
feature that is evaluated is always retained, regardless of its true 
contribution to the predictive performance of the model. One 
approach to mitigating this issue is to perform forward feature 
selection for every possible combination of features, but this is 
often prohibitive as the number of combinations increases 
exponentially as the number of features increases, which imposes 
significant computational requirements. To mitigate the issue of 
bias in greedy feature selection while avoiding an exhaustive search 
across all feature combinations, we perform forward feature 
selection across a randomized ordering of all available features. We 
used a support vector machine (SVM) as the predictive model for 
each feature combination due to its effectiveness in high- 
dimensional spaces and relatively small computational overhead. 
This process was repeated for 100 separate iterations and 
randomizations to ensure that each feature had an equal probability 
of being placed at a specific point within each feature ordering. 
Following this process, the features were sorted according to the
frequency with which each feature was selected across all 100 iterations.
To compensate for the difference in the number of features for each
data channel, we performed forward feature selection on the facial
expression modality and retained the ten most frequently selected
features.
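
The randomized-order procedure can be sketched as follows. The scoring function is a hypothetical stand-in: in the paper it would be the performance of an SVM trained on the candidate feature subset, whereas here any callable that maps a feature list to a score is accepted. All function names are illustrative.

```python
import random
from collections import Counter


def forward_select(ordered_features, score_fn):
    """Greedy forward selection over a fixed feature ordering.

    A feature is kept only if adding it strictly improves the score.
    The first feature is always kept, which is the ordering bias
    that the randomization is meant to average out.
    """
    selected, best = [], float("-inf")
    for feature in ordered_features:
        candidate = selected + [feature]
        score = score_fn(candidate)
        if score > best:
            selected, best = candidate, score
    return selected


def selection_frequencies(features, score_fn, iterations=100, seed=0):
    """Repeat forward selection over shuffled orderings and return the
    fraction of iterations in which each feature was selected."""
    rng = random.Random(seed)
    counts = Counter()
    for _ in range(iterations):
        order = list(features)
        rng.shuffle(order)
        counts.update(forward_select(order, score_fn))
    return {f: counts[f] / iterations for f in features}


# Toy scorer (assumption): only features "a" and "b" carry signal.
useful = {"a", "b"}
toy_score = lambda subset: sum(f in useful for f in subset)
freq = selection_frequencies(["a", "b", "c", "d"], toy_score)
```

In this toy setting the informative features are selected in every iteration, while the uninformative ones are selected only when they happen to come first in an ordering, so their frequencies stay well below 1.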


It should be noted that because we selected the ten most frequent 
features from the facial expression modality, and the interaction- 
based modality contained only 8 total features, each feature from 
the latter modality was included in the data modeling process. (We 
perform forward feature selection on the interaction-based features 
for analysis purposes only.) Because certain features such as AU
durations and tile durations increase monotonically throughout a
visitor’s exhibit interaction trajectory and can lead to indirect data
leakage with regard to the target variable (dwell time at the exhibit),
the features were scaled by the total elapsed time up to the current
timestamp, converting them to proportional representations of the
elapsed time at each time interval.


Table 1. Most frequent features from forward feature
selection (interaction)


Feature                           Frequency
Proportional Tile Duration        0.637
Proportional Open Tile Count      0.561
Proportional Info Duration        0.557
Proportional Info Taps            0.554
Proportional Taps                 0.511
Proportional Swipe Tiles Count    0.416
Proportional Modify Tile Count    0.341
Beat Game                         0.272


Table 2. Most frequent features from forward feature
selection (facial expression)


Feature                           Frequency
AU05 Max                          0.317
AU10 Max                          0.276
Proportional AU10 Duration        0.257
AU02 Max                          0.237
Proportional AU01 Duration        0.227
AU26 Std                          0.218
AU25 Max                          0.214
Proportional AU17 Duration        0.208
Proportional AU45 Duration        0.206
Proportional AU26 Duration        0.196


This feature selection process took place within each cross- 
validation fold, and as a result, each fold produced a different 
combination of selected features. We calculated the frequency of 
the features across all cross-validation folds and present these in 
Table 1 and Table 2. 


Based on the results in Table 1, features related to general 
interactions (proportional number of times any tile was opened, 
proportion of time any tile was open) were the most predictive 
interaction-based features. The features related to opening and 
viewing embedded graphical and textual science materials were 
also frequently selected features. The features representing the 
frequency a visitor modified the in-game virtual environment were 
less frequently selected as predictive features, as was the binary 
indicator of whether the visitor correctly solved the problem at that 
particular timestamp. 


The most predictive features from the facial expression modality
were primarily maximum values and proportional durations of
certain AUs. AU05 (upper lid raiser) and AU10 (upper lip raiser)
were the most predictive facial action units, followed by AU02
(outer brow raiser) and AU01 (inner brow raiser). AU25 (lips part)
and AU26 (jaw drop) were moderately predictive, followed by
AU17 (chin raiser) and AU45 (blink). Multiple representations
of AU10 and AU26 were frequently selected during the feature
selection process as well. It is notable that the overall selection
frequency of the facial expression features is considerably lower
than that of many interaction-based features. This is likely a result
of the large number of facial expression features compared to the
interaction-based features.


6.3 Model Evaluation 


The models were evaluated using 10-fold cross-validation, with the 
splits for each fold occurring at the visitor level to ensure that a 
visitor’s data was contained only within a single training, 
validation, or test set. The dataset was standardized within each
cross-validation fold by subtracting each feature’s mean and
dividing by the feature’s standard deviation, as
determined by the training data. This rescales the data to have a
standard deviation of 1 (unit variance) while centering the mean to
be 0. Following this process, class imbalances within the training
data were resolved using Synthetic Minority Oversampling 
Technique (SMOTE) [10]. SMOTE is a common upsampling 
approach that resolves class imbalances through a randomized K- 
nearest neighbor approach, which brings the class balance to a 
uniform distribution while avoiding duplication of any data points. 
The upsampled, standardized training data is then used for forward 
feature selection as described in Section 6.2. 
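
The fold-wise standardization can be sketched as below. The key point is that the mean and standard deviation are estimated from the training rows only and then reused for held-out rows. The function names, the population standard deviation, and the guard for constant features are assumptions.

```python
from statistics import mean, pstdev


def fit_standardizer(train_rows):
    """Estimate per-feature mean and SD from the training rows only."""
    columns = list(zip(*train_rows))
    means = [mean(col) for col in columns]
    # Assumption: a constant feature keeps SD 1.0 to avoid division by zero.
    stds = [pstdev(col) or 1.0 for col in columns]
    return means, stds


def standardize(rows, means, stds):
    """Apply training-set statistics to any set of rows."""
    return [[(x - m) / s for x, m, s in zip(row, means, stds)] for row in rows]


# Hypothetical fold: two training rows and one held-out row.
train = [[1.0, 10.0], [3.0, 10.0]]
held_out = [[5.0, 12.0]]
means, stds = fit_standardizer(train)
train_z = standardize(train, means, stds)
held_out_z = standardize(held_out, means, stds)
```

Reusing the training statistics on the held-out rows keeps information about the test distribution out of the preprocessing step.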


After feature selection, a classifier model was trained on the 
multimodal data and the visitor dwell time labels in each cross- 
validation fold to provide a comparison point for the domain- 
adapted models. The tertile labels for the target variable were 
encoded as one-hot vectors for each model output. We evaluated 
five different models: SVM, logistic regression, naive Bayes, 
random forest, and a feedforward neural network. We performed 
hyperparameter tuning using a 3-fold nested cross-validation 
within the training set for each outer cross-validation fold. The 
hyperparameters that were varied for each model included the 
regularization parameter and kernel (SVM), regularization 
parameter (logistic regression), number of estimators (random 
forest), and number of layers and nodes (feedforward neural 
network). Additionally, the architecture of the autoencoder used to 
train the source encoder was evaluated during the hyperparameter 
tuning phase. The autoencoder was a feedforward neural network, 
and the hyperparameter values that were evaluated were the number 
of layers and nodes in the hidden layers within the encoder and 
decoder, as well as the number of latent dimensions. The 
feedforward neural network achieved the optimal performance as 
the classifier for visitor dwell time, using two hidden layers with 64 
nodes each. The source encoder contained three hidden layers with 
64, 32, and 16 nodes, respectively, with a latent output of 10. 
Additionally, all feedforward neural network models used a 
learning rate of 0.001, a dropout rate of 0.5 in the last hidden layer, 
and sigmoid activation functions. The loss function used for each 
model was categorical cross-entropy. Early stopping was 
implemented for each model using the validation data during the 
nested cross-validation to protect against overfitting. As a baseline, 
we follow the same process previously described, except using only 
the interaction modality. We evaluate both a unimodal and 
multimodal baseline in order to demonstrate the improved 
performance of the multimodal model of visitor dwell time as 
compared to the unimodal model, and to show the improved 
performance using the domain adaptation framework in situations 
where only unimodal data is available. 


After the optimal classifier and source encoder for adversarial 
domain adaptation were trained for each cross-validation fold, the 
models’ weights were fixed to evaluate the classifier performance
on interaction-only data and to encode the multimodal data within
the adversarial framework, respectively. The adversarial 
framework used a target encoder that is a feedforward neural 
network whose architecture and weights were pre-determined using 
the interaction-only baseline model. Although the source encoder 
and target encoder weights were not tied together as is common in 
other adversarial domain adaptation work [27], there was an 
imposed restriction that the latent dimensions be the same for both 
domains due to the fixed input size of the discriminator. The 
discriminator in the adversarial framework was a feedforward 
neural network with two layers of 64 nodes each. The learning rate 
of both the discriminator and the target encoder was 0.001, with a 
dropout rate of 0.05 in the last hidden layer and hyperbolic tangent 
activation functions. The loss functions for the discriminator and
target encoder were based on binary cross-entropy as shown in
Equations 1 and 2, respectively. The adversarial domain adaptation
took place within each cross-validation fold to prevent data leakage 
from the test set. 


To evaluate the predictive performance of the domain-adapted 
representations of the target data, the trained target encoder was 
used to encode the interaction-only data from the held-out test set 
within each cross-validation fold, and the encoded data was passed 
to the classifier model trained with the source data. The predictive 
performance of the classifier on this data was used to confirm that 
the use of multimodal data to train the classifier induces higher 
performance than if the facial expression data was removed from 
the dataset entirely. As an additional baseline, the target encoder 
trained on the interaction-only modality was used to pass the 
encoded data directly to the multimodal classifier without the 
domain adaptation procedure, following the source-only baseline 
approach of Tzeng et al. [42]. This illustrates that any improvement 
due to our method can be attributed to the adjusted weights through 
the adversarial adaptation process instead of just compressing the 
latent representation of the target domain data to the source 
domain’s dimensionality. This specific baseline is called target- 
only. 


6.4 Early Prediction 

To quantify the models’ ability to accurately predict a visitor’s 
dwell time early and consistently, we utilize two metrics: 
standardized convergence point [30] and convergence rate [4]. The 
standardized convergence point calculates the average point at which
the model converges to the correct label, while penalizing any
visitor sequence that does not converge to a correct prediction. This
metric extends the conventional convergence point metric to
account for sequences that are ultimately predicted incorrectly and
fail to converge by instituting a penalty term [4]. For such a
sequence, the standardized convergence point is greater than one.
In cases of convergence, a sequence’s standardized convergence
point falls within the range [0, 1]. Equation 3 displays the formula
used to calculate the standardized convergence point across all
sequences, where $m$ is the number of sequences, and $n_i$ is the
number of data points in the $i$-th visitor’s sequence. The value of
$k_i$ is the number of data points after which the model makes
consistently accurate predictions; otherwise, $k_i$ equals $n_i + p_i$,
where $p_i$ is the penalty term for the $i$-th sequence [30]. ($p_i$ is
set to 1 for all sequences in this work, following the original work.)
A lower standardized convergence point indicates that the model’s
predictive accuracy tends to converge earlier in a visitor’s
interaction with the exhibit, indicating better early prediction
performance.

$$\text{Standardized convergence point} = \frac{1}{m} \sum_{i=1}^{m} \frac{k_i}{n_i} \qquad (3)$$


The second metric that we use to quantify a model’s early 
prediction performance is the convergence rate. Convergence rate 
is the percentage of observed sequences in which the final 
prediction is accurate. Any sequence that contains an accurate 
dwell time prediction at the last data point is considered to have 
converged. Therefore, a higher convergence rate is indicative of 
better performance. 
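
The two early prediction metrics can be illustrated with a short sketch. The code below is one reading of the definitions above rather than the authors’ implementation; the function names and the convention that $k_i$ is the 1-based position of the first prediction in the final run of correct predictions are assumptions.

```python
from typing import List, Optional, Sequence, Tuple

# Each sequence is (per-step predicted labels, true label for the visitor).
LabeledSequence = Tuple[Sequence[int], int]


def convergence_index(preds: Sequence[int], label: int) -> Optional[int]:
    """1-based position where the final run of correct predictions starts.

    Returns None when the last prediction is wrong (no convergence).
    """
    if not preds or preds[-1] != label:
        return None
    k = len(preds)
    while k > 1 and preds[k - 2] == label:
        k -= 1
    return k


def standardized_convergence_point(sequences: List[LabeledSequence], penalty: int = 1) -> float:
    """Mean of k_i / n_i, with k_i = n_i + penalty for non-converged sequences."""
    total = 0.0
    for preds, label in sequences:
        n = len(preds)
        k = convergence_index(preds, label)
        if k is None:
            k = n + penalty
        total += k / n
    return total / len(sequences)


def convergence_rate(sequences: List[LabeledSequence]) -> float:
    """Fraction of sequences whose final prediction is correct."""
    converged = sum(1 for preds, label in sequences if preds[-1] == label)
    return converged / len(sequences)


example = [
    ([0, 1, 1, 1], 1),  # converges at step 2 of 4 -> 0.5
    ([2, 2, 0, 0], 2),  # never converges -> (4 + 1) / 4 = 1.25
]
scp = standardized_convergence_point(example)
cr = convergence_rate(example)
```

In this toy example the second sequence contributes a value above one, which is how the penalty term raises the average for models that fail to converge.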


7. RESULTS AND DISCUSSION 


The results for the unimodal and multimodal models as well as the 
unimodal latent representations (i.e., target-only encoding) and 
domain-adapted representations are shown in terms of early 
prediction and visitor-level predictive performance in Table 3. To 
measure visitor-level performance, a single point estimate of the 
predictive performance for each individual visitor is obtained by 
averaging across the predictions for all data points. The results for 
Table 3 are shown in terms of standardized convergence point 
(SCP) and convergence rate (CR) for early prediction, and area 
under curve (AUC), Cohen’s Kappa, accuracy, and F1 score for 
visitor-level performance. Although AUC is commonly used for 
binary classification problems, we use this metric for a multi-class 
approach using a “one vs. rest” method which treats the correct 
class as the “positive” group and combines all other classes as a 
single “negative” group. The total AUC for a single model is 
calculated by using the unweighted mean of the AUC values across 
all three groups. 
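
The “one vs. rest” AUC computation described above can be sketched as follows. This is a minimal pure-Python illustration, not the evaluation code used in the paper; the function names are hypothetical, and the pairwise-comparison form of AUC (ties counted as one half) is a standard equivalent of the area under the ROC curve.

```python
from typing import List, Sequence


def binary_auc(scores: Sequence[float], positives: Sequence[bool]) -> float:
    """AUC as the probability that a positive outranks a negative."""
    pos = [s for s, is_pos in zip(scores, positives) if is_pos]
    neg = [s for s, is_pos in zip(scores, positives) if not is_pos]
    if not pos or not neg:
        raise ValueError("AUC needs at least one positive and one negative")
    wins = 0.0
    for sp in pos:
        for sn in neg:
            if sp > sn:
                wins += 1.0
            elif sp == sn:
                wins += 0.5  # ties count as half a win
    return wins / (len(pos) * len(neg))


def macro_ovr_auc(probabilities: List[Sequence[float]], labels: Sequence[str],
                  classes: Sequence[str]) -> float:
    """Unweighted mean of the one-vs-rest AUC over all classes."""
    aucs = []
    for idx, cls in enumerate(classes):
        scores = [row[idx] for row in probabilities]
        positives = [y == cls for y in labels]
        aucs.append(binary_auc(scores, positives))
    return sum(aucs) / len(aucs)


classes = ["low", "medium", "high"]  # dwell time tertiles
probs = [[0.8, 0.1, 0.1], [0.2, 0.7, 0.1], [0.1, 0.2, 0.7], [0.6, 0.3, 0.1]]
labels = ["low", "medium", "high", "low"]
auc = macro_ovr_auc(probs, labels, classes)
```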


Based on the results in Table 3, the adversarial domain adaptation 
allows the multimodal classifier to outperform all baselines in 
terms of early prediction and across all sequences for each visitor. 
As expected, the complete multimodal model achieved the highest 
performance, achieving an AUC value of 0.660, while also 
outperforming the other models in all other evaluation metrics. The 
model achieved a standardized convergence point of 64.58%, 
indicating that the model achieved and maintained its optimal 
predictive performance approximately 64% into a visitor’s total 
dwell time at the exhibit, while converging to the correct 
predictions more often than other baseline approaches. The 
interaction-only modality produced noticeably lower performance, 
achieving a convergence point of 75.95%, while also reaching a 
0.574 AUC across all sequences. The adversarial domain 
adaptation allowed the classifier to achieve higher performance on 
the interaction-only data, with a standardized convergence point of 
67.42% and a visitor-level AUC of 0.585. Its early prediction 
performance approaches that of the full multimodal model, and it 
outperforms the interaction-only baseline across all evaluation 
metrics. 


The classifier’s performance on the latent unimodal data (without 
domain adaptation) was notably poor, achieving an AUC that was 
slightly worse than random chance (0.500). This result is not 
surprising, as we are evaluating the model’s performance using 
latent representations from a domain that has not been used to train 
the model beforehand. Although similar baseline approaches can 
achieve moderate performance in instances where the source and 
target domains are relatively similar, other work that investigates 
cross-modality adaptation or adaptation across dissimilar domains 
achieves much lower performance for this specific baseline [42]. 


While the adversarial domain adaptation proved more effective 
than the interaction-only and latent unimodal data baselines, our 
framework did not match the performance of a framework that used 
the full multimodal data. Several factors may contribute to this 
gap. First, the interaction and facial expression domains differ 
significantly. The majority of the interaction-based modality 
consists of discrete, monotonically increasing features, which 
inherently are not as data-rich as the features from the facial 
expression modality. Because there are multiple features for each 
AU, the facial expression modality provides multifaceted 
perspectives on multiple AUs, leading to a relatively high number 
of continuous features. Adapting between two data channels with 
such a discrepancy in dimensionality may be a contributing factor 
to the framework’s performance. Second, the relatively small 
number of visitors in the dataset may also be a contributing factor, 
as the performance of the models could be at risk for overfitting the 
classifier, source encoder, or target encoder. Contributing to this 
potential issue is the loss induced in the domain adaptation process. 
The size of the dataset may prevent the adversarial framework from 
reaching optimal convergence. Third, because there is no restriction 
regarding how long the visitors could remain at the exhibit, the 
target variable has a relatively wide range of values, approximately 
from one minute to more than ten minutes. Although this issue is 
addressed through the use of a tertile split, additional data could 
provide further evidence of behavioral patterns that are able to 
induce higher performance with more granular target variables. 


Because timestamped interaction trace logs are the basis of one of 
the modalities used in this work, the design of the museum exhibit 
may play a role in the performance of the visitor models in terms 
of early prediction. During the early stages of FUTURE WORLDS, 
visitors are prompted to read an information dialog box explaining 
the premise of the game and a summary of the problem to be solved 
in the virtual environment. Because this event occurs at the 
beginning of every visitor’s interaction sequence, it is likely that 
more indicative behaviors that allow the classifier to differentiate 
between groups occur at later stages of learner interactions with the 
exhibit. This is a potential explanation behind the early prediction 
performance of each model, as the standardized convergence point 
occurs after 60% of the overall exhibit interactions across all 
models. 


To further investigate the impact that domain adaptation has on the 
predictive performance of the multimodal classifier, confusion 
matrices based on the target-only encoder and the adversarially- 
trained encoder are shown in Figure 6 as is the confusion matrix for 
the interaction-only classifier. The purpose of this analysis is to 
determine if adversarial domain adaptation results in any changes 
relative to the classifier’s sensitivity to certain dwell time groups. 


Table 3. Visitor-level predictive performance (all sequences) 

                                 Early Prediction   Visitor-Level Prediction 
Encoding           Classifier    SCP      CR        AUC     Kappa   Accuracy  F1 Score 
Interaction-Only   Unimodal      75.95%   34.18%    0.574   -       -         - 
Multimodal         Multimodal    64.58%   48.10%    0.660   0.278   0.519     0.511 
Target-Only        Multimodal    73.79%   34.18%    0.499   0.015   0.342     0.338 
Domain Adaptation  Multimodal    67.42%   43.04%    0.585   0.203   0.468     0.468 


Figure 6. Confusion matrices for classifiers using target-only, interaction-only, and domain adaptation-based representations. 


Based on the confusion matrix for the target-only classifier (i.e., the 
multimodal classifier evaluated on the interaction data without 
domain adaptation), the classifier appears to primarily predict high 
dwell times for a majority of visitors. The model also appears to 
frequently predict visitors with medium dwell time as having low 
dwell time. As this particular model performed similarly to a 
random chance classifier, it is likely that the interaction-only data 
representation was not easily identifiable to the classifier, leading 
it to primarily predict a single class and not classify the lower two 
groups accurately. The classifier that was trained and evaluated on 
interaction-only encodings performed slightly better and appears to 
become more accurate in cases of lower dwell time in visitors. 
However, it is notable that the model still does not appear to 
accurately predict instances of medium dwell time. This indicates 
that the interaction-based modality contains salient features 
indicative of noticeably low or high engagement but interactions 
from visitors with medium dwell time are not easily distinguishable 
to the model. Low dwell time may be characterized by a relatively 
low number of taps or interactions in the virtual environment, while 
high dwell time may be indicated by greater or more frequent 
tapping or interactions with the virtual environment. Additionally, 
visitors that have a higher dwell time are more likely to beat the 
game or read a higher number of information dialogs. However, 
this information may not be predictive enough with the ternary 
split, causing the interaction model to overfit to the two extremes. 


The multimodal classifier that processes the modality-invariant 
data representations performs noticeably better for visitors with 
medium dwell time and continues to maintain fairly accurate 
performance on visitors with high dwell time. This may indicate 
that facial expression captures physical cues that allow the model 
to more easily distinguish between the medium group and the other 
groups, and the domain adaptation allows these features to be 
integrated into the interaction-only representations. By 
implementing this approach across the two modalities, it appears 
that the multimodal model retains its robustness to visitors with a 
medium dwell time in particular, while being able to achieve this 
performance using only features from the interaction data. This is 
significant because the interaction-only model does not achieve high 
performance on visitors with medium dwell time, so it remains 
important to use the multimodal data representations obtained 
through domain adaptation as pre-training for accurately predicting 
visitor dwell time. 


8. CONCLUSION 


Modeling visitor engagement is an important task in museum-based 
learning. However, visitor engagement modeling presents 
significant challenges, as visitors’ patterns of engagement with 
museum exhibits can vary widely. Multimodal frameworks show 
promise for the prediction of visitor engagement in museums 
because they capture information about visitor behavior that cannot 
otherwise be captured through interaction trace logs or similar 
unimodal data channels. Although multimodal sensor systems give 
rise to concerns about privacy, feasibility, and intrusiveness, the 
complete removal of sensor data from visitor engagement models 
may result in diminished predictive performance. To address this 
issue, we have introduced an adversarial domain adaptation 
approach to generating modality-invariant representations of 
interaction data and facial expression data from visitor interactions 
with the FUTURE WORLDS museum exhibit. The domain adaptation 
approach enables multimodal models to be induced in a pre- 
training phase while being deployed and evaluated with modality- 
invariant representations obtained using interaction-based data 
exclusively. We investigate the models’ ability to predict visitor 
dwell time during the early stages of a visitor’s interaction with the 
museum exhibit. Results indicate that the domain adaptation 
approach to modeling visitor engagement achieves higher 
performance than a visitor modeling approach using only a single 
modality. The domain adaptation approach also outperforms the 
unimodal baseline during early sequences of a visitor’s interaction 
trajectory as well as across all sequences while demonstrating 
competitive performance compared to classifiers utilizing 
multimodal data. 


There are several promising directions for future work. Alternative 
techniques for modeling visitor engagement should be evaluated, 
including sequential models like long short-term memory (LSTM) 
networks, to improve models’ predictive accuracy and early 
prediction. Alternative approaches to the adversarial learning 
component of this framework include the use of generative models 
such as GANs or variational autoencoders. Attaining reliable 
training convergence continues to be a challenging problem within 
adversarial learning and investigating solutions to this issue may 
enhance the benefits of domain adaptation. The generalizability of 
the domain adaptation framework should be evaluated using larger 
and more diverse visitor populations on different exhibits and 
museum settings. Additionally, the domain adaptation framework 
should be evaluated using additional combinations of modalities 
(e.g., posture, gaze, speech), and extended to include three or more 
modalities simultaneously. Finally, this framework should be 
evaluated at run-time by integrating visitor engagement models into 
a museum exhibit to enable visitor-adaptive interventions to enrich 
visitor engagement and enhance museum-based learning 
experiences. 


ABSTRACT 


Knowledge Tracing (KT) is the task of modeling students’ knowledge 
based on their coursework interactions within an Intelligent 
Tutoring System (ITS). Recently, Deep Neural Networks (DNNs) have 
shown superior performance over classical methods on multiple 
benchmark datasets. While most Deep Learning based Knowledge 
Tracing (DLKT) models are optimized for general objective metrics 
such as accuracy or AUC on benchmark data, proper deployment of the 
service requires additional qualities. Moreover, the black-box 
nature of DNN models makes them particularly difficult to diagnose 
or improve when unexpected behaviors are encountered. In this 
context, we adopt the idea of black-box testing / behavioral testing 
from Software Engineering and (1) define desirable KT model 
behaviors to (2) propose a KT model analysis framework to diagnose 
the model’s behavioral quality. We test-run the proposed framework 
using three state-of-the-art DLKT models on seven datasets. The 
results highlight the impact of dataset size and model architecture 
upon the model’s behavioral quality. The assessment results from the 
proposed framework can be used as an auxiliary measure of model 
performance by themselves, but can also be utilized in model 
improvements via data augmentation, architecture design, and loss 
formulation. 


Keywords 


Knowledge Tracing, Deep Learning, Behavioral Testing, Model Validation 


1. INTRODUCTION 


Assessment is a central task in Education, as it is involved in 
meta-cognition [17], tracing the skill trajectory, recommendation 
of contents [36], adjustment of tutoring strategy [14], and grading 
[3, 24, 33, 35]. With the advent of online educational platforms, 
there is an increasing demand for building assessment models using 
the interaction history data of users. One approach to tracking the 
skill of users is Knowledge Tracing (KT), which is the task of 
modeling students’ knowledge based on their coursework interactions 
within an Intelligent Tutoring System (ITS) [7]. 


To tackle the KT problem, the recent EdNet Challenge on Kaggle 
gathered a total of 3,406 teams and 4,412 participants, who 
submitted 64,678 models. Participants trained KT models on the EdNet 
KT dataset [6], and the models were evaluated by the Area Under the 
Receiver Operating Characteristic Curve (AUC). The AUCs of the top 5 
models were 0.820, 0.818, 0.818, 0.817, and 0.817, which are very 
similar, and all of these models were based on the Transformer 
neural network structure [38]. While neural network structures are 
usually designed to reflect human intuitions, most models lack 
interpretability compared to classical models. Therefore, when 
evaluation results for black-box models do not vary significantly, 
it becomes unclear how to choose the best model for deployment. 
Also, while a small quantitative difference in the objective 
function or model AUC might not hurt the users’ perception of the 
model’s reliability, a few adversarial decisions of a black-box 
model can undermine the user’s trust. The authors of [34] also note 
that the performance of black-box models that are trained for 
general metrics such as classification accuracy or AUC can be 
overestimated. 


As a result, Deep Learning based Knowledge Tracing (DLKT) 
models are not frequently implemented in the education 
community due to potential risks arising from the lack of 
model interpretability. In this study, we propose behavioral 
testing as an approach to alleviate this problem. The 
contributions of this work are summarized as follows: 


• We propose a novel testing framework to validate DLKT 
models using tests on behaviors. The idea is to define 
consistent and convincing behaviors that are desired of 
DLKT models. 


• As an example of applying the framework, we benchmark 
three state-of-the-art DLKT models using the proposed 
validation framework. Positive results highlight the 
reliability of DLKT models and encourage their adoption, 
while negative results point out the limitations of DLKT 
models and show room for improvement. 


• We introduce methods to utilize evaluations from the 
framework to design and improve DLKT models. 


2. RELATED WORKS 


2.1 Knowledge Tracing 

Knowledge Tracing (KT) is the task of predicting the expected 
correctness of a student’s interaction with a question by 
modeling the student’s knowledge from past interactions [7]. 
In this study, we formulate the KT task as follows: the 
interaction sequence of a user is denoted as 
$X^u = \{x^u_1, x^u_2, \cdots, x^u_T\}$, where $u \in U$ is the user 
index. To simplify the notation, we omit the user index $u$ unless 
specified. Each interaction $x_t = (q_{i_t}, c_t)$ at step $t$ is 
defined by the pair of question $q_{i_t} \in Q$ and correctness 
$c_t \in \{0, 1\}$, where $Q = \{q_1, q_2, \cdots, q_n\}$ denotes the 
set of all questions and $i_t$ denotes the question index of step 
$t$. A KT model predicts the correctness probability 
$P(c_t = 1 \mid x_1, x_2, \cdots, x_{t-1}, q_{i_t})$ of an unseen 
interaction $x_t$ at step $t$, where $X_t = x_1, x_2, \cdots, x_t$ 
denotes the first $t$ interactions of an interaction sequence $X$. 


Notation       Description 
$u \in U$      User index 
$X$            Interaction sequence of a single user ($= X^u$) 
$x_t$          Interaction at time step $t$ 
$q \in Q$      Question 
$i_t$          Question index at time step $t$ 
$c_t$          Correctness at time step $t$ for question $q_{i_t}$ 
$c_t^{(j)}$    Correctness at time step $t$ for question $q_j$ 
$X_t$          Interaction sequence of $X$ up to time step $t$ 


Many KT models utilize domain-specific tags such as skill 
components of questions [27, 8, 31, 43, 42], difficulty of 
questions [32, 10, 37], or knowledge graphs [4]. Item Response 
Theory (IRT) [32] models the correctness probability of a student 
responding to a question using custom-designed models, and fits 
the model parameters using maximum likelihood. For instance, the 
4-PL model predicts the correctness probability of a user with 
skill level $\theta$ solving item $i$ by 


$$p_i(\theta) = c_i + \frac{d_i - c_i}{1 + e^{-a_i(\theta - b_i)}},$$ 


where $a_i$, $b_i$, $c_i$, $d_i$ are parameters that model discrimination, 
difficulty, pseudo-guessing, and slip of item $i$ [23]. 
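
The 4-PL response function above can be evaluated directly. The snippet below is a small illustrative sketch; the function name and the example parameter values are hypothetical and are not taken from any dataset in the paper.

```python
import math


def four_pl(theta: float, a: float, b: float, c: float, d: float) -> float:
    """4-PL IRT correctness probability for skill level theta.

    a: discrimination, b: difficulty,
    c: lower asymptote (pseudo-guessing), d: upper asymptote (slip).
    """
    return c + (d - c) / (1.0 + math.exp(-a * (theta - b)))


# At theta == b the probability sits halfway between c and d.
p_mid = four_pl(theta=0.0, a=1.0, b=0.0, c=0.25, d=0.95)
# A much more skilled user approaches the upper asymptote d.
p_high = four_pl(theta=10.0, a=1.0, b=0.0, c=0.25, d=0.95)
```

Setting $c = 0$ and $d = 1$ recovers the familiar two-parameter logistic form.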


Another prominent approach is Bayesian Knowledge Tracing 
(BKT) [27, 42], which uses a Markov process to model the 
difficulty of the question items and the learning capability of 
the students. Another well-known approach is Deep Knowledge 
Tracing (DKT) [31], which is the first Deep Learning based 
KT model (DLKT). Since the introduction of DKT, many 
researchers have worked on different network structures to 
capture the complex aspects of the knowledge state. There 
are a variety of models based on different structures such as 
DKVMN [43], DKT+ [41], SKVMN [1], SAKT [30], GKT [28], 
EKT [21], KTM [39], DHKT [40], SAINT [5], AKT [13], and 
PEBG [22]. 


While there exist a variety of different KT models, the authors 
of [12] performed an extensive experiment on the accuracy of 
three groups of KT models (Markov process, Logistic Regression, 
Deep Learning) on nine real-world datasets. While deep learning 
models do show better AUC and RMSE on some datasets, other 
linear models, including the authors’ proposed BestLR approach, 
yielded comparable or superior performance on most datasets 
while also providing better model interpretability. 


2.2 Behavioral Testing in Other Applications 
To alleviate unexpected behaviors of black-box models, [2] 
introduces behavioral testing (also known as black-box testing) 
to test different capabilities of a system from a software 
engineering perspective. Many studies work on effectively 
designing test cases [18, 25, 26], and [29] gives a detailed 
review of the behavioral testing method applied in various 
software testing domains. In Natural Language Processing, 
the authors of [34] apply the behavioral testing framework to 
validate the behaviors of general NLP models. They introduce 
CheckList, which is a task-agnostic methodology for testing NLP 
models. CheckList is a list of general linguistic capabilities 
and test type baselines for NLP tasks. It is also a software 
tool to generate test cases for NLP models. 


2.3 Behavioral Studies in Knowledge Tracing 
The expected behaviors of KT models have been discussed in 
some studies, which point out the adversarial behaviors of KT 
models and propose new models to alleviate the problem. DKT+ 
[41] raises two problems of the Deep Knowledge Tracing (DKT) 
model [31]: correctness probabilities that increase after 
incorrect responses, and wavy transitions in predictions over 
time. However, these behaviors can naturally occur from the 
educational effects embedded in the interaction, which we 
discuss in detail in Section 3.1. The authors add three 
regularization terms in DKT+ to enhance the consistency of the 
predictions of DKT, and introduce extra performance measures. 


The authors of [19] list some desirable behaviors based on 
the monotonicity of KT models to improve the generalization 
ability of the models. They then introduce three types of 
novel data augmentation techniques (replacement, insertion, 
and deletion) and apply them to the training of KT models. 


As examined in these relevant studies, the adversarial 
behaviors and low interpretability of DL models hinder the 
AIEd community from adopting Deep Learning based KT (DLKT) 
models, and the community continues to adopt interpretable 
models based on BKT, IRT, or Cognitive Diagnosis Models [9]. In 
this study, we provide a validation framework for DLKT models 
and conduct an extensive set of experiments on the desired 
behaviors of DLKT models. Good results highlight the 
reliability of DLKT models and encourage their adoption on most 
datasets. On the other hand, bad results point out the 
limitations of DLKT models on some datasets and show room for 
improvement. 


3. BEHAVIORAL TESTING FOR KNOWLEDGE TRACING 


We propose a black-box behavioral testing framework for the 
knowledge tracing task. First, we define the knowledge state 
(KS), and then elaborate on desirable behaviors of KT models’ 
KS representation. Finally, in Subsection 3.2, we introduce 
specific experiment setups to assess whether DLKT models 
satisfy those behaviors. 


We define the knowledge state to be a vector representation 
of a user’s correctness probability on a set of questions 
$Q'$ at a specific time point. Given the first $t$ interactions 
$X_t = x_1, x_2, \cdots, x_t$ of a user, we define the user’s 
knowledge state as: 


$$KS_t = \left[ P(c_t^{(j)} = 1 \mid X_{t-1}, q_j) \right]_{q_j \in Q'} \qquad (1)$$ 


Here $c_t^{(j)}$ represents the Bernoulli indicator for the event 
that the user answers question $q_j$ correctly at time step $t$, 
as defined in the notation table in Section 2.1, and the knowledge 
state is updated as the user’s interaction sequence is provided. 
Note that a KS is the collection of prediction values for 
questions, regardless of whether the user has actually responded 
to them. We describe the desired aspects 
of DLKT models in the following section. 
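
To make the definition concrete, the sketch below builds a knowledge state vector by querying a predictor once per question in $Q'$. The `predict` interface and the `toy_predict` model (a smoothed per-question correctness rate) are hypothetical stand-ins for a trained DLKT model, introduced only for illustration.

```python
from typing import Callable, List, Sequence, Tuple

Interaction = Tuple[int, int]  # (question index, correctness in {0, 1})
Predictor = Callable[[Sequence[Interaction], int], float]


def knowledge_state(predict: Predictor, history: Sequence[Interaction],
                    question_set: Sequence[int]) -> List[float]:
    """Vector of predicted correctness probabilities, one per question.

    `history` plays the role of X_{t-1}; `question_set` plays the role of Q'.
    """
    return [predict(history, q) for q in question_set]


def toy_predict(history: Sequence[Interaction], question: int) -> float:
    """Hypothetical stand-in model: Laplace-smoothed correctness rate."""
    answers = [c for q, c in history if q == question]
    return (sum(answers) + 1) / (len(answers) + 2)


initial_ks = knowledge_state(toy_predict, [], [0, 1, 2])
later_ks = knowledge_state(toy_predict, [(0, 1), (0, 1), (1, 0)], [0, 1, 2])
```

Note that the vector includes question 2 even though the user never answered it, which mirrors the point that a KS covers both answered and unanswered questions.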


3.1 Expected Behaviors of Traced Knowledge State 


First we introduce two properties on the change $\Delta KS$ of 
the knowledge state $KS$ with respect to an atomic change 
$\Delta X$ of the interaction sequence $X$, which is an insertion 
or a deletion of a single interaction record. 


First, monotonicity insists that the model’s knowledge state 
should be updated to a more knowledgeable state when the 
student adds another correctly answered question (positive 
interaction) or when an interaction record with an incorrect 
response (negative interaction) of the student is deleted. If 
$\Delta X$ is applied in the middle of the interaction record, 
all changes after the perturbation should hold the property as 
well. Second, robustness insists that a small perturbation in 
the interaction history should not yield a dramatic change 
in future knowledge states. The details of the two properties 
are introduced below. 


• Monotonicity: If $\Delta X$ is a correctly responded 
interaction $(q_{i_{t_p}}, 1)$ at perturbation time $t_p$, then 
we can track the relation of 
$P(c_t^{(j)} = 1 \mid X \cup \Delta X, q_j)$ and 
$P(c_t^{(j)} = 1 \mid X, q_j)$ depending on how $q_j$ and 
$\Delta X$ are correlated. In many cases, a positive correlation 
in correctness probabilities is desired due to the relation 
of knowledge states: 


$$P(c_t^{(j)} = 1 \mid X \cup \Delta X, q_j) > P(c_t^{(j)} = 1 \mid X, q_j)$$ 


for $t > t_p$ and $q_j \in Q$. 


However, there can also be negatively correlated questions, 
which could be consequences of factors such as limited 
learning resources. For instance, a college student might 
sacrifice her studying time on one subject for another when 
both subjects’ examinations are scheduled too closely to each 
other. This type of circumstance might cause the model to fit 
a non-monotonic relationship between the two fundamentally 
unrelated subjects. In most ITSs, however, the target study 
domain is usually restricted to a single subject, or a set of 
knowledge components where the student’s comprehension of the 
components is usually positively correlated. Another case is 
when a negative response increases the correctness probability 
of a problem, which is described as an adversarial behavior in 
[41]. However, the educational effect of consuming a question 
can give positive feedback on the knowledge state even if the 
interaction response was wrong. Therefore, we assume the 
described monotonic behavior in general knowledge tracing 
environments while simultaneously keeping track of the 
opposite case in the experiments of Section 4. 


• Robustness: For any black-box system, it is generally 
desirable that an insignificant change in the system’s input 
leads to a limited change in the output. For a knowledge 
tracing model in an ITS, the input refers to the student’s 
interaction record and the output refers to the model 
prediction on correctness probability for an encountered 
question or a set of questions. Therefore, we formulate the 
robustness of a knowledge tracing model as below in a general 
sense, adopting the $\Delta X$ previously defined: 


$$\left| P(c_t^{(j)} = 1 \mid X \cup \Delta X, q_j) - P(c_t^{(j)} = 1 \mid X, q_j) \right| < \epsilon$$ 


for some $\epsilon$, a single interaction $\Delta X$, 
$t > t_p$, and $q_j \in Q$. If we impose the inequality to always 
hold at $t = t_p + 1$ for a fixed $\epsilon$, then it is equivalent 
to imposing continuity on the knowledge state in terms of 
time-steps. We treat continuity as a specific case of robustness 
and introduce a customized test for continuity separately from 
the test for robustness. 


However, consider a case when $q_j$ and $q_{i_{t_p}}$ in $\Delta X$ 
assess similar concepts, or when the educational effect from 
the interaction with one question affects the student’s 
correctness probability on the other question. Then an 
insertion or deletion of one question is prone to have a 
significant impact upon the predicted correctness probability 
of the other for $t > t_p$. Therefore, the defined robustness / 
continuity need not be universally desirable for all pairs of 
questions. The impact of this property would eventually depend 
on the degree of dependency among the questions. Therefore, in 
the experiments, while assuming robustness for most question 
pairs, we also carefully track where some questions affect the 
prediction values of other questions by a notable amount. 
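
The two properties above can be checked mechanically for a single insertion. The sketch below is illustrative only: `toy_predict` is a hypothetical stand-in for a DLKT model, the helper names are assumptions, and the check is simplified to compare predictions once at the end of the sequence (with a non-strict inequality for monotonicity) rather than at every step $t > t_p$.

```python
from typing import List, Sequence, Tuple

Interaction = Tuple[int, int]  # (question index, correctness in {0, 1})


def toy_predict(history: Sequence[Interaction], question: int) -> float:
    """Hypothetical stand-in model: Laplace-smoothed correctness rate."""
    answers = [c for q, c in history if q == question]
    return (sum(answers) + 1) / (len(answers) + 2)


def insert_interaction(history: Sequence[Interaction], position: int,
                       interaction: Interaction) -> List[Interaction]:
    """Return a copy of the history with one interaction inserted."""
    return list(history[:position]) + [interaction] + list(history[position:])


def perturbation_deltas(predict, history, position, interaction, question_set):
    """Prediction change per question caused by inserting one interaction."""
    perturbed = insert_interaction(history, position, interaction)
    return [predict(perturbed, q) - predict(history, q) for q in question_set]


def is_monotonic(deltas: Sequence[float], inserted_correct: bool) -> bool:
    """Predictions should not move against the sign of the inserted response."""
    if inserted_correct:
        return all(d >= 0 for d in deltas)
    return all(d <= 0 for d in deltas)


def is_robust(deltas: Sequence[float], epsilon: float) -> bool:
    """Every prediction change stays below the tolerance epsilon."""
    return all(abs(d) < epsilon for d in deltas)


history = [(0, 0), (1, 1)]
deltas = perturbation_deltas(toy_predict, history, 1, (0, 1), [0, 1])
```

With a real model, the same comparison would be repeated over many sampled sequences and perturbation positions, and the fraction of passing cases reported.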


Next we discuss what constitutes an expected value of the 
knowledge state. Testing whether the knowledge state has 
accurately captured the user’s interaction history is in line 
with the existing quantitative metrics (AUC, ACC) adopted in 
the KT literature. However, the existing test methods focus 
only on the single actual question provided at each time step, 
whereas we propose to assess a knowledge tracing model via its 
knowledge state on a virtual question set in order to provide 
a more holistic assessment. 


Although tracing the knowledge state on a set of questions would 
provide a more comprehensive picture of how the user’s 
knowledge is traced, it lacks actual correctness labels for 
unseen questions at each time step, as discussed in Section 
3.1. Therefore, we describe below novel measures to assess the 
correctness of the knowledge state under a few purposefully 
designed circumstances. 


• Approximate Label of User-Independent Initial Knowledge 
State: At the first prediction step, we approximate the 
correctness label for all questions via their ‘global 
difficulty’. The initial knowledge state for all users should 
represent the model’s prior belief of question difficulty 
before encountering any user-specific interaction record 
data. It is reasonable to assess the quality of this value 
in terms of the correlation between the model’s prediction on 
each question and the question’s global difficulty. However, 
this is an approximation at best, since the question’s 
difficulty might not be accurately captured by a simple 
average over its occurrences. If the actual interaction 
data generated from the ITS provides a very difficult 
question only after the user’s knowledge is significantly 
accumulated, a simple average of question correctness labels 
would not be representative of the question’s inherent 
difficulty. In that case, the model’s prediction would be high. 


• Ideal Value of Knowledge State after Converged Interaction 
Data: We generate obvious edge-case test cases where the 
user’s knowledge state on a set of questions has converged to 
a value. We create this virtual dataset by simply repeating 
an identical interaction record on each question 
consecutively. The model’s prediction value for the question 
in the repeated interaction should converge to the repeatedly 
provided label value. 


• Approximate Label of Knowledge State in General: It is also 
possible to approximate a pseudo-label for unseen questions 
using rolling/expanding averages or IRT-like algorithms, which 
demonstrate more stable and monotonic behavior by design. 
Although we conjecture that such a training methodology for 
DLKT models using pseudo-labels might provide a regularization 
effect, we do not include this case in the scope of this work. 


Table 1: Behavioral Test Summary

Behavior      | Analysis Method
Monotonicity  | Perturbation Test: Percentage of interaction samples for which the model prediction changed in the expected direction.
Robustness    | Perturbation Test: Degree of impact from perturbation across time-steps.
Continuity    | Continuity Test: Average and maximum change in knowledge state score per step and throughout the entire sequence.
Initial Value | Initial Value Test: Correlation between question correctness rate and initial knowledge state.
Convergence   | Convergence Test: Convergence speed in terms of model AUC and average model output at different time-steps.


3.2 Behavioral Test Setups 

Below we describe four behavioral testing setups for DLKT
models. First, perturbation tests examine the model’s monotonicity
and robustness given an atomic perturbation to the
original interaction sequence data. Second, the continuity test
checks whether the model’s knowledge state representation
is continuous along the interaction sequence. Third,
the initial knowledge state test checks whether the initial
knowledge state reflects each question’s corresponding difficulty
measure. Fourth, the convergence test checks whether the
knowledge state converges to the expected value and how fast it
converges. The following subsections describe each test
setup in further detail. Table 1 provides a summary of the
tests.


3.2.1 Perturbation Tests 

We examine the monotonicity and robustness of the model with
perturbation tests. We experiment with three types of
perturbations: insertion, deletion, and flip. For each original
interaction sequence, we determine $t_p$, which is the index of the
interaction to be perturbed. For the insertion case, we add
a new interaction between $x_{t_p-1}$ and $x_{t_p}$. For the deletion
case, we remove the interaction $x_{t_p}$. For the flip case, we
flip the correctness of $x_{t_p}$ from 1 to 0 or from 0 to 1.


In order to check monotonicity, we assess whether the model’s
predicted correctness probability in the following interaction
sequence $X_{[t_p,\cdot)}$ changes towards the expected direction. For
insertion / deletion / flip of an interaction to which the user
responded correctly, we examine whether the following future
correctness probability $P(c_{t'+1} = 1 \mid X_{t'}),\ \forall t' > t_p$ increases
/ decreases / decreases, respectively. In the experiments, we
fix the perturbation point to be located halfway in the user’s
original interaction sequence, then measure the proportion
of interactions for which the model’s predicted correctness
probability changes towards the expected direction.


To assess the model’s robustness, we visualize how the degree
of impact from perturbation changes along the time
steps from $t_p$. We expect the degree of impact from
perturbation upon the model’s prediction to decay gradually as
more interactions are fed into the model after the perturbation
point $t_p$.
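The three perturbations and the monotonicity pass rate can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the `Predictor` interface, the `toy_predict` stand-in model, and the way later steps are aligned between the original and the perturbed sequence are all hypothetical choices made for the example.

```python
from typing import Callable, List, Tuple

Interaction = Tuple[int, int]  # (question_id, correctness in {0, 1})
# Hypothetical model interface (an assumption, not the paper's API):
# given the interaction history and the next question id, return P(correct).
Predictor = Callable[[List[Interaction], int], float]


def insert(seq: List[Interaction], t_p: int, new: Interaction) -> List[Interaction]:
    """Add `new` between x_{t_p-1} and x_{t_p}."""
    return seq[:t_p] + [new] + seq[t_p:]


def delete(seq: List[Interaction], t_p: int) -> List[Interaction]:
    """Remove the interaction x_{t_p}."""
    return seq[:t_p] + seq[t_p + 1:]


def flip(seq: List[Interaction], t_p: int) -> List[Interaction]:
    """Flip the correctness of x_{t_p} (1 -> 0 or 0 -> 1)."""
    q, c = seq[t_p]
    return seq[:t_p] + [(q, 1 - c)] + seq[t_p + 1:]


def pass_rate(predict: Predictor, original: List[Interaction],
              perturbed: List[Interaction], start_orig: int,
              start_pert: int, expected_sign: int) -> float:
    """Fraction of later steps whose prediction moved in the expected direction.

    start_orig / start_pert index the first interaction that follows the
    perturbation in each sequence, so the same question is compared.
    """
    n = min(len(original) - start_orig, len(perturbed) - start_pert)
    passed = 0
    for k in range(n):
        question = original[start_orig + k][0]
        p_orig = predict(original[:start_orig + k], question)
        p_pert = predict(perturbed[:start_pert + k], question)
        if (p_pert - p_orig) * expected_sign > 0:
            passed += 1
    return passed / n if n else float("nan")


def toy_predict(history: List[Interaction], question_id: int) -> float:
    """Stand-in model: smoothed running mean of past correctness."""
    correct = sum(c for _, c in history)
    return (1 + correct) / (2 + len(history))


seq = [(1, 1), (2, 0), (3, 1), (4, 1), (5, 0), (6, 1)]
t_p = len(seq) // 2  # perturbation point halfway through the sequence

# Inserting a correct interaction should raise later predictions (+1);
# deleting or flipping a correct interaction should lower them (-1).
insertion_rate = pass_rate(toy_predict, seq, insert(seq, t_p, (99, 1)), t_p, t_p + 1, +1)
deletion_rate = pass_rate(toy_predict, seq, delete(seq, t_p), t_p + 1, t_p, -1)
flip_rate = pass_rate(toy_predict, seq, flip(seq, t_p), t_p + 1, t_p + 1, -1)
```

The toy model is monotonic by construction, so all three rates equal 1.0 here; a trained DLKT model would be plugged in through the same `predict` callable.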


3.2.2 Continuity Test 

We test whether the knowledge state is continuous, in the
manner described in Section 3.1. For every
time-step, we provide the model with not only the original
interaction at the corresponding time-step, but also a
set of questions $Q'$ simultaneously to construct the knowledge
state $KS_t$ at that time-step. Although we do not have actual
correctness labels for those virtual interactions, we only
inquire how the knowledge state, or the model prediction on
$Q'$, evolves along the time-steps.


In the experiments, we approximate a score on the user’s 
knowledge state by averaging the model-predicted correct- 
ness probability over the sample set of questions Q’ to re- 
port: (1) average and maximum student score change per 
single time-step and (2) average student score change and 
range across 100 time-steps. 
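The reported quantities can be computed from one student's score trajectory roughly as below. This is a sketch under assumptions: the function and key names are hypothetical, and the total change is taken as an absolute difference between the first and last score, which the text does not specify.

```python
from typing import Dict, List, Sequence


def ks_score(probs: Sequence[float]) -> float:
    """Average predicted correctness probability over the question sample Q'."""
    return sum(probs) / len(probs)


def continuity_metrics(scores: List[float]) -> Dict[str, float]:
    """Summarise one student's score trajectory (one score per time-step)."""
    steps = [abs(b - a) for a, b in zip(scores, scores[1:])]
    return {
        "avg_step_change": sum(steps) / len(steps),
        "max_step_change": max(steps),
        # Assumption: absolute first-to-last difference.
        "total_change": abs(scores[-1] - scores[0]),
        "total_range": max(scores) - min(scores),
    }


# Illustrative values only: predictions on three sampled questions per step.
predictions_per_step = [
    [0.4, 0.5, 0.6],
    [0.5, 0.6, 0.7],
    [0.45, 0.55, 0.65],
    [0.6, 0.7, 0.8],
]
scores = [ks_score(p) for p in predictions_per_step]
metrics = continuity_metrics(scores)
```

Averaging these per-student summaries over all students gives dataset-level figures of the kind the test reports.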


3.2.3 Initial Knowledge State Test 
Table 2: Dataset Statistics


Dataset       Users    Items   Skills  #Intr.  %Crct.
ASSIST15      14228    100     100     6656K   73
ASSIST17      1708     3162    411     935K    37
STATICS       282      1223    98      189K    77
Spanish       182      409     221     579K    77
EdNet-small   5000     13156   118     518K    65
EdNet-med     100000   13518   118     11M     64
EdNet         605763   13528   118     1388M   66


We assess the quality of the prior knowledge state embedded 
by the model by the initial knowledge state test. Without 
any user-specific record, the prior knowledge state embedded 
in the model should accurately reflect the average difficulty 
of the question to all users. Thus, we check Spearman rank 
correlation and Pearson correlation between the question’s 
average difficulty and the model-predicted prior belief for 
each question. 


In detail, a trained DLKT model $M$’s initial knowledge state
for a question $q_j$ can be represented as $P_M(c = 1 \mid \cdot, q_j)$. We
compare this with the question correctness rate over the
entire dataset as in Eq. 2, which is equivalent to the number
of correctly answered $q_j$-interactions over the number of
occurrences of $q_j$ based on all user data.


$$C_{q_j} = \frac{\sum_{u \in U} |\{ i : q_i^u = q_j,\ c_i^u = 1 \}|}{\sum_{u \in U} |\{ i : q_i^u = q_j \}|} \qquad (2)$$


Consequently, we measure: 


$$\mathrm{Corr}\left([P_M(c = 1 \mid \cdot, q_j)]_{q_j \in Q},\ [C_{q_j}]_{q_j \in Q}\right) \qquad (3)$$


This initial knowledge state test pinpoints whether the
learned question embedding in the DLKT model alone has
captured any information about the corresponding question’s
difficulty. Moreover, we emphasize the importance
of the initial knowledge state, since the state assumed by the
model would likely affect the user’s first impression of the
system as it makes decisions.
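The correlation check can be sketched in plain Python as below. The per-question correctness rate follows the verbal definition above (correct responses over occurrences); the `prior` dictionary stands in for a trained model's initial predictions and, like all names here, is a hypothetical placeholder rather than output from any real model.

```python
from collections import defaultdict
from math import sqrt


def correctness_rates(interactions):
    """Per-question rate: correct responses / occurrences over all users."""
    hits, counts = defaultdict(int), defaultdict(int)
    for _user, question, correct in interactions:
        counts[question] += 1
        hits[question] += correct
    return {q: hits[q] / counts[q] for q in counts}


def pearson(xs, ys):
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sqrt(sum((x - mx) ** 2 for x in xs))
    sy = sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)


def ranks(values):
    """1-based ranks; tied values receive the average of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    out = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        for k in range(i, j + 1):
            out[order[k]] = (i + j) / 2 + 1
        i = j + 1
    return out


def spearman(xs, ys):
    """Spearman rank correlation as the Pearson correlation of the ranks."""
    return pearson(ranks(xs), ranks(ys))


# Toy records: (user, question_id, correctness).
interactions = [("u1", 1, 1), ("u2", 1, 1), ("u1", 2, 0),
                ("u2", 2, 1), ("u3", 3, 0), ("u1", 3, 0)]
rates = correctness_rates(interactions)

# Hypothetical initial predictions of a model for each question.
prior = {1: 0.9, 2: 0.6, 3: 0.4}
questions = sorted(rates)
pearson_r = pearson([prior[q] for q in questions], [rates[q] for q in questions])
spearman_r = spearman([prior[q] for q in questions], [rates[q] for q in questions])
```

In practice a statistics library would normally be used for the correlations; they are written out here only to keep the sketch self-contained.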


3.2.4 Convergence Test 

In the convergence test, we assess whether the model’s knowledge
state value converges to a target value in the desired
manner. We generate simple virtual interaction sequence
data by repeating an identical interaction for 50 time-steps
for each question and for both correctness cases. Therefore, the
number of virtual user interaction sequences in the
virtual dataset is twice the number of questions.


In the experiments, we report the model’s standard AUC
metric at time-steps 5, 10, and 50. We expect significantly
high figures as the inquired interaction sequence is extremely
simple. We also visualize how the average model prediction
value across the questions evolves throughout the 50 time-steps
for each of the correctness cases. We expect the values
to quickly converge to 1 / 0 for interaction sequences
whose correctness labels are all correct / incorrect, respectively.
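The virtual dataset described above can be generated in a few lines. This is a minimal sketch; the function name and the `(question_id, correctness)` tuple layout are assumptions made for illustration.

```python
def virtual_sequences(question_ids, steps=50):
    """Yield one repeated-interaction sequence per (question, correctness) pair."""
    for question in question_ids:
        for label in (1, 0):  # all-correct case, then all-incorrect case
            yield [(question, label)] * steps


sequences = list(virtual_sequences([10, 11, 12]))
```

For three questions this yields six sequences of 50 identical interactions each, matching the "twice the number of questions" count in the text.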


4. EXPERIMENTS 

In this section, we benchmark three Deep Learning based 
Knowledge Tracing models DKT, SAKT, and SAINT on the 
proposed behavioral tests. First, we train optimized mod- 
els for each architecture-dataset pair by searching hyper- 
parameters on the train and validation data split. Second, 
we report the classification accuracy and the AUC met- 
ric, which are commonly used for model assessments in the 
Knowledge Tracing literature. Third, we present the pro- 
posed behavioral test results of model instances on well- 
known datasets for Knowledge Tracing. 


4.1 Datasets 


We describe the datasets used in our experiments. All datasets 
are open to the public. 


ASSISTments [11] is a dataset containing student interactions
from an online tutoring system for solving Massachusetts
Comprehensive Assessment System (MCAS) 8th grade Math test
questions. We use the datasets ASSISTments 2015 (Assistments15)
and ASSISTments Challenge 2017 (Assistments17).


STATICS is a dataset containing college student interactions 
on a one-semester Statics course. This dataset is available 
in the PSLC DataShop web site [16]. 


Spanish [20] is a dataset containing middle-school student
interaction data for Spanish exercises.


EdNet [6] is the largest public benchmark education dataset
containing user interaction data from an online tutoring system
for preparing for the TOEIC (Test of English for International
Communication®). For ablation studies on the size
of the dataset, we randomly choose 100,000 users for EdNet-medium
and 5,000 users for EdNet-small.


Table 3: Model Hyper-parameters


Model    Parameter             Tuning Details
Common   Adam learning rate    0.001, 0.003, 0.01
         Dropout rate          0, 0.25, 0.5
         Embedding dimension   64, 128, 256
         Maximum Seq. Length   100, 200, 400
DKT      # Recurrent Layers    1, 2, 4
SAKT     # Attention Layers    1, 2, 4
SAINT    Warm-up Steps         200, 400, 4000
         # Attention Heads     1, 4, 8


4.2 Models and Algorithms 


We perform hyper-parameter tuning on the training of mod- 
els. For each configuration of hyper-parameters, we choose 
the model weights with the best validation AUC. In the 
training step, an early-stopping policy is applied with pa- 
tience 30, which means that we stop the training process 
and save the best weights if there is no AUC improvement 
in the recent 30 validation steps. Among the best weights 
for each configuration, we choose the weight with the best 
validation AUC for each dataset, and evaluate the weights 
with an independent test set for test metrics. 
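The selection procedure above can be sketched as follows, with a stream of validation AUC values standing in for actual training. The function name, the configuration labels, and the AUC values are hypothetical, and the patience is shortened to 3 only so the toy lists trigger the stop.

```python
def early_stop(validation_aucs, patience=30):
    """Return (best_step, best_auc, stopped_at) for a stream of validation AUCs.

    Training stops once `patience` validation steps pass without improvement;
    `stopped_at` is None if the stream ends before that happens.
    """
    best_auc, best_step = float("-inf"), -1
    for step, auc in enumerate(validation_aucs):
        if auc > best_auc:
            best_auc, best_step = auc, step  # "save the best weights" here
        elif step - best_step >= patience:
            return best_step, best_auc, step
    return best_step, best_auc, None


# Made-up validation curves for two hyper-parameter configurations.
runs = {
    "lr=0.001": [0.70, 0.72, 0.71, 0.71, 0.71, 0.75],
    "lr=0.01": [0.69, 0.73, 0.74, 0.73],
}
results = {name: early_stop(aucs, patience=3) for name, aucs in runs.items()}

# Among the best checkpoints of each configuration, keep the best one.
best_config = max(results, key=lambda name: results[name][1])
```

Note that the first run stops before reaching its late 0.75 value, which is the usual trade-off of early stopping.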


4.2.1 Training Details 
In this study, we use DKT, SAKT, and SAINT in the experiments.
DKT models the student’s knowledge state using
a Recurrent Neural Network (RNN) by compressing the
interaction history in a hidden layer. SAKT is the first KT
model that used self-attention layers, where in each layer
the question embeddings are queries and the interaction
embeddings are keys and values. SAINT is the first KT model
based on Transformers. The sequence of questions is fed into
the encoder, and the sequence of responses is fed into the
decoder together with the encoder output.


Our model hyper-parameters are shown in Table 3. We 
use the Adam optimizer [15] with default parameters. For 
SAKT and SAINT, we used the Noam scheme for scheduling 
the learning rate, and tune the number of warmup-steps. 
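For reference, the Noam scheme in its commonly used form increases the learning rate linearly during warm-up and then decays it with the inverse square root of the step. The sketch below uses that standard formula; the `factor` argument and how the schedule is combined with the base Adam learning rates in Table 3 are assumptions, since the text does not specify them.

```python
def noam_lr(step, d_model, warmup_steps, factor=1.0):
    """Noam schedule: linear warm-up, then inverse-square-root decay."""
    step = max(step, 1)  # avoid 0 ** -0.5 at the very first step
    return factor * d_model ** -0.5 * min(step ** -0.5, step * warmup_steps ** -1.5)


# Example with an embedding dimension of 256 and 4000 warm-up steps.
before, peak, after = (noam_lr(s, 256, 4000) for s in (2000, 4000, 8000))
```

The rate peaks at `step == warmup_steps`, so tuning the number of warm-up steps shifts both the position and the height of the peak.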


The original SAKT implementation does not include a residual
connection from the query. This forces the first prediction
to be the same value every time, regardless of the first
question provided to the model. Since this would make the
initial knowledge state test of Section 3.2.3 redundant, we use
a modified SAKT architecture with a residual
connection. For SAINT and SAKT, the dimension of the
feedforward network is set to 4 × (model dimension). For
SAINT, we use the same number of attention layers for the
encoder and the decoder.


4.3 Results: Traditional Assessment 

AUC and accuracy results are shown in Tables 4 and 5. The
difference in these standard metrics among models is generally
less than 0.01. For KT-based tutoring systems, this difference would
be less important than behavioral performance. AUC reflects
how well predictions are ordered across the interactions of all
users, and accuracy does not depend on the exact predicted value.
On the other hand, behavioral tests can check the performance of
model prediction for a single user, and analyze the impact of a
single interaction.
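To illustrate why these two metrics say little about exact prediction values, the sketch below computes AUC as a pairwise ranking statistic and accuracy at a fixed threshold. The labels, scores, and the 0.5 threshold are illustrative assumptions, not values from the experiments.

```python
def auc(labels, scores):
    """Probability that a correct response is scored above an incorrect one."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def accuracy(labels, scores, threshold=0.5):
    """Share of interactions whose thresholded prediction matches the label."""
    hits = sum((s >= threshold) == bool(y) for y, s in zip(labels, scores))
    return hits / len(labels)


labels = [1, 0, 1, 0]
scores = [0.9, 0.8, 0.7, 0.1]
squared = [s ** 2 for s in scores]  # monotone distortion of the predictions

auc_original, auc_squared = auc(labels, scores), auc(labels, squared)
acc_original, acc_squared = accuracy(labels, scores), accuracy(labels, squared)
```

Squaring the scores leaves AUC unchanged because only the ordering matters, while accuracy changes only because one score crosses the threshold; neither metric tracks how a single user's predictions evolve.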


Table 4: Standard AUC metric


Model          DKT     SAKT    SAINT
Assistments15  0.7242  0.7226  0.7179
Assistments17  0.7742  0.7650  0.7680
STATICS        0.8269  0.8248  0.8275
Spanish        0.8336  0.8456  0.8364
EdNet-small    0.7332  0.7380  0.7328
EdNet-medium   0.7717  0.7760  0.7722
EdNet          0.7810  0.7905  0.7863
Average        0.7778  0.7804  0.7773


Table 5: Standard Classification Accuracy(%) 
Model DKT SAKT SAINT 
Assistments15 74.2 74.6 74.4 
Assistments17 72.1 71.0 71.8 
STATICS 81.4 81.2 81.1 
Spanish 81.9 82.6 82.0 
EdNet-small 68.2 70.2 69.8 
EdNet-medium 72.5 72.6 72.4 
EdNet 73.5 74.1 73.9 
Average 74.8 75.2 75.1 


4.4 Results: Behavioral Testing 
4.4.1 Perturbation Tests 


We report the test pass rates for insertion, deletion, and flip.
The results are shown in Tables 6, 7, and 8, respectively.
Figure 1 describes the average impact on model prediction 
from insertion perturbation on each dataset (column) and 
correctness label of the inserted interaction (row). Figure 2 
describes the degree of maximum impact over user sequences 
from insertion perturbation. 


Table 6: Insertion Test Pass Rates(%) 
Model DKT SAKT SAINT 
Assistments15 70.9 70.3 65.3 
Assistments17 69.6 55.7 56.7 
STATICS 71.1 61.0 58.2 
Spanish 80.1 75.6 60.7 
EdNet-small 66.3 78.0 75.9 
EdNet-medium | 83.2 80.6 77.3 
EdNet 72.7 71.6 71.2 
Average 73.4 70.4 66.5 


Table 7: Deletion Test Pass Rates(%) 
Model DKT SAKT SAINT 
Assistments15 69.0 66.3 62.1 
Assistments17 63.7 54.4 54.6 
STATICS 60.7 55.9 49.2 
Spanish 81.6 81.9 59.7 
EdNet-small 65.6 75.3 71.9 
EdNet-medium | 80.3 76.8 73.5 
EdNet 72.3 68.3 69.5 
Average 70.4 68.4 62.9 


Table 8: Flip Test Pass Rates(%) 
Model DKT SAKT SAINT 
Assistments15 77.1 96.3 94.7 
Assistments17 69.5 86.1 66.4 
STATICS 93.4 92.9 84.7 
Spanish 87.5 89.1 83.9 
EdNet-small 75.2 95.0 95.8 
EdNet-medium 87.8 94.8 95.5 
EdNet 79.5 83.6 86.0 
Average 81.4 91.1 86.7 


• In general, deletion and insertion pass rates range from
60% to 80%, and flip pass rates range from 80% to
90%. Note that a flip can be interpreted as a combination
of deletion and insertion. Therefore, the impact
of the perturbation is expected to be larger, leading
to higher pass rates compared to the insertion/deletion
cases. From Figure 1, Figure 6 (Appendix), and Figure
7 (Appendix), we note that the degree of impact from
replacement (flip) is twice that from insertion or deletion.


• Robustness: From Figure 1, we observe that the degree
of average impact from perturbation gradually decreases
along the time-steps in general, and that the
average impact is limited to only about 2%. Therefore,
the desired robustness holds in terms of average
impact.


• Monotonicity: From Figure 1, the average impact from
positive/negative perturbation tends to remain
positive/negative, respectively, for the Assistments15,
EdNet-small, EdNet-medium, and EdNet datasets. On the
Spanish dataset, this trend is noisier for the SAINT
network.


• On the Assistments17 and Statics datasets, the expected
monotonic behavior from SAKT and SAINT is not
observed, as the average impact oscillates across zero.
This can also be seen from DKT’s significantly higher
pass rates on the two datasets in Table 6.


• From Figure 2, we note that there exist questions
whose predictions continue to respond to a large degree
(even up to 80%) after 40 time-steps. Note that on the EdNet
dataset, both transformer-based architectures SAKT
and SAINT allow larger impacts from perturbations
than DKT. This can be explained by the superior
performance of the two models on EdNet data over DKT
in terms of the standard evaluation metrics AUC and ACC.


Figure 1: Perturbation Test: Average Impact on Model Prediction from Insertion.
Figure 2: Perturbation Test: Maximum Degree of Impact on Model Prediction from Insertion.


4.4.2 Continuity Test


We report average and maximum step-wise change in KS 
score over students in Table 9. Apart from the single-step 
change, we also measure final change of score from the first 
time-step to the last, and the total range of score explored 
throughout the time-steps, averaged over all students in Ta- 
ble 10. Sum of absolute change in KS coordinates, or Man- 
hattan distance of KS’s along time-steps (averaged over all 
students) is shown in Figure 3. EdNet-medium was omitted 
due to its similarity with the plot from EdNet-small. 


• Except for Assistments17 and Statics, the average score
change per single time-step or interaction remains
reasonably low, below 5%, for all architectures. This
suggests that the knowledge state is fairly stable across
the time-steps.


• On Assistments17 and Statics, we observe significantly
larger changes, especially in DKT. DKT’s maximum
score change across a single time-step is as high as 85%
and 65% for Assistments17 and Statics, respectively.


• In general, we observe a decreasing marginal impact of
each interaction as time proceeds.


• From Figure 3 and Table 9, we note that SAKT’s
knowledge state changes significantly less than that of the
other models, consistently throughout all datasets. We also
investigated whether this ’speed’ of change affects the
total ’dislocation’ of the knowledge state in Table 10.
Interestingly, SAKT’s knowledge state moved the
farthest on average (13.77%) while its range explored was
the smallest (30.68%) on average. This suggests that
SAKT’s knowledge state evolution was the least volatile.


Figure 3: Continuity Test: Average Change in KS along Time-steps


Table 9: Continuity Test: Average / Maximum Student Score Change (%) per Single Time Step

Model  Stat   Assist15  Assist17  STATICS  Spanish  EdNet-small  EdNet-med  EdNet  | Average
DKT    Avg    2.45      10.48     6.39     2.93     2.04         2.14       2.27   | 4.10
       Max    16.82     84.97     65.09    20.33    16.98        17.06      33.68  | 36.42
SAKT   Avg    1.97      3.92      1.15     1.68     1.29         1.19       1.48   | 1.81
       Max    15.65     56.60     20.24    52.92    21.46        19.21      22.53  | 29.80
SAINT  Avg    4.64      7.99      2.73     3.48     2.02         1.92       1.29   | 3.44
       Max    31.80     53.01     24.18    38.74    16.17        22.12      17.19  | 29.03


Table 10: Continuity Test: Average Student Score Total Change (%) / Total Range (%) over 100 Interactions

Model  Stat   Assist15  Assist17  STATICS  Spanish  EdNet-small  EdNet-med  EdNet  | Average
DKT    Diff   10.14     11.50     9.94     18.30    9.23         12.82      10.83  | 11.82
       Range  24.72     63.42     47.54    46.54    22.89        27.41      28.37  | 37.27
SAKT   Diff   15.96     13.97     15.67    18.36    10.48        13.00      8.95   | 13.77
       Range  30.98     40.16     22.08    53.33    23.70        23.80      20.69  | 30.68
SAINT  Diff   13.34     11.62     6.97     20.53    11.32        5.30       9.01   | 11.15
       Range  37.98     49.96     26.19    51.51    22.98        24.21      20.59  | 33.35


4.4.3 Initial Knowledge State Test 
To assess the validity of the initial knowledge states embedded
in the model, we measured the correlation between the predicted
prior knowledge state and the global question difficulty, as
described in Section 3.2.3. The results are presented in Table
11. In the scatter plot of Figure 4, we choose the 200 most
frequently answered questions from each dataset to show
how the initial model predictions and question correctness
rates are distributed and correlated.


• We observe from Table 11 that all models’ initial
knowledge states are positively correlated with the global
question difficulty with statistical significance.


• The difference in correlation metrics among datasets
is much more significant than that among models.


• Based on the three scatter plots of the first row in the
figure, we note that the correlation becomes stronger
as the size of the dataset grows from EdNet-small to the full
EdNet data. Table 11 and Table 2 also suggest that
the number of interactions per unique question is
positively correlated with the initial knowledge state test
metric.


• From the scatter plot, we see that the three models
occupy slightly different clustering regions in the plot.
For instance, on the EdNet-medium and EdNet datasets,
SAINT’s initial prediction value is consistently larger
than that of the other two models, which suggests that an
ensemble of the models could reduce bias.


Table 11: Initial Knowledge State Test: Correlation (%)

Model          DKT   SAKT  SAINT
Assistments15  85.6  84.8  82.5
Assistments17  63.2  52.2  58.2
STATICS        56.1  58.1  54.6
Spanish        56.5  49.1  44.0
EdNet-small    39.0  38.5  31.6
EdNet-medium   75.1  74.2  74.4
EdNet          87.6  86.5  88.2
Average        66.1  63.3  61.9


4.4.4 Convergence Test 

As the dataset we generate and use for the convergence test
is extremely simple, as described in Section 3.2.4, we expect
the KT models’ standard performance metrics to increase
quickly along the time-steps. For instance, at the fifth
time-step, the model would have already received four identical
interaction records with the same question and the same
correctness label for the virtual student. We report the model
AUC at time-steps 5, 10, and 50 in Table 12. In Figure 5 we
also visualize the model’s average response across different
questions for each assumed correctness label value. We
expect the average response plot to quickly converge to either
1 or 0 based on the assumed correctness label value.


• In general, all models show fairly high performance
from the early time-step of 5, except on the Assistments17
dataset.


• Both Table 12 and Figure 5 suggest that SAKT consistently
achieves the fastest convergence to a reasonable value close
enough to either 1 or 0. DKT, however, consistently
converges to a value farther from the two edges
compared to the other two models. In particular, for
the incorrect case (second row) of Figure 5, we
observe that DKT converges to a value higher than 50%
(red dotted horizontal line). For the positive case (first
row), DKT converges to a correctness probability level
of only around 70% for Assistments17, EdNet-small, and
EdNet-medium.


• DKT’s convergence pattern is fairly monotonic, while
SAKT’s and SAINT’s patterns go through fluctuation,
which likely pertains to noise.


• It is noteworthy that increasing the dataset size from
EdNet-small to EdNet-medium and EdNet significantly helps
all three models’ convergence behavior on both target
correctness values, especially for DKT. DKT’s convergence
value moved significantly closer to the desired values
of 1 and 0. For SAKT and SAINT, a larger dataset size
led to a more stable response plot, reducing the degree
of fluctuation.


• Convergence in the incorrect case and the correct case
is asymmetric. While the latter closely achieves the
target value of 1, the former converges around the
30% level on most datasets. We attribute this to the
tutorial content embedded in each interaction,
along with the question item used for assessment in
the dataset.


4.5 Overview of Experimental Results 

Based on the proposed DLKT validation framework, we
conducted a comprehensive investigation of three popular
DLKT models on seven benchmark datasets to scrutinize
the models’ behavioral characteristics. The results highlight
strengths and weaknesses of the three DLKT models. Although
the DLKT models demonstrated stable and robust behaviors
in line with expectation on most datasets, the results
revealed a few major disadvantages for each model. DKT
showed better stability in perturbation tests, while the other
architectures occasionally presented volatile fluctuation in
the response curve. In the continuity test, SAKT presented
a significantly smoother evolution of knowledge state, but the
other models’ knowledge state representations were seemingly
volatile on a few datasets, which strongly hinders
DLKT’s adoption. On the other hand, this suggests room
for improving DLKT models based on the specific issues
pinpointed by this framework. For instance, the volatility of KS
could be alleviated by direct regularization of the change in
the KS. Finally, the results from the convergence
test showed that DKT was fragile even to simple edge-case
data, which undermines the generalization capability of DKT
compared to the attention-based architectures.


These behavioral characteristics identified from the proposed
framework show that the two popular architectural paradigms,
RNN-based and attention-based, possess different strengths and
weaknesses in the KT setting. This also hints that an
architectural combination or ensemble approach might
alleviate the identified issues and improve both standard KT
model evaluation metrics and behavioral characteristics.


5. CONCLUSION 


In this work, we introduced the desired properties of knowledge
tracing models and proposed a novel model validation
framework for Deep Learning based Knowledge Tracing
(DLKT) models. Using the framework, we conducted a
comprehensive analysis of three popular DLKT models’
behavioral characteristics and identified the models’ strengths
and weaknesses on seven different benchmark datasets.
We believe that the analysis of both the strengths and
weaknesses diagnosed by the framework will serve as a useful
guideline for model enhancement. Also, based on the findings
from the proposed framework, a customized adoption of
DLKT models that fits the nature of the data and the desired
behaviors as well as accuracy becomes possible.


We believe potential future work includes: (1) tackling the
weaknesses of DLKT models identified in this work via
architectural modification or model combination, (2) exploring
the benefit of data augmentation using virtual edge-case
data similar to the converging interaction data used in the
convergence test, and (3) extending the proposed testing
framework beyond the task of knowledge tracing (e.g., student
score prediction and item recommendation).
Figure 5: Convergence Test: Average Model Prediction along Converging Interaction Sequence


Table 12: Convergence Test AUC

Step  Model  Assist15  Assist17  STATICS  Spanish  EdNet-small  EdNet-med  EdNet  | Average
5     DKT    0.867     0.601     0.829    0.688    0.622        0.790      0.845  | 0.749
      SAKT   0.885     0.615     0.903    0.805    0.869        0.853      0.861  | 0.827
      SAINT  0.866     0.609     0.936    0.731    0.628        0.881      0.854  | 0.786
10    DKT    0.932     0.634     0.924    0.745    0.679        0.874      0.932  | 0.817
      SAKT   0.961     0.782     0.928    0.923    0.953        0.928      0.951  | 0.918
      SAINT  0.947     0.763     0.979    0.856    0.757        0.958      0.954  | 0.888
50    DKT    0.979     0.695     0.983    0.791    0.744        0.954      0.990  | 0.876
      SAKT   0.998     0.938     0.942    0.993    0.997        0.994      0.995  | 0.980
      SAINT  0.995     0.939     0.999    0.977    0.963        0.997      0.997  | 0.981


Figure 6: Perturbation Test: Average Impact on Model Prediction from Deletion.
ABSTRACT


Predicting student problem-solving strategies is a complex 
problem but one that can significantly impact automated 
instruction systems since they can adapt or personalize the 
system to suit the learner. While for small datasets, learning 
experts may be able to manually analyze data to infer stu- 
dent strategies, for large datasets, this approach is infeasible. 
We develop a Machine Learning model to predict strate- 
gies from student data. While Deep Neural Network (DNN) 
based methods such as LSTMs can be applied for this task, 
they often have long convergence times for large datasets 
and like several other DNN-based methods have the inher- 
ent problem of overfitting the data. To address these issues, 
we develop a Neuro-symbolic approach for strategy predic- 
tion, namely a model that combines strengths of symbolic AI 
(that can encode domain knowledge) with DNNs. Specifi- 
cally, we encode relationships in the data using Markov Logic 
and use symmetries among these relationships to train an 
LSTM more efficiently. In particular, we use an importance 
sampling approach where we sample the training data such 
that for clusters/groups of symmetrical instances (instances 
where the strategies are likely to be symmetric), we only 
pick representative samples for training the model instead 
of using the whole group. Further, since some groups may 
contain more diverse strategies than the others, we adapt the 
importance weights based on previously observed samples. 
Through empirical evaluation on the KDD EDM challenge 
datasets, we show the scalability of our approach. 


Keywords 
Intelligent Tutoring Systems, Learning Strategies, Neuro- 
Symbolic AI, Markov Logic Networks, LSTMs 


1. INTRODUCTION 


Intelligent Tutoring Systems (ITSs) [31] and more broadly
adaptive instructional systems (AISs)¹ help a diverse
population of students by adapting instruction to each learner,
thus accounting for different learning abilities, learning styles
and education goals. Such adaptation leads to more engaging
and effective learning. However, in order to build effective
ITSs, it is important to understand how students learn
and what learning and instructional strategies are most
effective for whom and under what conditions. Specifically,
students can follow several different strategies to learn the
same content. For example, consider a simple math problem
about solving a linear system of equations where x + y + z = 9,
x = y, and y = z. One strategy is to perform systematic
substitutions till there is an equation in terms of one variable
and simply solve for that variable. However, another
strategy could be to use transitivity to see that all three variables
have the same value and use this to solve the problem.
Depending upon the way a student thinks, one strategy could
be easier or harder to grasp compared to the other. Thus,
understanding the various ways in which students approach
an instructional task will not only further our understanding
of how learners learn, e.g., it may help identify the most
effective learning strategies employed by top performers, but
it will also enable ITSs to incorporate knowledge about these
strategies in order to adapt appropriately and help students
maximize their learning gains.


¹The main difference between Intelligent Tutoring Systems
and Adaptive Instructional Systems, at least in our view, is
that the former offer full-adaptivity, i.e., both micro- and
macro-adaptivity, whereas the latter can offer any type, e.g.,
just macro-adaptivity.


A student’s choice of strategy is a complex function dependent
on many factors such as experience with similar problems,
general expertise in the topic, other cognitive abilities,
etc. Human experts are exploring all these factors and
how they are related to strategy use and learning. However,
human experts are expensive and limited in the ability to
analyze large data from thousands, tens of thousands, or
millions of students. Advanced data science methods and
access to large computing infrastructure such as the cloud
offer new possibilities to analyze in-depth and at scale such
large learner datasets with the promise of helping us discover,
document, and benefit from the diversity of learning.


Indeed, with the growth of both data and advanced Machine
Learning methods such as Deep Neural Networks (DNNs),
we are able to successfully solve several challenging problems
in domains such as natural language understanding [25]
and visual processing [8]. However, the ability of DNNs
to learn complex functions and representations comes at a
cost. Specifically, DNNs require significant computational
resources to scale up to large datasets and at the same time,
as data increases, they may not always yield expected re- 
sults since they have a tendency to overfit the data. That 
is, they work well on the data on which they were trained 
but their generalization performance on unseen data severely 
degrades. A paradigm that is gaining significant attention 
in the AI/Machine Learning research community is Neuro- 
symbolic AI [7] where we augment DNNs with symbolic 
models to regularize the DNN. This helps in improving both 
scalability and generalization by allowing DNNs to learn 
from smaller datasets with higher accuracy. In this paper, 
we apply a Neuro-symbolic model to predict student strate- 
gies from structured student-interaction data. 


Our model works on student data where the student inter- 
acts with an ITS in discrete steps. The strategy prediction 
task in this case can be formulated as a sequence learning 
problem where we want to learn to predict the sequence of 
steps a student is likely to follow for a given problem. Know- 
ing this sequence will provide the ITS prior information that 
it can use to adapt, e.g., by better tailoring the available 
hints and feedback. Sequence learning is applicable to many 
tasks in general with one of the most popular applications 
being machine translation where the goal is to translate a 
sequence of words from one language to another language. 
A widely used traditional approach that has been applied 
in sequence learning is Hidden Markov Models (HMMs). 
However, this assumes a one-step dependency where each 
step depends only upon the prior step. Student strategies 
in problem solving though are much more complex where a 
student action in one step can influence several downstream 
steps. Long Short-Term Memory networks (LSTMs) [11] are DNNs
that can model such long-term dependencies in sequences. 
However, LSTMs are known to take an extremely long time 
to train for large datasets [39]. To address this, we propose 
a Neuro-symbolic model where we combine the semantics of 
a symbolic AI model called Markov Logic [9] with LSTMs. 
Markov Logic encodes domain-knowledge using first-order 
logic formulas. The formulas establish relationships in the 
data that can be described in the form of a graph structure. 
Our approach learns symmetries between instances based 
on the graph structure and then uses these symmetries to 
train an LSTM more efficiently. Specifically, we use impor- 
tance sampling to choose a subset of training instances to 
make learning more efficient. While importance sampling- 
based learning in DNNs has been used previously to scale 
up training [1, 13, 14], most existing approaches determine 
the importance of a training instance by estimating the gra- 
dient norm which is computationally expensive [14]. In our 
case, we determine importance of a training data instance 
based on symmetries in the MLN graph. Specifically, the 
idea is that if several strategies are likely to be symmetrical 
then we can learn more efficiently from a smaller subset of 
strategies instead of the whole training set. To do this, we 
learn an embedding such that problem instances which have 
symmetries in the graph have similar vector representations 
in the embedding. We then cluster the embedding vectors 
and sample instances from each cluster to train the model. 
The idea is to sample a subset with the same overall distribution
of strategies as the original dataset. That is, we end up with
a smaller dataset that preserves much, if not all, of the
information in the original dataset. However, since the clustering is
approximate, we may end up with some clusters where the 


strategies are likely to be more diverse than others. There- 
fore, we adaptively train the model by updating importance 
weights for the clusters in iteration t+1 based on the trained 
model in iteration t. Specifically, we sample more data
instances from a cluster in iteration t+1 when the model trained in
iteration t has lower accuracy for instances sampled from
that cluster.


We evaluate our approach on the publicly available KDD 
EDM challenge datasets [34]. We compare our approach 
with HMMs and pure LSTM methods that do not use sym- 
metries in training the data and show that our proposed 
Neuro-symbolic model is more accurate and scalable, where 
we obtain high prediction accuracy by focusing on a small 
fraction of the training data. 


2. RELATED WORK 


Ritter et al. [30] provide a comprehensive survey on different 
approaches used to identify student strategies. Model trac- 
ing based tutors [4] have been previously used to identify 
strategies. In such cases, strategies may be pre-specified 
and the tutor can recognize correct and incorrect strate- 
gies. Model-tracing based methods have also been adapted 
to recognize new strategies [29]. Sequence learning has been 
widely used for strategy identification. Specifically, in Open 
Ended Learning Environments such as Betty’s brain [23], 
student activities were captured in logs and sequence pattern
mining methods were used on these logs to extract action
sequences which in principle are similar to sequences
that we consider in this paper. Different types of strategies 
based on these sequences were analyzed in multiple stud- 
ies [16, 17, 18] which also mapped these sequences to perfor- 
mances to compare and analyze strategies followed by high 
performers to those followed by low performers. Sequence 
learning has also been used to extract strategies for
self-regulated learning [3]. More recently, a study performed
large-scale sequence pattern mining on MOOC platforms to
analyze activity sequences of learners [37]. Further, in the
context of conversational tutors, tutorial dialogues can be
treated as a sequence of actions based on language-as-action
theory [2, 33]. These sequences, which are akin to strategies,
are mapped into a taxonomy by education experts [26]. Ap- 
proaches have been developed to recognize these sequences 
from natural language interactions to help automated tutors 
understand successful strategies to guide a student. In par- 
ticular, sequence learning methods have been used for this 
task as well [32, 24]. Symbolic models such as Markov Logic 
have also been applied for recognizing these sequences us- 
ing joint inference [36]. In general, Neuro-symbolic models 
have gained prominence recently and have found applica- 
tions in problems that have graph structure. In [22], the 
authors provide a detailed survey of Neuro-symbolic mod- 
els using graph neural networks. In complex problems such 
as visual question answering that require connections be- 
tween language and image processing, Neuro-symbolic mod- 
els have performed better than pure neural network based 
methods [38]. Our proposed application in this paper is 
further validation that Neuro-symbolic AI is a promising di- 
rection to solve complex problems. 


3. SEQUENCE LEARNING MODELS 


Student strategies can be defined in different ways. In par- 
ticular, the definition of what constitutes a strategy also 
depends upon the type of interaction the student has with
the AIS. In our case, we only consider structured interac- 
tions with discrete steps. Therefore, we define the student 
strategy as a function of the sequence of steps in the in- 
teraction. Each step is characterized by the central concept 
that is utilized to solve that specific step, i.e., the knowledge 
component [20] (KC) used in that step. Therefore, we define 
strategy in our case as a sequence of KCs used by a student 
in a problem-solving session. Note that this formulation of
strategy as a sequence of discrete components is similar to 
the definitions used in prior work [30]. Formally, 


DEFINITION 1. Given a student $s$ and a problem $p$, we
define the strategy as $X_{s,p} = K^{(s,p)}_1, K^{(s,p)}_2, \ldots$, where $K^{(s,p)}_i$ is the
knowledge-component that $s$ uses to solve the i-th discrete
step in $p$.


We can now formulate a learning problem as follows. Given
training data $\{\{X_{s_i,p_j}\}_{j=1}^{m_i}\}_{i=1}^{n}$, where $n$ is the number of
students and $m_i$ is the number of problems solved by the i-th
student, we learn a model P such that for a student $s'$ and
problem $p'$, P generates a sequence of knowledge components
$K^{(s',p')}_1, K^{(s',p')}_2, \ldots$. Note that students sometimes use
more than one KC per step. In this case, we simply unroll these
multiple KCs by repeating the step with each of the KCs.
Therefore, for the rest of this paper, we treat steps with multiple
KCs and steps with a single KC without distinguishing
them. Also, to keep notation simpler, instead of adding the
subscripts $s_i, p_j$ each time, we denote the training input and
output pairs as $\{x_i, y_i\}$.
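The unrolling of multi-KC steps described above can be illustrated with a minimal sketch. The function name `unroll_strategy` and the KC labels are hypothetical and chosen only for illustration; the sketch assumes each step is logged as a list of KC names.

```python
from typing import List, Sequence


def unroll_strategy(steps: Sequence[Sequence[str]]) -> List[str]:
    """Flatten per-step KC lists into a single KC sequence (the strategy)."""
    strategy: List[str] = []
    for kcs in steps:
        # A step that uses several KCs is repeated once per KC.
        strategy.extend(kcs)
    return strategy


# Hypothetical interaction log: the second step uses two KCs.
steps = [["find-lcm"], ["multiply", "simplify"], ["enter-answer"]]
strategy = unroll_strategy(steps)
print(strategy)
```

After unrolling, single-KC and multi-KC steps are indistinguishable in the resulting sequence, which matches the convention adopted in the text.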


3.1 HMM Model 

A popular model that is often used to learn from sequential 
data is the Hidden Markov Model (HMM), where we assume 
that every time-step is dependent on the previous time-step. 
HMMs are generative models represented using a dynamic 
Bayesian network. Each step in the time series is encoded by 
a hidden-state variable in the Bayesian network. We connect 
the hidden node corresponding to time-step $j$ to the hidden
node $j+1$ in the network and the observed feature at step $j$ is
also connected to the hidden node at step $j$. In our case, we
encode the HMM as follows. Let $O_t$ be the random variable
representing the knowledge component at step $t$. We encode
the knowledge component at step $t$ as a vector $o_t$ (the value
of the random variable $O_t$) using a one-hot encoding. Let
$Z_t$ be the hidden-state variable corresponding to step $t$. The
emission probability is given by $P(O_t \mid Z_t)$ and the transition
probability is given by $P(Z_t \mid Z_{t-1})$, i.e., the hidden state at
time $t$ depends only upon the hidden state at time $t-1$. The
transition matrix is a $k \times k$ matrix that specifies transition
probabilities to a state at time $t$ given any other state at
time $t-1$. Here, $k$, which specifies the number of hidden
states, is pre-defined in the model.


The learning task in the HMM is to estimate the parameters 
of the HMM, namely, the transition matrix and the param- 
eters of the emission probability distributions. Note that we 
need to estimate conditional probability conditioned on each 
possible state of the latent variable. To do this, we assume a 
Gaussian distribution represents the probability of each hid- 
den state and the emission probabilities are also Gaussian 
distributions. Using the EM (Expectation Maximization) 


algorithm, we compute the parameters of the distributions
using Max-likelihood estimation, which is guaranteed to
converge to a local optimum. For predicting the strategy,
we sample a KC at the first time step. Then, given a KC
at any time step $t$, we generate the KC at time step $t+1$
as follows. Based on the observation, i.e., the KC at step $t$, we
predict an equivalent hidden state representation at step $t$
and, using the transition probability matrix, we sample the
hidden state at time step $t+1$. We then predict the KC at
time step $t+1$ using the emission probability.
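The three-stage generation procedure above (observation to hidden state, hidden state transition, emission) can be sketched as follows. This is a simplified illustration, not the paper's implementation: it assumes categorical emission tables instead of the Gaussian emissions used in the paper, and the parameter values, state names and KC labels are hand-set and hypothetical.

```python
import random
from typing import Dict, List

Dist = Dict[str, float]


def sample(dist: Dist, rng: random.Random) -> str:
    """Draw one key of `dist` with probability proportional to its value."""
    keys = list(dist)
    return rng.choices(keys, weights=[dist[k] for k in keys], k=1)[0]


def next_kc(kc: str, transition: Dict[str, Dist],
            emission: Dict[str, Dist], rng: random.Random) -> str:
    # Stage 1: map the observed KC to the hidden state that best explains it.
    state = max(emission, key=lambda z: emission[z].get(kc, 0.0))
    # Stage 2: sample the hidden state at t+1 from the transition row.
    next_state = sample(transition[state], rng)
    # Stage 3: predict the KC at t+1 as the most probable emission.
    return max(emission[next_state], key=emission[next_state].get)


def generate_strategy(first_kc: str, length: int, transition: Dict[str, Dist],
                      emission: Dict[str, Dist], rng: random.Random) -> List[str]:
    strategy = [first_kc]
    while len(strategy) < length:
        strategy.append(next_kc(strategy[-1], transition, emission, rng))
    return strategy


# Toy, hand-set parameters (hypothetical; the paper estimates them with EM).
transition = {"z0": {"z0": 0.1, "z1": 0.9}, "z1": {"z0": 0.8, "z1": 0.2}}
emission = {"z0": {"find-lcm": 0.9, "multiply": 0.1},
            "z1": {"find-lcm": 0.2, "multiply": 0.8}}
strategy = generate_strategy("find-lcm", 5, transition, emission, random.Random(0))
print(strategy)
```

Because each prediction depends only on the previous KC, the sketch also makes the one-step dependency of the HMM explicit.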


3.2 LSTM Model 

One of the problems with HMMs is that they have restric- 
tive assumptions, i.e., each step depends only on the pre- 
vious step. Ideally we would like to consider the student’s 
activity across several steps to determine what his/her next 
step in the strategy is likely to be. For instance, suppose 
we have a student who works out a problem using a divide- 
and-conquer strategy, then there may be several small sub- 
problems that the student solves before combining them
together. In this case, an HMM model that simply looks at the
previous step performed by a student may be able to capture 
the local strategy but will typically be unable to infer the 
global strategy since the dependencies may run across sev- 
eral steps. Therefore, to infer such advanced strategies, we 
need a more sophisticated model that captures longer-term 
dependencies. Long Short-Term Memory networks (LSTMs) [11] are
a variant of recurrent neural networks that have been used 
successfully for several problems like modeling text data. 
In particular, LSTMs can exploit longer range dependen- 
cies across words/sentences to learn a latent representation 
of sentences/documents. In our case, we apply LSTMs to 
learn a latent representation of the strategy. 


Unlike HMMs, LSTMs are discriminative models that pre- 
dict an output for step t based on the features observed 
in step t as well as a hidden state vector that summarizes 
the information up to step $t-1$. Note that a bi-directional
LSTM can also consider information in steps succeeding $t$.
To learn an LSTM for our task, we construct a tensor
$T \in \mathbb{R}^{m \times n \times k}$, where $m$ is the number of training instances, $n$
is the number of steps and $k$ is the dimensionality of features
representing each training instance $(s, p)$. Note that
we can represent variable-length strategies using special
Start and Stop symbols in the LSTM to denote the start and
end of strategies. The output of the LSTM for the t-th step
is a vector representing the KC at step $t$. The final hidden
state vector summarizes information for the full strategy for 
an input instance. 
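As a rough illustration of how variable-length strategies can be brought into a fixed-size three-dimensional array, the sketch below one-hot encodes KC sequences delimited by Start and Stop symbols. The symbol strings, the choice to pad short sequences with the Stop symbol, and the function name `encode_strategies` are assumptions made for this example and are not taken from the paper.

```python
from typing import List, Sequence, Tuple

START, STOP = "<start>", "<stop>"  # hypothetical marker strings


def encode_strategies(
    strategies: Sequence[Sequence[str]],
) -> Tuple[List[List[List[float]]], List[str]]:
    """Return an m x n x k nested list of one-hot rows and the KC vocabulary."""
    vocab = [START, STOP] + sorted({kc for s in strategies for kc in s})
    index = {kc: i for i, kc in enumerate(vocab)}
    n_steps = max(len(s) for s in strategies) + 2  # room for Start and Stop
    tensor = []
    for s in strategies:
        seq = [START] + list(s) + [STOP]
        # Assumption: shorter strategies are padded by repeating Stop.
        seq += [STOP] * (n_steps - len(seq))
        tensor.append([[1.0 if index[kc] == j else 0.0
                        for j in range(len(vocab))] for kc in seq])
    return tensor, vocab


tensor, vocab = encode_strategies([["a", "b", "c"], ["a"]])
print(len(tensor), len(tensor[0]), len(tensor[0][0]))
```

Here the first axis indexes instances, the second indexes steps, and the third indexes the one-hot KC dimension.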


4. NEURO-SYMBOLIC MODEL 


Though an LSTM can be directly used to learn a model 
for strategy prediction, it has certain limitations. LSTMs 
are known to converge very slowly for large datasets [39]. 
Further, LSTMs treat each instance in the training data
as i.i.d. (independent and identically distributed), which is
limiting when there are underlying relationships among the
instances. For example, problems are related to each other if
solved by the same student, KCs used by the same student in
similar problems are related to each other, etc. We address
these limitations by combining LSTMs with a symbolic model.


Neuro-symbolic AI [7], namely, combining symbolic AI models
with DNNs has gained significant attention in tasks such
as Visual Question Answering [38]. Neuro-symbolic models 
augment deep learning with knowledge from symbolic mod- 
els. This can help learn deep models more efficiently even 
with limited labeled data [35]. Further, augmenting DNNs 
with symbolic models also controls overfitting. Specifically,
since DNNs are highly expressive models, they are known to
sometimes overfit the training data, particularly when the
training data is non-diverse. For this reason, in many problems
such as image classification, data augmentation methods [5] seek to
increase data diversity. In our Neuro-symbolic approach, we
represent relationships in our dataset using the language of 
Markov Logic [9]. Based on symmetries in the relationships, 
we sample a smaller, more diverse training dataset to train 
the LSTM efficiently. 


4.1 Markov Logic 


Markov Logic [9] is a symbolic AI model designed as a 
representation and reasoning language for relational data. 
Markov Logic specifies relationships using the language of 
first-order logic. Each formula in Markov logic specifies a 
logical relationship using variables which can be substituted 
by symbols (also called constants or objects) from the data. 


For example, we can model the fact that if two problems are
similar then they require the same KC as a formula such as
$\mathrm{KC}(p_1, k) \wedge \mathrm{Similar}(p_1, p_2) \Rightarrow \mathrm{KC}(p_2, k)$. We can substitute
this general formula using symbols in the data, say two problems
$P_1$ and $P_2$ and KC $K$, to obtain the formula
$\mathrm{KC}(P_1, K) \wedge \mathrm{Similar}(P_1, P_2) \Rightarrow \mathrm{KC}(P_2, K)$. The formula that is
substituted with the symbols is called a ground-formula, using
terminology from first-order logic. The predicates that
are substituted with symbols are called ground-atoms, e.g.,
$\mathrm{KC}(P_1, K)$ is a ground atom of predicate KC.
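To make the grounding step concrete, the following minimal sketch enumerates the ground formulas of the example rule by substituting constants for its variables. The tuple-based atom representation and the helper names are hypothetical, and skipping groundings that pair a problem with itself is an assumption made to keep the output small. Following the worked example in this section, a grounding is treated as activated when all of its ground atoms are asserted by the data.

```python
from itertools import product
from typing import List, Sequence, Set, Tuple

Atom = Tuple[str, str, str]  # (predicate name, first argument, second argument)


def ground_formula(problems: Sequence[str], kcs: Sequence[str]) -> List[List[Atom]]:
    """Ground KC(p1,k) AND Similar(p1,p2) => KC(p2,k) over the given constants."""
    groundings = []
    for p1, p2, k in product(problems, problems, kcs):
        if p1 != p2:  # assumption: skip trivial groundings with p1 == p2
            groundings.append([("KC", p1, k), ("Similar", p1, p2), ("KC", p2, k)])
    return groundings


def is_activated(grounding: List[Atom], facts: Set[Atom]) -> bool:
    """A grounding is activated when the data asserts every atom in it."""
    return all(atom in facts for atom in grounding)


facts = {("KC", "P1", "K"), ("Similar", "P1", "P2"), ("KC", "P2", "K")}
groundings = ground_formula(["P1", "P2"], ["K"])
activated = [is_activated(g, facts) for g in groundings]
print(activated)
```

Even in this toy setting the number of groundings grows with the product of the constant sets, which hints at why the full ground graph becomes very large on real datasets.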


A Markov Logic Network (MLN) can be represented as a 
graph where the relationship encoded in each ground for- 
mula is represented by a clique in the graph. A clique is 
activated if the logical relationship specified by the ground- 
formula corresponding to that clique is evaluated to True in 
the data. For instance, in the above example, the clique corresponding
to $\mathrm{KC}(P_1, K)$, $\mathrm{Similar}(P_1, P_2)$ and $\mathrm{KC}(P_2, K)$
is activated if the data asserts the logical relation that problems
$P_1$ and $P_2$ use KC $K$ and problem $P_1$ is similar to
$P_2$. In MLN semantics, each activated clique represents a
function parameterized by a weight attached to the ground
formula. The full graph represents an undirected probabilistic
graphical model [21]. For large datasets in real-world
applications such as ours, the number of ground formulas
becomes very large, resulting in an extremely large graph.
In the standard use of MLNs, we learn parameters for the
MLN based on Max-Likelihood Estimation (MLE). However,
computing the gradient for MLE is infeasible when the
graph is constructed from large datasets such as ours. Note,
though, that parameters for the MLN are required only if
we want to use the MLN directly for probabilistic inference,
i.e., when we want to answer queries using the MLN. While
this is certainly desirable, it is well-known that MLN infer- 
ence/learning algorithms cannot scale up to large datasets 
and perform extremely poorly in such cases [15]. Therefore, 
in our case, we make a simplifying assumption in the MLN 
that all the parameters for the graphical model have uniform 
weights. Thus, in our Neuro-symbolic model, Markov Logic 


Figure 1: Illustrating symmetries in the MLN object graph. 
The graph represents symbols/objects in the MLN and the 
edges represent connection between variables, i.e., if objects 
appear together in a formula, they are connected in the 
graph. In (a) student $S_1$ can be exchanged with student
$S_2$ and problem $P_1$ with $P_2$ to get an isomorphic graph. In
(b) if the KCs $K_2'$ and $K_2''$ are similar, then the exchange is
approximate. 


is only used as a language for knowledge representation (KR) 
and not for inference/learning. That is, formulas specify re- 
lationships/connections in the MLN graph between differ- 
ent entities in our dataset (e.g. problems, students, etc.) 
while the actual learning and predictions are performed by 
an LSTM model. Note that in theory, other forms of KR 
such as Bayesian networks, arithmetic circuits or probabilis- 
tic programs can be used. However, the benefit of MLNs 
is that they specify relationships over large data using com- 
pact first-order formulas. Next, we describe the formulas in 
our MLN followed by how we learn symmetries in the graph 
to train the LSTM more efficiently. 


4.1.1 MLN Structure 
Our first set of MLN formulas relates the KCs to the problem 
and the student solving the problem. 


$\mathrm{Student}(s) \wedge \mathrm{Problem}(p) \wedge \mathrm{PHierarchy}(p, h) \Rightarrow \mathrm{KC}(s, p, t, k)$


where s is a variable that denotes a student, p denotes a 
problem, t is a step, k denotes a knowledge component and h 
denotes the problem hierarchy which is the hierarchy of cur- 
riculum levels containing the problem, and PHierarchy(p,h) 
relates to the problem p in the hierarchy h where the hi- 
erarchy contains the curriculum unit name and the section 
name that the problem belongs to (e.g. Unit LCM, Section 
LCM-2). 


Next, we encode the homophily property where the same 
KC is likely to be reused by a student for problems that are 
related to each other through a common problem hierarchy. 


$\mathrm{Student}(s) \wedge \mathrm{Problem}(p_1) \wedge \mathrm{Problem}(p_2) \wedge \mathrm{PHierarchy}(p_1, h) \wedge \mathrm{PHierarchy}(p_2, h) \wedge \mathrm{KC}(s, p_1, t_1, k) \Rightarrow \mathrm{KC}(s, p_2, t_2, k)$


Next, we encode transitive dependencies between KCs by
relating KCs that occur close to each other. Specifically, this
encodes local dependencies across KCs (similar to an HMM
model).


$\mathrm{KC}(s, p, t-1, k_1) \wedge \mathrm{KC}(s, p, t, k_2) \Rightarrow \mathrm{KC}(s, p, t+1, k_3)$


4.2 Embeddings 


Given the formulas, we can define the MLN object graph 
by connecting objects that appear in the same ground formula.
For instance, consider the example MLN object graph
shown in Fig. 1 (a) with two students ($S_1$, $S_2$), two problems
($P_1$, $P_2$) and three knowledge components ($K_1$, $K_2$,
$K_3$). The edges indicate that in the graph corresponding
to the MLN, there is a connection that relates the corresponding
symbols. Note that $S_1$ works on $P_1$ and $S_2$ works
on $P_2$, where $P_1$ and $P_2$ are related since they correspond
to the same topic. Suppose both $S_1$ and $S_2$ use the same
strategy, $K_1, K_2, K_3$; then we can exchange $S_1$ with $S_2$ and
$P_1$ with $P_2$ to get an isomorphic graph structure. Now, suppose
$S_2$ uses a strategy $K_1, K_2', K_3$ that is slightly different
from the strategy of $S_1$, say, $K_1, K_2'', K_3$; then we obtain the
graph structure shown in Fig. 1 (b), where the new connections
are shown by dotted lines. Now, exchanging $S_1$ with
$S_2$ and $P_1$ with $P_2$ will not give us a graph structure that
is isomorphic to the original graph. However, suppose that
the knowledge components $K_2'$ and $K_2''$ are similar to each
other, i.e., there are many other problem instances where
students use the KCs $K_2'$ and $K_2''$ interchangeably; then
exchanging $S_1$, $P_1$ with $S_2$, $P_2$ will yield an approximation
that is still quite similar to the original graph. This means
that using $(S_1, P_1)$, it is reasonable to obtain a model that
can predict the strategy for $(S_2, P_2)$ and vice-versa. The
set $\{(S_1, P_1), (S_2, P_2)\}$ is therefore an equivalence class
consisting of approximately symmetrical instances. In general,
if we group together approximately symmetrical instances 
in our training data, then we can train our LSTM model 
with diverse instances by sampling these groups since each 
group represents data that is likely to have a similar effect 
in training the model. We do this by learning embeddings 
for nodes in the graph structure. 
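The exchange argument above can be checked on a toy object graph. In the sketch below, objects that co-occur in an instance are connected, and a relabelling that swaps the two students and the two problems is applied to the edge set. The Jaccard overlap used as a symmetry score is an illustrative choice introduced here, not a measure defined in the paper, and the labels `K2a` and `K2b` stand in for the two slightly different KCs.

```python
from typing import Dict, FrozenSet, Iterable, Sequence, Set, Tuple

Edge = FrozenSet[str]
Instance = Tuple[str, str, Sequence[str]]  # (student, problem, KCs used)


def object_graph(instances: Iterable[Instance]) -> Set[Edge]:
    """Connect objects that co-occur: student-problem, student-KC, problem-KC."""
    edges: Set[Edge] = set()
    for student, problem, kcs in instances:
        edges.add(frozenset((student, problem)))
        for kc in kcs:
            edges.add(frozenset((student, kc)))
            edges.add(frozenset((problem, kc)))
    return edges


def exchange(edges: Set[Edge], mapping: Dict[str, str]) -> Set[Edge]:
    """Relabel objects according to `mapping` (unmapped objects stay as-is)."""
    return {frozenset(mapping.get(o, o) for o in e) for e in edges}


def symmetry_score(edges: Set[Edge], mapping: Dict[str, str]) -> float:
    """Jaccard overlap between the graph and its relabelled copy (1.0 = exact)."""
    swapped = exchange(edges, mapping)
    return len(edges & swapped) / len(edges | swapped)


swap = {"S1": "S2", "S2": "S1", "P1": "P2", "P2": "P1"}
graph_a = object_graph([("S1", "P1", ["K1", "K2", "K3"]),
                        ("S2", "P2", ["K1", "K2", "K3"])])
graph_b = object_graph([("S1", "P1", ["K1", "K2a", "K3"]),
                        ("S2", "P2", ["K1", "K2b", "K3"])])
score_a = symmetry_score(graph_a, swap)
score_b = symmetry_score(graph_b, swap)
print(score_a, score_b)
```

Case (a) is an exact symmetry, while case (b) is only approximate. Note that this sketch builds the graph explicitly, which is precisely what becomes infeasible at scale.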


Unfortunately, the size of the graph becomes very large as 
we increase dataset size and it is practically infeasible to 
construct the graph structure explicitly. Though several ap- 
proaches have been developed that identify symmetries in 
MLNs using graph automorphism groups [27] with tools
such as saucy [6], these approaches generally work directly 
on the graph structure. Further, it is possible to infer sym- 
metries in graphs using other neural-network based methods 
such as Node2Vec [10] and Graph Convolutional Networks 
(GCNs) [19]. However, all these approaches work on general 
graphs and considering an MLN graph as a general graph 
is problematic since the graph becomes very large even for 
smallish datasets. This makes it hard to apply such ap- 
proaches to our strategy identification problem since we ex- 
pect to have an extremely large graph. Therefore, we instead 
use a recent, much more scalable approach specific to Markov
Logic graphs, called Obj2Vec [12], that detects symmetries
without explicitly constructing the graph. This approach
identifies approximate symmetries from the
neighborhoods of a node using a neural network, without
constructing the actual graph.


4.2.1 Obj2Vec 

Obj2Vec is inspired by skip-gram models [25] which are 
widely used to learn word embeddings. In skip-gram models, 
we learn an embedding for a word based on its context, i.e.,
the neighboring words that it typically appears with in text 
documents. For words which have similar contexts, we learn 
similar vector representations. Word2vec [25] is arguably 
the most popular skip-gram model, where we train a neural 
network for learning the embedding. Specifically, for each 
word w as input, the neural network learns to predict the 
context of w. Typically, the inputs and context words are
encoded as one-hot vectors. The hidden layer in the neural 
network typically has a much smaller number of dimensions 
as compared to the input/output layers. The hidden-layer 
learns a dense, low-dimensional embedding, where similar 
words have similar vector representations. This is because
words that are similar typically have similar contexts in text 
documents and therefore the neural network learns a similar 
representation in the hidden layer for such words. 


Obj2Vec extends the idea of word embeddings to MLNs. 
Specifically, recall that each ground formula of an MLN 
represents a clique in the MLN graph. For each activated 
ground formula, i.e., formula that represents a relationship 
that is supported by the data, we predict a symbol/object 
in the formula from other symbols/objects in that same for- 
mula. For example, suppose our data shows that Alice and 
Bob use the knowledge component Slope-Intercept across 
several problems. Then, all ground formulas that contain 
either Alice or Bob and the KC Slope-Intercept are acti- 
vated. In this case, both Alice and Bob are said to share a 
common context. Therefore, we predict the symbol Slope- 
Intercept from both Alice and Bob. That is, we have an 
autoencoder neural network where the input is a one-hot en- 
coding of Alice (or Bob) and the output is a one-hot encod- 
ing of Slope-Intercept. The neural network must therefore 
learn a common representation for both Alice and Bob since 
it needs to make similar predictions for both. Thus, suppose
the hidden-vector representation (or embedding) for Alice is
$v_{Alice}$ and that for Bob is $v_{Bob}$, then $v_{Alice} \approx v_{Bob}$. Note that
the embedding defines a continuous approximation of symmetries
in relationships specified in the MLN graph. That is,
the distance between the vectors $v_{Alice}$ and $v_{Bob}$ quantifies
the symmetry between Alice and Bob based on relationships 
specified in the data. 
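The idea of shared context can be illustrated without training a neural network. The sketch below derives (input object, context object) pairs from activated ground formulas, in the spirit of skip-gram training pairs, and then compares objects through their raw context counts. Using context counts with cosine similarity is a stand-in for the learned hidden-layer embedding and is an assumption of this example; all object names are hypothetical.

```python
import math
from collections import Counter
from itertools import permutations
from typing import Iterable, List, Sequence, Tuple


def training_pairs(groundings: Iterable[Sequence[str]]) -> List[Tuple[str, str]]:
    """Pair every object with each other object in the same activated formula."""
    pairs: List[Tuple[str, str]] = []
    for objects in groundings:
        pairs.extend(permutations(objects, 2))
    return pairs


def context_profile(obj: str, pairs: Iterable[Tuple[str, str]]) -> Counter:
    """Count how often each context object is predicted from `obj`."""
    return Counter(ctx for src, ctx in pairs if src == obj)


def cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[k] * b[k] for k in a)
    norm = (math.sqrt(sum(v * v for v in a.values()))
            * math.sqrt(sum(v * v for v in b.values())))
    return dot / norm if norm else 0.0


# Objects appearing together in activated ground formulas (toy data).
groundings = [("Alice", "P1", "Slope-Intercept"),
              ("Bob", "P2", "Slope-Intercept"),
              ("Carol", "P3", "Find-LCM")]
pairs = training_pairs(groundings)
alice, bob, carol = (context_profile(n, pairs) for n in ("Alice", "Bob", "Carol"))
print(cosine(alice, bob), cosine(alice, carol))
```

Alice and Bob share the Slope-Intercept context and therefore score higher than Alice and Carol, mirroring the intuition that objects with common contexts receive similar representations.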


4.3 Scalable Learning using Symmetries 

The embedding vectors from Obj2Vec encode relational
knowledge from the MLN graph. Given the embedding vectors,
we now train an LSTM to predict the student's strategy.
Specifically, let our input instances be $x_1 \ldots x_N$, where
each $x_i$ consists of embeddings for a specific student $s$ solving
problem $p$, and the outputs are $y_1 \ldots y_N$, where $y_i$ is a
sequence of KCs used by the student $s$ to solve the problem
$p$. The LSTM training objective is given by,


$$\theta^* = \arg\min_{\theta} \frac{1}{N} \sum_{i=1}^{N} \mathcal{L}(\psi(x_i, \theta), y_i) \qquad (1)$$


where $\theta^*$ and $\theta$ represent the parameters of the LSTM, $\mathcal{L}$ is
a loss function and $\psi(x_i, \theta)$ is the sequence of KCs output
by the LSTM parameterized by $\theta$ for input $x_i$. In general, a
stochastic gradient descent (SGD) procedure can be used to 
minimize the objective in Eq. (1). In SGD, we sample the 
training instances to approximate the gradient. Typically, 
SGD assumes that all training instances are equally impor- 
tant, and therefore samples them uniformly. That is, the 
probability of sampling a specific instance in the training
data is equal to $\frac{1}{N}$. However, this approach is expensive,
particularly if we repeatedly choose training instances
that are similar to each other. For example, suppose all the 
training instances that we sample are likely to encode similar 
strategies, then our model may take a long time to under- 
stand diverse strategies. Further, since the underlying data 
encodes symmetries (from the MLN graph), the information 
in one training instance may be very similar to the informa- 
tion in another symmetrical instance. Therefore, we force 
the model to learn from instances with diverse relationships 
by imposing an importance distribution over the training 
data. Specifically, training instances with larger importance 
are more likely to be chosen as compared to training in- 
stances with smaller importance. 


In general, to focus the training on more important data in- 
stances, we can modify the sampling distribution such that 
each instance is sampled with a non-uniform probability. 
This approach has been explored in prior work, where we 
scale up training by replacing the uniform distribution over 
the training instances with an importance distribution that 
quantifies how important a specific example is for the training
process [14]. Previous works such as [1, 13, 14] have
focused mainly on approximating importance as a function
of the gradient norm, which is hard to compute exactly. In
[14], therefore, the authors propose an approximation to the 
gradient norm and use this to target important training ex- 
amples. The focus in these approaches is to target the train- 
ing examples that are likely to induce changes when updat- 
ing the model parameters during backpropagation which can 
be shown to translate to a reduced variance in the gradient 
estimates. However, in our case, we have more information
a priori in the embeddings to identify the importance of a
training example in terms of its relationships. Specifically,
recall that the embeddings are based on symmetries in the 
MLN-graph which encodes relational knowledge. Thus, if 
two embeddings are similar, then it means that they share 
similar relationships. For example, if two student embed- 
dings are similar, then it is likely that for the problems both 
students have solved, their strategies use similar KCs. Thus, 
using embedding-similarities, our model focuses the training 
effort on instances that encode diverse relationships. 


4.3.1 Adaptive Importance Weighting 

Given the instances $\{x_i, y_i\}_{i=1}^{N}$, we cluster the instances using
K-Means clustering. Each instance internally has two
components, the student embedding as well as the problem
embedding. Since we want to exploit symmetries in both,
we cluster them along both dimensions. Let $\{C^s_i\}_{i=1}^{n_1}$ and
$\{C^p_i\}_{i=1}^{n_2}$ denote the clusters found by K-Means using the
student embeddings and the problem embeddings respectively,
where $n_1$ and $n_2$ are the number of clusters. We
now sample from each cluster to obtain a reduced set of
training examples. For each cluster, we assign an importance
weight to quantify how often we need to sample that
cluster. Let $q_i$ represent the importance weight of the i-th
Figure 2: Schematic illustration of our proposed approach.
We learn embeddings based on symmetries in the relational 
data. We then learn a one-to-many LSTM to map an in- 
put instance to a sequence that represents the strategy by 
focusing the training on a sample of the data that reflects 
symmetries in the data. 


cluster. Ideally, we want to quantify $q_i$ based on the symmetries
encoded within the cluster. Specifically, if the cluster
contains highly symmetric instances, then we require fewer 
training examples from the cluster. On the other hand, if 
the cluster contains diverse instances, we require more train- 
ing samples from that cluster. One way to quantify this is 
using traditional clustering metrics such as within-cluster 
sum of squared errors (SSE) to measure cohesion within the 
cluster. That is, if the embedding vectors are close to each 
other within a cluster which implies a small SSE, then it is 
indicative that we may need fewer samples from that clus- 
ter. However, this approach may not necessarily yield the 
best results since both the embeddings and the clustering are 
simply approximations to true symmetries in strategies. For 
example, suppose the embedding vector for Alice is close to 
the vector for Bob, this means that considering all problems, 
Alice and Bob are approximately symmetrical. Similarly, if 
the vector for problem $P_1$ is close to problem $P_2$, this means
considering all students, $P_1$ and $P_2$ are approximately similar.
However, this may not always necessarily imply that
the strategy followed by Alice for problem $P_1$ is guaranteed
to be symmetrical to the strategy followed by Bob for
$P_2$. Therefore, we use an adaptive approach to progressively
learn these symmetries. 
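For reference, the within-cluster SSE mentioned above as a cohesion measure can be computed as in the following sketch. The two-dimensional vectors are toy values, not real embeddings, and the function name is chosen for illustration.

```python
from typing import List, Sequence


def within_cluster_sse(vectors: Sequence[Sequence[float]]) -> float:
    """Sum of squared distances from each vector to the cluster centroid."""
    dim = len(vectors[0])
    centroid: List[float] = [sum(v[d] for v in vectors) / len(vectors)
                             for d in range(dim)]
    return sum((v[d] - centroid[d]) ** 2 for v in vectors for d in range(dim))


tight_cluster = [[0.0, 0.0], [0.0, 2.0]]   # centroid (0, 1)
loose_cluster = [[0.0, 0.0], [4.0, 0.0]]   # centroid (2, 0)
print(within_cluster_sse(tight_cluster), within_cluster_sse(loose_cluster))
```

A smaller SSE suggests a more cohesive cluster, although, as the text notes, cohesion in the embedding space does not guarantee that the underlying strategies are symmetrical.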


For training the model, we may require a varying number of
samples from each cluster. That is, from clusters representing
less symmetrical instances, we may require more samples
while from other clusters that contain more symmetrical
instances, we may require fewer samples. To account for this,
we adapt the sampling as follows. We define the initial importance
weight for the i-th cluster as $q_i^{(0)} = \frac{1}{K}$, where $K$ is
the number of clusters. Let $\theta^{(j)}$ be the parameters learned
by the LSTM in iteration $j$ using samples from the clusters
in iteration $j$. In iteration $j+1$, we update the importance
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weight for the clusters as, 
; fener ; : : 
41 i é 
ge? = — ST L(x? 0), yp”) (2) 
k=1 


where x() is a training example that consists of a randomly 
sampled instance from the 7-th cluster. For example, if we 
are updating the 7-th student cluster, x is arandomly sam- 
pled student s from this cluster combined with a problem 
p, where p is a problem that has been solved by s. Thus, if 
the LSTM in iteration j effectively encodes the symmetries 
in the 7-th cluster, then the loss in Eq. (2) is likely to be 
small. Thus, git) is small. On the other hand, if gir) is 
large, then we need more samples from the i-th cluster since 
the LSTM trained using the samples collected until iteration 
j does not effectively model the instances in the i-th clus- 
ter. The importance weights for the i-th cluster adaptively 
change from qh gs? shah gq. Note that each iteration adds 
training data to the LSTM and thus effectively increases 
training time. However, we do not begin the training from 
scratch in each iteration. Specifically, for iteration 7 +1, we 
consider the LSTM learned until iteration 7 as the starting 
point. This is similar to pre-training that is common in deep 
network training. For iteration j +1, the 09 ) represents the 
pre-trained parameters of the LSTM. We stop adapting the 
weights after a fixed number of iterations depending based 
on a cutoff for the training time. Note that more advanced 
convergence criteria can be explored here which is a part of 
our future work. To summarize, Fig. 2 shows a schematic 
representation of our overall model. 
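
The adaptive reweighting described in this section can be illustrated with a minimal Python sketch. It is not our actual implementation: `loss_fn`, `num_probe`, and the proportional allocation in `allocate_samples` are names and choices assumed here for illustration only, and the trained LSTM is abstracted as a function that returns a loss for one example.

```python
import random


def initial_weights(num_clusters):
    """Uniform starting weights, q_i^(0) = 1/K."""
    return [1.0 / num_clusters] * num_clusters


def update_weights(clusters, loss_fn, num_probe, rng):
    """Set each cluster's weight to the mean loss of the current model
    on num_probe examples drawn at random from that cluster (cf. Eq. (2)).

    clusters: list of lists of (x, y) examples, one list per cluster.
    loss_fn:  hypothetical callable (x, y) -> float that wraps the model
              trained up to the current iteration.
    """
    weights = []
    for members in clusters:
        probe = [rng.choice(members) for _ in range(num_probe)]
        weights.append(sum(loss_fn(x, y) for x, y in probe) / num_probe)
    return weights


def allocate_samples(weights, budget):
    """Assumed policy: split a sampling budget in proportion to the weights."""
    total = sum(weights)
    if total == 0:
        # No cluster looks harder than another, so sample evenly.
        return [budget // len(weights)] * len(weights)
    return [round(budget * w / total) for w in weights]
```

Under this sketch, a cluster on which the current model has a high loss receives a larger share of the next round of training samples, which matches the intent of the weights above; how the weights are turned into sample counts is an assumption of the sketch.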


5. EXPERIMENTS 


5.1 Setup 

We evaluate our approach on the publicly available KDD EDM
challenge datasets, Algebra 2008, 2009 and Bridge to Algebra
2008, 2009 [34], which consist of data collected from the Mathia
platform. Each instance consists of several discrete steps, and
each step is mapped to a knowledge component (KC) that is used to
solve that step. The statistics for the two datasets are shown in
Table 1. As shown in the table, these datasets are quite large,
with roughly 840K and 1.6M instances respectively. All our
experiments were performed on a machine with 64GB of memory, an
NVIDIA GPU, and an Intel Core i9 processor. To compute accuracy,
for each input instance, we compute the percentage of total steps
where the true KC matches the predicted KC. The overall accuracy
is computed as the average accuracy across all instances. To
measure the variance of our estimates, for each of the results
shown, we run the experiments 10 times and compute the mean and
the standard deviation of the accuracy. Next, we describe the
implementation of the different approaches that we use in our
experiments. The code and data for our implementations are
available here¹.
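
The accuracy measure described above can be sketched in a few lines of Python. The function names are hypothetical, and two details are assumptions of the sketch rather than statements about our code: predicted steps missing from a shorter prediction count as errors, and the standard deviation is the sample standard deviation.

```python
from statistics import mean, stdev


def instance_accuracy(true_kcs, predicted_kcs):
    """Percentage of steps in one instance whose predicted KC matches the true KC.

    Assumes the instance has at least one step. Steps with no prediction
    (a prediction shorter than the truth) are counted as mismatches.
    """
    matches = sum(t == p for t, p in zip(true_kcs, predicted_kcs))
    return 100.0 * matches / len(true_kcs)


def overall_accuracy(instances):
    """Average of the per-instance accuracies over (true, predicted) pairs."""
    return mean(instance_accuracy(t, p) for t, p in instances)


def summarize_runs(run_accuracies):
    """Mean and sample standard deviation over repeated runs (e.g. 10 runs)."""
    return mean(run_accuracies), stdev(run_accuracies)
```

Because accuracy is averaged per instance rather than per step, an instance with few steps contributes as much to the overall score as one with many steps.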


5.1.1 Hidden Markov Model 

We trained a Gaussian Hidden Markov Model (we refer to this as
HMM) using the sklearn package in Python. We set the number of
hidden states to 100 and initialized the Gaussian emission
probabilities with a full covariance matrix so that the model has
the flexibility to generate varied sequences. We trained the
model using the EM algorithm.


¹https://github.com/anupshakya07/SSPM


Dataset                     | Total Instances | No. of Students | No. of Problems | No. of unique KCs
algebra_2008_2009           | 838728          | 3310            | 188368          | 541
bridge_to_algebra_2008_2009 | 1624951         | 6043            | 52754           | 933


Table 1: Details of the datasets.


Learning-rate | Optimizer          | Batch-size | Dropout-rate | LSTM (hidden state) | Obj2Vec embedding
0.001         | Adam with CCE-Loss | 100        | 0.3          | 200 dimensions      | 300 dimensions


Table 2: Parameters for training.


5.1.2 LSTM and Neuro-symbolic Models


We implemented a one-to-many LSTM using TensorFlow and Keras. For
the pure (or vanilla) LSTM, we encode inputs as one-hot-encoded
vectors representing the studentID, problemHierarchy and
problemName. For the Neuro-symbolic model, we vectorize each
instance using a publicly available implementation of Obj2Vec
[12], with the MLN formulas specified in the previous section.
Obj2Vec internally uses Gensim [28] to compute the embeddings for
each symbol in the MLN. We use the embedding vectors of
studentID, problemHierarchy and problemName as input to the LSTM
encoder. Special Start and End tokens are used in the decoder
section of the LSTM to identify the start and end of a
prediction. The decoder unit predicts the KC at each time step
until an End token is found. To train the models in a feasible
manner, we used a timeout of 3 hours. Within this timeout period,
it was infeasible to use all the instances, since the training
for the LSTM did not converge. Therefore, we randomly sampled
instances to train our model within the specified limit. We refer
to the models trained using random sampling as LSTM-Random and
LSTM-NS-Random for the vanilla LSTM and the Neuro-symbolic models
respectively. We further implemented stratified-sampling
(group-based) training on students and problems. For sampling by
student, we selected N students from the student pool and, for
each selected student, sampled M problems solved by that student.
For sampling by problem, we selected N problems from the problem
pool and, for each selected problem, sampled M students who have
solved it. By increasing the values of M and N, we progressively
increased the number of instances, as we show later in the
results section. We refer to this as LSTM-NS-NaiveGroup.
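
The group-based sampling by student can be sketched as follows. This is a minimal illustration, not our released code: the `solved` mapping from a student to the problems that student solved, and the function name, are hypothetical.

```python
import random


def sample_by_student(solved, n_students, m_problems, rng):
    """Pick N students, then up to M problems solved by each picked student.

    solved: hypothetical dict mapping a student id to the list of problems
            that student has solved.
    Returns a list of (student, problem) training pairs.
    """
    students = rng.sample(sorted(solved), min(n_students, len(solved)))
    pairs = []
    for student in students:
        problems = solved[student]
        # A student with fewer than M solved problems contributes all of them.
        for problem in rng.sample(problems, min(m_problems, len(problems))):
            pairs.append((student, problem))
    return pairs
```

Sampling by problem is symmetric: swap the roles of students and problems in the mapping. Increasing N and M grows the training set up to N times M pairs.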


5.1.3 Adaptive Training 

[Figure 3: line plots of accuracy for LSTM-NS-Random,
LSTM-NS-NaiveGroup, LSTM-NS-Clustered, and LSTM-NS-Adaptive.
Panels (a) Train, (b) Test, (e) Train, and (f) Test plot accuracy
against Number of Instances; panels (c) Train, (d) Test, (g)
Train, and (h) Test plot accuracy against Training Time (secs).]

Figure 3: Accuracy results. The mean accuracy is plotted in the
graphs and the shaded portions show the standard deviation.
(a)-(d) correspond to the Bridge to Algebra 2008, 2009 results
and (e)-(h) correspond to the Algebra 2008, 2009 results.

We implemented K-Means clustering to cluster the data based on
the embeddings and sampled from these clusters. We implemented a
non-adaptive training model as follows. We independently
clustered the students and the problems to generate $C_1$ student
clusters and $C_2$ problem clusters. We then selected, from each
student cluster and each problem cluster, the student or problem
nearest to the cluster center, to create a training set of at
most $C_1 \times C_2$ instances that effectively covers all
instances in our training data. We increase the number of
clusters progressively, starting from 100 student clusters and
1000 problem clusters, to increase the number of instances in
training, as we show later in our results. We refer to this
approach as LSTM-NS-Clustered. For our proposed adaptive
weighting approach, we sample each cluster according to an
importance weight. In each iteration, we update the importance
weight of a cluster based on predictions made on a randomly
sampled set of instances from that cluster that were not used in
training. Here, we used 100 student clusters and 1000 problem
clusters and changed the importance weights of the clusters
adaptively. We refer to this approach as LSTM-NS-Adaptive. The
parameters for training our models are shown in Table 2.
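
The non-adaptive clustered selection can be sketched as below, assuming the cluster labels and centers have already been computed (for example by K-Means). The function names, the dictionary layout, and the filter on a `solved` set of (student, problem) pairs are assumptions made for this illustration; the filter is one way the training set can end up with fewer than $C_1 \times C_2$ instances.

```python
import math


def nearest_to_center(embeddings, labels, centers):
    """For each cluster, return the item whose embedding is closest to its center.

    embeddings: dict item -> vector (tuple of floats)
    labels:     dict item -> cluster id
    centers:    dict cluster id -> center vector
    """
    best = {}
    for item, vector in embeddings.items():
        cluster = labels[item]
        distance = math.dist(vector, centers[cluster])
        if cluster not in best or distance < best[cluster][0]:
            best[cluster] = (distance, item)
    return {cluster: item for cluster, (_, item) in best.items()}


def clustered_training_set(student_reps, problem_reps, solved):
    """Pair representative students with representative problems.

    solved is a hypothetical set of (student, problem) pairs that exist in
    the data; pairs that were never observed are skipped.
    """
    return [(s, p) for s in student_reps for p in problem_reps if (s, p) in solved]
```

For instance, with 100 student representatives and 1000 problem representatives, `clustered_training_set` returns at most 100,000 pairs.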


5.2 Results 


Fig. 3 compares the accuracy of the different approaches. Fig. 3
(a) and (b) show the training and test accuracy, respectively, as
we vary the number of instances used in training the models for
the Bridge to Algebra 2008, 2009 dataset. As we can see from
these plots, LSTM-NS-Adaptive obtains the highest accuracy
compared to the other methods. HMM (not shown in the figures)
gave us an accuracy of less than 5%. This low accuracy indicates
that the strategies for diverse (or asymmetric) groups of
students (or problems) cannot be represented by the same
transition matrix. One possible approach that we will explore in
the future is to integrate our approach with HMMs, i.e., to learn
an ensemble of HMMs where each HMM learns strategies for a
symmetric group. A naive LSTM without embedding vectors, i.e.,
where inputs are a simple one-hot encoding of students and
problems, also yields poor accuracy (less than 10%). This
illustrates the importance of the latent features in the
embedding vectors.


Note that we have only shown the results for the best performing
model, which has a latent dimension of 200 (experiments were
carried out with different dimensions). As shown in Fig. 3 (a)
and (b), LSTM-NS-Adaptive and LSTM-NS-Clustered require a small
fraction of the number of instances to obtain accuracy that is
higher than that of LSTM-NS-Random and LSTM-NS-NaiveGroup.
Further, the variance in accuracy of LSTM-NS-Random and
LSTM-NS-NaiveGroup, shown by the shaded portion around the line
plots, is much higher than the variance of LSTM-NS-Clustered and
LSTM-NS-Adaptive. LSTM-NS-Adaptive also obtains higher accuracy
than LSTM-NS-Clustered as we adapt the weighting. However, note
that the test accuracy starts to dip after the accuracy hits a
peak value for LSTM-NS-Adaptive. This is because the adaptive
cluster weights may focus too strongly on certain clusters in the
training data, which causes the LSTM to overfit. Therefore, in
practice, we can stop the adaptation based on a validation set.
Further, we can clearly see that exploiting symmetries in
training leads to better generalization in Fig. 3 (b), where we
see a significant difference between the accuracy for
LSTM-NS-Adaptive and LSTM-NS-Clustered as compared to
LSTM-NS-NaiveGroup and LSTM-NS-Random at smaller training sizes.
Fig. 3 (c) and (d) also show that using relational symmetries in
training results in a significant improvement in scalability: we
can train LSTM-NS-Adaptive and LSTM-NS-Clustered in around half
an hour to achieve an accuracy that is higher than the accuracy
we obtain even after around 3 hours of training time with
approaches where we do not choose training examples based on
symmetries.


The results for the Algebra 2008, 2009 dataset are similar to the
ones for Bridge to Algebra 2008, 2009. As seen in Fig. 3 (e) and
(f), the LSTM-NS-Adaptive model is the best performing model in
terms of accuracy, and it uses a small number of training
instances to achieve this accuracy. Similar to the previous
results, the variance for LSTM-NS-NaiveGroup and LSTM-NS-Random
is much larger than that for LSTM-NS-Adaptive and
LSTM-NS-Clustered. The training times shown in Fig. 3 (g) and (h)
follow a similar pattern, where LSTM-NS-Adaptive and
LSTM-NS-Clustered can achieve high accuracy scores even with
short training times since they take advantage of relational
structure in the data.


Figure 4: t-SNE visualization of strategies. The hidden layer of
the final step in the LSTM is visualized for 100 test problems
over all students. t-SNE reduces the latent LSTM vector to 2-D
for visualization. Data points close together correspond to
approximately similar strategies.


Table 3 shows the accuracy of predicting strategies for different
problem units. Specifically, each problem in the dataset
corresponds to a specific unit, and we evaluate the models by
testing the trained model on problems specific to a unit. For
lack of space, we have not provided an exhaustive set of results
for all units, since there were around 50 units in the dataset.
Instead, we provide accuracy results for the 10 units with the
largest number of data instances from the Bridge to Algebra 2008,
2009 dataset. We see that on the majority of the units,
LSTM-NS-Adaptive has the best accuracy score. LSTM-NS-Clustered
is the next best performing method. The difference in accuracy
between units was significant in some cases. For instance,
LSTM-NS-Adaptive had a very high accuracy for the unit
PERCENT-CONVERSION but a much lower accuracy for
ONE-STEP-EQUATIONS and TWO-STEP-EQUATIONS. This may be due to the
higher complexity involved in solving equations as compared to
problems involving percent conversion, which may add to the
uncertainty in predicting strategies.


Unit                       | LSTM-NS-Random (%) | LSTM-NS-NaiveGroup (%) | LSTM-NS-Clustered (%) | LSTM-NS-Adaptive (%)
PROBABILITY                | 67.7               | 53.48                  | TTAB                  | 70.73
FRACTION-OPERATIONS-1      | 74.95              | 78.23                  | 84.02                 | 90.49
PERCENT-CONVERSION         | 99.01              | 99.18                  | 88.6                  | 99.5
SCIENTIFIC-NOTATION        | 66.67              | 69.39                  | 69.9                  | 70.16
PICTURE-ALGEBRA-2          | 90.23              | 96.17                  | 93.29                 | 95.17
RATIONAL-NUMBER-OPERATIONS | 30.76              | 41.85                  | 40.28                 | 66.13
INTEGERS                   | 87.47              | 78.83                  | 93.52                 | 96.78
ORDER-OF-OPERATIONS        | 60.03              | 50.35                  | 45.53                 | 69.43
ONE-STEP-EQUATIONS         | 66.6               | 53.36                  | 58.86                 | 58.06
TWO-STEP-EQUATIONS         | 63.13              | 63.03                  | 59.42                 | 61.5


Table 3: Comparing accuracy on test sets corresponding to different units in Bridge to Algebra 2008, 2009.


5.2.1 Structure in Strategies

Finally, Fig. 4 shows a visualization of the strategies through a
t-SNE plot. Specifically, we wanted to analyze whether there are
true patterns in the strategies. To do this, we use the
LSTM-NS-Adaptive model to predict the strategy for 100 problems
across all students. We show the results for Bridge to Algebra
2008, 2009. We use the hidden vector from the last step of the
LSTM as a representation of the strategy for a specific input
instance. That is, this vector encodes information summarizing
all the steps (or the full strategy) that the student performs
when solving the input instance. We then plot these vectors using
t-SNE, which reduces the high-dimensional representation to a
2-D representation. Fig. 4 shows that there is a separation of
different groups of strategies. The presence of such clusters of
strategies indicates that there are indeed structures in student
strategies and that the representation learned by
LSTM-NS-Adaptive discovers these symmetric structures.


[Figure 5: two sequences of steps.
Strategy 1: Identify number of items | Identify number of recipients | Identify proper fraction from option 1 | Identify number of items | Identify number of recipients | Identify proper fraction from option 2
Strategy 2: Identify number of items | Identify number of recipients | Identify improper fraction from option 1 | Identify number of items | Identify number of recipients | Identify improper fraction from option 2]

Figure 5: Strategies for different problems that have similar
vector representations.


Fig. 5 and 6 show two examples where the vectors representing the
strategies are close to each other. Fig. 5 shows a case where the
two strategies correspond to two different problems, one of which
is related to proper fractions and the other to improper
fractions. However, at a high level, these strategies are
similar, which is reflected in their similar vector
representations. Fig. 6 shows another case, where the two partial
strategies correspond to the same problem and are inversions of
each other. That is, some of the steps in the two cases are
performed in opposite orders. However, at a high level, the
strategies have symmetry, which is reflected in their vector
representations.


[Figure 6: two sequences of steps.
Strategy 1: Find Y, positive slope | Find Y, any form | Entering a given | Enter given, reading words | Using small numbers | Using simple numbers | Find Y, positive slope | Find Y, any form
Strategy 2: Using small numbers | Using simple numbers | Find Y, positive slope | Find Y, any form | Entering a given | Enter given, reading words | Using small numbers | Using simple numbers]

Figure 6: Strategies for the same problem that have similar
vector representations.


6. CONCLUSION 


Predicting student strategies in problem solving can make AISs
more engaging to students, since the system can adapt itself to
suit the student’s strategy. In this paper, we described a
Machine Learning approach to predict student strategies from
large scale, structured student interaction data. Specifically,
we adopted a Neuro-Symbolic approach, i.e., we combined LSTMs
with a relational symbolic model to perform learning more
efficiently. To do this, we encoded relationships in the data in
the language of Markov Logic and, based on relational symmetries
in the data, we picked training instances that are diverse. Doing
this allowed us to train our model to recognize diverse
strategies at a smaller computational cost. Our evaluation on the
KDD EDM challenge datasets shows that our approach generalizes
better and has significantly smaller training times as compared
to approaches that do not exploit relational symmetries during
learning. In the future, we will extend our approach to datasets
with finer-grained learner information and also develop joint
inference models connecting mastery and strategies.
ABSTRACT


Open-ended questions in mathematics are commonly used 
by teachers to monitor and assess students’ deeper concep- 
tual understanding of content. Student answers to these 
types of questions often exhibit a combination of language, 
drawn diagrams and tables, and mathematical formulas and 
expressions that supply teachers with insight into the pro- 
cesses and strategies adopted by students in formulating 
their responses. While these student responses help to in- 
form teachers on their students’ progress and understand- 
ing, the amount of variation in these responses can make it 
difficult and time-consuming for teachers to manually read, 
assess, and provide feedback on student work. For this reason,
there has been a growing body of research in developing
AI-powered tools to support teachers in this task.
This work seeks to build upon this prior research by in- 
troducing a model that is designed to help automate the 
assessment of student responses to open-ended questions 
in mathematics through sentence-level semantic represen- 
tations. We find that this model outperforms previously- 
published benchmarks across three different metrics. With 
this model, we conduct an error analysis to examine char- 
acteristics of student responses that may be considered to 
further improve the method. 


Keywords 
Open responses, Automated scoring, Natural Language Pro- 
cessing, Sentence-BERT, Mathematics 


1. INTRODUCTION 


In many K-12 mathematics classrooms, teachers have come 
to rely on the use of open-ended questions to assess their 
students’ knowledge and understanding of assigned content. 
Unlike close-ended problems, where there is a single or finite
number of accepted answers (e.g. a multiple-choice question),
open-ended questions allow students to justify and express their
thinking processes through language; it is common that students
may combine language, images, tables, or other mathematical
expressions, equations, and terminologies to illustrate their
knowledge and understanding of the material.


While the use of open-ended questions is not found only in
mathematical contexts, aspects of this domain make it
particularly difficult to develop teacher supports for these
types of questions. Within computer-based learning platforms,
research across fields of study has led to the development of a
multitude of teacher-augmentation tools [1] and methodologies
that leverage machine learning techniques. Among these supports,
automated methods have been developed and deployed to help
teachers assess student essays and short answers in several
domains [25, 2, 3, 15]. As was highlighted in [9], the arduous
task of manually assessing and providing feedback on student
open-ended work may explain the decline in open-ended questions
assigned over the course of a school year (e.g. Figure 1, which
shows the number of open response questions assigned within the
ASSISTments learning platform, aggregated over the last 10
years). In addition to this decline, as was also reported in [9],
very few student responses to open-ended questions are ever
scored by the teacher, with even fewer ever receiving feedback.
Figure 2 illustrates this, and also plots these values from
February through October of 2020, during COVID-19-induced remote
learning.


There are several notable challenges in developing automated
supports to help teachers assess student open-ended work. It is
also the case that student responses to open-ended questions
differ between mathematical and non-mathematical domains. One
such difference, for example, is that in many non-mathematical
domains such as history or language arts, student “open-ended”
essays and short answers are often comprised of multiple
sentences and paragraphs [21, 25, 5, 8], whereas in mathematics,
responses are generally shorter (perhaps one or two, often
incomplete, sentences) [14, 9] and combine language with
mathematical symbols, expressions, or other visuals. Aside from
these response-level characteristics, however, several other
student-, problem-, and even teacher-level factors can make the
development of these automated supports more challenging;
consider, for example, the variation in how teachers approach the
assessment of student answers, using different inherent rubrics
and pedagogical philosophies [15, 17, 22, 23].


Figure 1: The number of open response problems assigned over the
course of a school year with the ASSISTments learning platform,
aggregated from 2010-2020.


While the examination of student answers to open-ended questions
poses challenges in developing automated assessment supports for
teachers, prior work has shown promise in this context [9]. In
that work, the authors explore several machine learning and
natural language processing (NLP) methods to predict
teacher-provided scores for open-ended problems, offering an
evaluation method and benchmark of comparison for similar
methods¹.


In this paper, we build upon prior research presented in [9] to
develop and evaluate an automated assessment model of student
open responses in mathematics. We add to the existing models a
modeling approach that uses a sentence-level semantic
representation of the student open responses through
Sentence-BERT (SBERT; [20]), using a novel reformulation of the
“score prediction” problem. We compare our method to the
previously developed scoring models from [9], and subsequently
apply an exploratory error analysis to identify areas of
improvement that may be addressed by future iterations of these
methods. Toward this, we seek to address the following research
questions:


1. How does a model utilizing Sentence-BERT compare to previously
developed approaches in predicting teacher-given assessment
scores for student responses to open-ended problems?


2. What are the characteristics of student answers that correlate
with errors observed in our Sentence-BERT model?


3. Which of student-, problem-, or teacher-level characteristics
most explain the variance of error observed when the model is
applied in real learning environments?


¹The data and evaluation code from [9] were used in this work
with permission from the original authors and in compliance with
IRB.


Figure 2: The percent of student open-response answers that were
scored and given written feedback by a teacher before and during
remote learning in response to COVID-19.


2. BACKGROUND 


There have been several prior works related to the automated
scoring of open-ended responses. Most of these works utilize a
combination of Natural Language Processing (NLP) and machine
learning techniques of varying complexity to process open-ended
responses. Much of the existing work in this area has been
applied in the context of non-mathematical content. C-rater [15]
is a well-cited approach that uses such methodologies to estimate
the assessed correctness of answers to short answer questions.
This method uses grading rubrics and breaks down scores into
multiple knowledge components to evaluate each student response.
Other works [2, 3] have implemented clustering techniques to
grade short textual answers to questions. More recently, studies
have based their approach on deep learning methods, which have
led to promising improvements over previously benchmarked results
[21, 25]. While most of these works have been on non-mathematical
domains, studies like [14] explore mathematical language
processing using clustering techniques and bag-of-words
approaches for automated assessment of open-ended responses in
mathematics. However, this study only considers the mathematical
content, discarding the non-mathematical text.


[Figure 3 shows the following question, accompanied by diagrams
labeled Elena, Lin, and Noah and an answer box labeled "Type your
answer below": "Elena, Lin, and Noah all found the area of
Triangle Q to be 14 square units but reasoned about it
differently, as shown in the diagrams. Explain at least one
student's way of thinking and why his or her answer is correct."
(copied for free from openupresources.org)]

Figure 3: Example of an open ended question taken from
openupresources.org


Many of these more recent studies have utilized publicly released
embedding methods trained on large corpora of data, including
Word2Vec [18] and GloVe [19], to model the semantic meaning of
words. However, word embeddings capture limited information about
the semantics of a sentence, where the sequence of words may have
a large impact on the interpreted meaning. To capture the
contextual information within sentences and further increase the
generalization capabilities of NLP embedding methods, techniques
such as Universal Sentence Encoders [4] and Sentence-BERT [20]
generate a single embedding that is designed to be representative
of the entire sentence while preserving the semantic and
contextual information of the words within it.
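
One way to see why word order is lost is to average word vectors into a sentence vector, a common baseline that is assumed here for illustration and is not the method evaluated in this work. In the minimal Python sketch below, the two-dimensional vectors in `TOY_VECTORS` are made up and do not come from Word2Vec or GloVe.

```python
# Hypothetical 2-D word vectors, invented for this illustration only.
TOY_VECTORS = {
    "dog": (1.0, 0.0),
    "bites": (0.0, 1.0),
    "man": (0.5, 0.5),
}


def mean_embedding(sentence, vectors):
    """Average the word vectors of a sentence, ignoring word order."""
    tokens = sentence.lower().split()
    dims = len(next(iter(vectors.values())))
    return tuple(
        sum(vectors[token][d] for token in tokens) / len(tokens)
        for d in range(dims)
    )


# Two sentences with different meanings receive the same averaged vector.
same = mean_embedding("dog bites man", TOY_VECTORS) == mean_embedding(
    "man bites dog", TOY_VECTORS
)
```

Sentence-level encoders are designed to avoid this collapse by producing an embedding that depends on the words in context rather than on an unordered collection of word vectors.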


One of the most commonly used NLP embedding methods in recent
years has been Bidirectional Encoder Representations from
Transformers (BERT, [7]). Building upon and distinguishing itself
from other methods such as GloVe, the BERT method is designed to
incorporate contextual information into generated embeddings to
distinguish words that may have the same spelling but different
meanings depending on usage (e.g. the word “bank” referring
either to a financial institution or perhaps to a slope of land
near a river); BERT has been shown to outperform many other
approaches in a number of NLP tasks including, as is important
for this work, semantic textual similarity (STS) [7].
Sentence-BERT, or SBERT [20], modifies the pre-trained BERT
network to reduce the computational overhead of BERT and to
generate a sentence-level embedding of a given series of words.


2.1 A Benchmark Comparison 

In this work, we explore the use of the SBERT method to build
upon the prior benchmark set in Erickson et al., 2020 [9] in
assessing student answers to open-ended problems in mathematics.
In that work, the authors discuss the challenges in developing
models to predict teacher-assigned grades for student open
responses in mathematics, using a dataset of authentic student
responses within the ASSISTments [11] learning platform. Erickson
et al. compare six models utilizing machine learning (e.g. random
forest and XGBoost [6]) and more complex deep learning (e.g.
LSTMs [12]) techniques, combined with natural language processing
algorithms, to assess responses that are combinations of
mathematical expressions and non-mathematical text. For feature
extraction from the open response data, the study uses the
Stanford Tokenizer [16] combined with Global Vectors for Word
Representation (GloVe) [19].


3. METHODOLOGY 


In this study, we build upon the work of [9] to develop and 
evaluate an automated scoring model based on the SBERT 
methodology; as will be detailed further, we refer to this 
model as the SBERT-Canberra model throughout the re- 
mainder of this work. Then, in a secondary analysis, we 
utilize real data collected from a pilot study of our model 
running within a computer-based tool that provides teach- 
ers with suggested scores to explore the limitations of our 
approach through an exploratory error analysis. Our data 
and approach to these analyses are described in this section. 


3.1 Dataset 


In this work, we utilize two datasets² of student answers to
open-ended questions paired with teacher-provided assessment
scores. An example of one of these open-ended mathematics
questions is shown in Figure 3. In this example, students are not
asked to find the area of the triangles, but rather to explain in
their own words how one of the figures illustrates an approach to
solving the problem.


For the development of our SBERT-Canberra model, we use the
dataset (and evaluation code) from the Erickson et al. study [9].
This dataset is comprised of student answers to open response
questions within the ASSISTments [11] online learning platform;
the dataset consists of 150,477 total student responses from
27,199 unique students to 2,076 unique problems graded by 970
unique teachers. As was performed in [9], we omit any case where
a student response contained no characters (e.g. an empty
response or one containing only whitespace characters) or
contained nothing but an image (cases where an image was
accompanied by other text or non-whitespace characters are not
omitted). The removal of such empty responses reduced the dataset
to 141,612 graded student responses, 25,069 unique students,
2,042 unique problems, and 891 unique teachers. Within this data,
each response is accompanied by a teacher-provided assessment
score that follows an integer ordinal 5-point scale from 0-4; a
“4” here is synonymous with a student receiving a 100% for the
response.
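
The filtering rule above can be sketched in Python as follows. This is an illustration only: it assumes, hypothetically, that responses are stored as strings in which an image appears as an HTML `<img>` tag, which may not match the actual storage format, and the function names are invented for this sketch.

```python
import re

# Assumption: images appear in a response as HTML <img ...> tags.
IMG_TAG = re.compile(r"<img\b[^>]*>", re.IGNORECASE)


def has_usable_text(response):
    """True if the response contains any non-whitespace text besides images."""
    without_images = IMG_TAG.sub("", response or "")
    return bool(without_images.strip())


def score_to_percent(score):
    """Map the 0-4 ordinal score to a percentage (4 corresponds to 100%)."""
    return 100.0 * score / 4
```

Under these assumptions, a response consisting only of an image is dropped, while an image accompanied by any text is kept.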


Table 1 lists several student answers contained within the 
dataset, chosen from across multiple problems for illustra- 
tive purposes. As was noted in the introduction, these re- 
sponses highlight some of the challenges of this modeling 


?The data and code used in this work cannot be publicly 
posted due to the potential existence of personally identi- 
fiable information contained within student open response 
answers. In support of open science, this may be sharable 
through an IRB approval process. Inqueries should be di- 
rected to the trailing author of this work. 
Table 1: Sample student responses (selected from across multiple problems for illustrative purposes) and the teacher-provided 
scores on a scale of 0 to 4 to the open-ended questions in mathematics. 


Sample Response | Score 
y=4x-2 | 4 
I counted | 4 
I multiply -3 and 2x | 2 
diagram is on paper | 3 
Yes Because Y=mx+b | 0 
I got 2/9 by dividing by 4 | 
I was not in class for this so I don’t know. | 1 
I went multiplication first then division then multiplication | 3 
I got this by doing 45/75. I knew that 75 + 75 = 150 and 150 goes into 450 3 times and 3 x 2 = 6. So the answer is 6. | 4 
You would need an example and then you would need to draw a line and find out far away your shape is from the line and mark it and then do that on the rest of your lines on the shape | 4 
The distributive property means that a number outside a set of parentheses can be multiplied by each of the numbers within the parentheses and the answer will be the same. It works because it would be the same as multiplying each number by the number outside the parentheses and then adding them together. | 1 


task. First, the length of responses varies greatly between 
students as well as across problems. In addition to this, the 
interleaving of mathematics and linguistic text likely makes 
it difficult for pre-trained embedding models to interpret. 
Similarly, the variation in mathematical representation (e.g. 
the use of the term “dividing” rather than the “/” operator) 
may lead to confusion in a machine learning model trained 
over such data. As the mathematical variables are also 
represented by recognized English characters (e.g. “y”), it may 
be difficult to derive semantic meaning for such tokens. It is 
for this reason that we hypothesize that a context-based 
embedding approach, such as BERT or SBERT, may be 
superior to traditional embedding methods that do not 
account for context within the sentence. Finally, the noise in 
ground truth labels becomes evident from the table. The 
student who answered “I counted” but still received full credit, 
for example, exemplifies that some teachers may score stu- 
dents based on completion or other factors unrelated to their 
demonstration of understanding or mastery. This is not to 
say that any one scoring method is more correct or valid 
than another, but rather that there is likely large variation 
in these labels, making it difficult for machine learning mod- 
els to effectively learn associations between student answers 
and these scores in some cases. 


The second dataset used in this work is comprised of stu- 
dent responses collected during the pilot testing of a teacher- 
augmentation tool designed to aid in the assessment of stu- 
dent open response answers within ASSISTments [11]. This 
tool, called QUICK-Comments, used our developed model 
to predict the scores of student answers to open response 
questions in mathematics. Models were trained over the 
same open educational resource (OER) curricula from which 
the problems used in the first dataset were collected and 
produce estimates using the same grading scale as the first 
study. During the pilot study, 12 middle school mathematics 
teachers were given access to the tool and compensated for 
their time to assign, assess, and provide feedback to student 


open ended work during the Spring and Fall of 2020. This 
dataset consists of 30,371 graded student open responses to 
915 unique open response problems solved by 1,628 unique 
students. 


3.2 SBERT-Canberra Model 


The model developed for this work follows a 2-stage process 
to generate estimates of teacher-assigned scores for a set of 
given student answers. In approaching this model, we pro- 
pose a reframing of the initial problem. In [9], the problem 
was posed as a traditional supervised learning problem; in 
other words, given a set of student answers A, train a model 
$f(\cdot)$ such that $Y = f(A)$. Instead, we propose a more 
unsupervised approach as depicted in Figure 4. If we have a set 
of historic answers $A_0, \ldots, A_{n-1}$ and want to predict the score 
of a new answer $A_n$, a logical choice of score may be that 
corresponding with the historic answer that is most similar 
to the new answer $A_n$. In this way, the problem is posed as a 
similarity ranking problem rather than a supervised learning 
problem. 


There are several potential advantages to this approach. 
First, when utilizing a pre-trained model of SBERT, de- 
scribed in Section 2, no actual model training is necessary (so 
long as a reasonable distance metric is identified). Second, 
as SBERT is optimized for contextual similarity tasks, the 
problem is better suited to utilize the embedding method’s 
strengths. Finally, in a practical sense, as no model train- 
ing is necessary (beyond utilizing the pre-trained embedding 
model), such a model can be more easily applied at scale, 
requiring just a pool of historic answers to compare against. 
We hypothesize that this method may also require fewer ex- 
ample answers than traditional machine learning methods 
as well, but this claim is not deeply explored in this current 
work. 


In applying this method, the set of historic answers $A_0, \ldots, A_{n-1}$ 
is fed through the pre-trained SBERT model to produce 
Table 2: Features for the linear model of the error analysis of the SBERT-Canberra model 


Title                        Description                                            Mean 
Answer Length                Length of the answer                                   10.39 
Average character per word   The average number of characters per word              3.54 
Numbers count                Total number of digits                                 3.54 
Operators count              Total mathematical symbols in the response             1.47 
Equation percent             Percentage of mathematical equations in the answer     0.27 
Presence of Images           Indicator of presence of images in the answer          0.15 


Figure 4: The design of the SBERT-Canberra method, which 
suggests scores based on similarity between the answers. 
(Diagram elements: list of historic answers with scores; answer 
to predict; Canberra_distance($A_{0 \ldots n-1}$, $A_n$); 
min(Canberra_distances); most similar answer.) 


a 768-valued feature vector for each answer; these vectors 
are then stored for later access. Given a new answer, $A_n$, 
a feature vector is similarly produced. In stage two of our 
method, all pairwise comparisons are then made between 
$A_n$ and $A_0, \ldots, A_{n-1}$, calculating Canberra distance [13] for each 
pair. Canberra distance, as opposed to other common 
distance metrics such as Euclidean distance or cosine similarity, is a 
distance metric commonly applied to ranked lists. With this 
metric calculated for all pairs, the historic answers $A_0, \ldots, A_{n-1}$ are 
then min-sorted to identify the historic answer most similar 
to our new answer $A_n$. The score associated with that most 
similar answer is then used as the prediction for $A_n$. The 
design of this approach is outlined in Figure 4. 
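
The second stage described above can be illustrated with a minimal sketch. It assumes the SBERT embeddings have already been computed; the short three-value vectors are hypothetical stand-ins for the 768-valued vectors, and the names `canberra_distance` and `predict_score` are illustrative rather than taken from the authors' code. Canberra distance is computed from its standard definition, the sum over dimensions of |u_i - v_i| / (|u_i| + |v_i|), with 0/0 terms treated as zero.

```python
def canberra_distance(u, v):
    """Standard Canberra distance between two equal-length vectors."""
    total = 0.0
    for a, b in zip(u, v):
        denom = abs(a) + abs(b)
        if denom > 0:  # a 0/0 term contributes nothing
            total += abs(a - b) / denom
    return total


def predict_score(new_vec, historic_vecs, historic_scores):
    """Return the score of the historic answer closest to new_vec."""
    distances = [canberra_distance(h, new_vec) for h in historic_vecs]
    best = min(range(len(distances)), key=distances.__getitem__)
    return historic_scores[best]


# Hypothetical toy embeddings standing in for stored SBERT vectors.
historic_vecs = [[1.0, 2.0, 3.0], [-1.0, 0.5, 2.0]]
historic_scores = [4, 1]
new_vec = [0.9, 2.1, 2.9]

predicted = predict_score(new_vec, historic_vecs, historic_scores)
print(predicted)
```

Because the historic vectors are computed once and stored, scoring a new answer in this sketch only requires one embedding and one pass over the stored vectors for that problem.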


As an additional component of this model, a “fallback” con- 
dition is implemented to be able to produce scoring esti- 
mates for problems where there are no historic answers on 
which to compare. In this case, we train a single multi- 
nomial regression model over all known answers, utilizing 1) 
the number of words in the answer and 2) the average length 
of each word in the answer; this model produces a probabil- 


ity distribution over 5 categorical labels (observing the 0-4 
grading scale as a multinomial regression formulation). This 
one model is trained over all known answers and used then 
only in the case that no historic answers are available for 
the SBERT-Canberra model. This component is viewed as 
being part of our SBERT-Canberra approach. 


3.3 Evaluation of SBERT 


To evaluate our SBERT-Canberra scoring method, we utilize 
the same data and code presented in [9]. In that paper, the 
authors present the usage of a 2-parameter Rasch model [24] 
(equivalent to a traditional item response theory, or IRT, 
model). The purpose of this model is to learn a separate 
parameter for each student and problem presented, repre- 
senting student ability and problem difficulty, respectively. 
The intuition behind the use of this model is to evaluate 
an NLP automated scoring model based solely on its abil- 
ity to interpret the words in each student answer. As the 
score of each answer is likely correlated with student ability 
(or knowledge) and problem difficulty (e.g. easy problems 
are likely to exhibit higher scores), such a model provides a 
reasonable minimum baseline of comparison. By adding a 
model’s scoring estimates as covariates to the Rasch model 
and then comparing the performance of such a model to the 
Rasch model without covariates, we are able to observe the 
true value-added performance of the NLP scoring model. 


Following the same procedure as conducted in [9], we are 
able to directly compare our Sentence-BERT method to 
those presented in that prior work. The models are trained 
and evaluated using a 10-fold student-level cross validation, 
and model performance is compared based on 3 performance 
metrics. First, treating the label as multinomial rather 
than ordinal, AUC is calculated using the method described 
in [10]. Second, the root mean squared error (RMSE) is 
calculated over the ordinal prediction and label. Finally, a 
multi-class kappa is calculated, again using the multinomial 
label representation. The multinomial representations were 
argued to be appropriate due to the likely non-linear 
distribution of scores, while RMSE provides insight into a 
more linear assumption of the data. Arguably, an additional 
rank-based metric such as Spearman’s Rho would also be a 
suitable metric of comparison, but it is not included so as to allow more 
direct comparisons to the previous work. 
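
Two of these metrics are simple enough to sketch directly. The example below is illustrative only: it assumes the multi-class kappa is the standard unweighted Cohen's kappa (the exact variant is not stated here), it omits AUC because that calculation follows the method in [10], and the function names and toy labels are hypothetical.

```python
import math
from collections import Counter


def rmse(labels, preds):
    """Root mean squared error, treating the 0-4 scores as numeric."""
    squared = [(y - p) ** 2 for y, p in zip(labels, preds)]
    return math.sqrt(sum(squared) / len(labels))


def cohen_kappa(labels, preds):
    """Unweighted Cohen's kappa, treating scores as unordered categories."""
    n = len(labels)
    observed = sum(y == p for y, p in zip(labels, preds)) / n
    label_counts = Counter(labels)
    pred_counts = Counter(preds)
    # Chance agreement from the two marginal distributions.
    expected = sum(label_counts[c] * pred_counts[c] for c in label_counts) / (n * n)
    if expected == 1.0:
        return float("nan")  # undefined when both sides use a single class
    return (observed - expected) / (1.0 - expected)


# Hypothetical teacher scores and model predictions.
teacher_scores = [0, 1, 2, 2]
model_scores = [0, 1, 1, 2]
print(rmse(teacher_scores, model_scores))
print(cohen_kappa(teacher_scores, model_scores))
```

RMSE penalizes a prediction in proportion to how far it falls from the teacher's score, whereas kappa counts only exact matches, which is why the two can move differently between models.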


3.4 Approach to Error Analysis of the SBERT-Canberra Method 


In evaluating the SBERT-Canberra method, it is impor- 
tant to explore limitations of the approach in order to iden- 
tify where the model does well and where it may yet im- 
prove through future iteration. As such, we also conduct 
an exploratory error analysis of the method using the data 
Table 3: Rasch model performance compared to the models developed in Erickson et al. [9] 


Model                          AUC     RMSE    Kappa 
Current Paper 
  Rasch* + SBERT-Canberra      0.856   0.577   0.476 
Erickson et al. 2020 
  Baseline Rasch               0.827   0.709   0.370 
  Rasch + Number of Words      0.829   0.696   0.382 
  Rasch* + Random Forest       0.850   0.615   0.430 
  Rasch* + XGBoost             0.832   0.679   0.390 
  Rasch* + LSTM                0.841   0.637   0.415 

*These Rasch models also included the number of words. 


collected from the QUICK-Comments pilot study. Toward 
this, we fit two regression models that use absolute 
model error as the dependent variable. By exploring 
characteristics of student answers in the context of this 
modeling error, we can observe which aspects correlate most with 
higher prediction error. Similarly, we then apply a 
multi-level model to observe which of the student-, problem-, and 
teacher-level identifiers most explains any observed 
modeling error. 


3.4.1 Uni-level Linear Model 

The uni-level linear model is based on student answer level 
characteristics. The student answer level characteristics are 
comprised of a set of six answer-level features extracted from 
the student open response data. These features are listed in 
Table 2. In calculating these features, the answer is first 
tokenized using the Stanford NLP tokenizer [16], dividing each 
textual answer into smaller tokens. For example, if the 
response to a particular problem is “I got 2/9 by dividing by 
4”, a simple tokenizer that splits this response text by spaces 
would give the list of tokens (“I”, “got”, “2/9”, 
“by”, “dividing”, “by”, “4”). Then, from the tokenized data, 
we separate the tokens consisting of either digits or 
mathematical symbols. The number of such tokens is divided 
by the total number of tokens to calculate the equation 
percentage³. The average equation percentage calculated by the 
procedure mentioned above is 27% across the entire dataset. 
For calculating the length of the answer text, we count the 
total words in the text simply by splitting them by space. 
The average length of answers across the dataset is 10.39. 
Similarly, within each response, the number of numeric dig- 
its (i.e. Numbers count) and number of operator characters 
(i.e. Operators count) are counted independent of the to- 
kens. 
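
The feature calculations above can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: it splits on whitespace instead of using the Stanford NLP tokenizer, the set of operator characters in `MATH_OPERATORS` is a guess because the source does not list them, and the image indicator is omitted because the URL format is not given. The name `answer_features` is hypothetical.

```python
# Assumed operator characters; the source does not enumerate them.
# Note that "-" would also count hyphens inside ordinary words.
MATH_OPERATORS = set("+-*/=<>^%")
DIGITS = set("0123456789")


def answer_features(answer):
    """Compute five of the six answer-level features for one response."""
    tokens = answer.split()  # whitespace split, as in the worked example
    n_tokens = len(tokens)
    math_chars = MATH_OPERATORS | DIGITS
    # A token counts as mathematical if every character is a digit or operator.
    math_tokens = [t for t in tokens if all(ch in math_chars for ch in t)]
    return {
        "answer_length": n_tokens,
        "avg_chars_per_word": sum(len(t) for t in tokens) / n_tokens if n_tokens else 0.0,
        "numbers_count": sum(ch in DIGITS for ch in answer),
        "operators_count": sum(ch in MATH_OPERATORS for ch in answer),
        "equation_percent": len(math_tokens) / n_tokens if n_tokens else 0.0,
    }


features = answer_features("I got 2/9 by dividing by 4")
print(features)
```

For the example response, two of the seven tokens (“2/9” and “4”) are mathematical, so the equation percentage under these assumptions is 2/7, or roughly 29%.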


ASSISTments [11] allows students to upload images as part 
of the response to open-ended questions; this is most 
commonly a picture taken of work done on paper. The response 
text in such cases includes the URL of the image uploaded to 
the system. About 15% of the total responses in the dataset 
contain images. Some of these responses are entirely images, 
whereas in others, some text is provided as context. Since 
these scoring models are not yet designed to support im- 
ages, we hypothesize that the images’ presence contributes 


³We acknowledge that this feature is a misnomer as it 
includes numeric terms, operators, and expressions as well as 
equations, but chose this feature name for sake of brevity. 


significantly to the modeling error. 


A simple linear regression model is fit to the pilot study 
student answers, observing absolute model error as the de- 
pendent variable. This value is calculated by simply sub- 
tracting the predicted score from the teacher-provided label 
(as a linear label), and taking the absolute value. In this 
case, a value of 0 would indicate a correct estimate, while 
higher values (up to 4) represent greater prediction error; 
we do not differentiate between under- and over-predicting 
in this analysis. 


3.4.2 Multi-level Linear Model 

The uni-level linear model observes features that describe 
characteristics of the student responses, but as described 
in Section 3.1, modeling error may not be confined to just 
characteristics of the responses themselves. It is very likely 
that modeling error can be attributable to other external 
factors at the student-, problem-, and teacher-levels. 


To explore this possibility, we apply a multi-level linear 
model observing the student answer characteristics as fixed 
effects, and student, problem, and teacher identifiers as three 
separate level-2 random effect variables. As it is the case 
that the same student may write multiple answers within our 
data, this structure is similar to that of a repeated-measure 
analysis. 


abs(model error) = Answer Covariates 
                   + (1 | student identifier) 
                   + (1 | problem identifier) 
                   + (1 | teacher identifier)          (1) 


Again observing absolute prediction error as the dependent 
variable, this analysis will be able to answer 1) whether the 
majority of explainable variance exists at the student-answer 
level or at a higher level, and 2) which of the student-, problem-, 
and teacher-level identifiers most explains variance in our 
modeling error (e.g. which of these identifiers is most 
correlated with the error). The equation, expressed as its R code 
formulation, is reported as Equation 1. 
Table 4: The resulting model coefficients for the uni-level linear regression model and random and fixed effects of the multi-level 
linear model of absolute error. 


                       Uni-level Linear            Multi-level Linear 
Random Effects         Variance    Std. Dev.       Variance    Std. Dev. 
Student                -           -               0.034       0.185 
Problem                -           -               0.313       0.559 
Teacher                -           -               0.048       0.851 

Fixed Effects          B           Std. Error      B           Std. Error 
Intercept              0.581***    0.017           0.772***    0.070 
Answer Length          -0.008***   0.001           -0.009***   0.001 
Avg. Word Length       -0.014***   0.003           -0.013**    0.003 
Numbers Count          <0.001      <0.001          <0.001      <0.001 
Operators Count        -0.006***   0.001           0.002       0.001 
Equation Percent       0.443***    0.018           0.080***    0.022 
Presence of Images     2.248***    0.021           1.858***    0.028 

*p<0.05 **p<0.01 ***p<0.001 


4. RESULTS 


4.1 SBERT Model 

The results of the SBERT model are compared directly to the 
results from Erickson et al. [9] in Table 3. As can 
be seen in that table, the SBERT-Canberra method 
outperformed the baseline as well as all previous models across all 
three metrics. While the difference in AUC between 
our method and the previous best approach is notably small 
(0.856 versus 0.850), the differences in both RMSE and Kappa appear to be 
comparatively larger. To interpret these two metrics, the 
results suggest that we should expect teachers to agree with 
our method’s estimates about 47% of the time after accounting for 
random chance, and that the method is likely to be wrong by just over half a 
grade point on average. This also suggests, however, 
that there is still plenty of room for improvement of these 
models. 


What is also worth noting from the results of Erickson et 
al. [9] is the high performance of the baseline Rasch model. 
This emphasizes the difficulty of this NLP modeling task: 
the baseline model uses nothing other than the 
student and problem identifiers, yet it is seemingly able to 
predict teacher-provided scores with an AUC above 0.8 without 
using any part of the student response. There is only a 0.03 
AUC difference between that baseline model and our current 
proposed method. This suggests that these external factors 
may be explaining a large portion of the student scores, and 
may subsequently explain a large portion of our prediction 
error. 


4.2 Error Analysis of SBERT 

In exploring this further, the results of the error analysis of 
the SBERT-Canberra method are presented in Table 4. It 
is found that the uni-level linear model explains 38.6% of 
the variance of the outcome as given by R-squared. Out of 
the six student answer-level features, nearly all were found 
to be statistically reliable predictors of model error; in 
verifying these results, it was found that all included covariates 
exhibited inter-correlations less than 0.3 (suggesting a 
moderately low impact of multicollinearity potentially skewing 
the interpretation of these results). On close examination, 
however, many of these coefficients are found to be close to 0 
despite being statistically reliable, suggesting 
very little meaningful correlation with the modeling error. 
This is not the case for two of these variables, 
Equation Percent and Presence of Images, for which we see more 
meaningful coefficients. Given the direction 
of these values, this suggests that the presence of mathematical elements as 
well as the presence of images (unsurprisingly) both 
correlate with higher prediction error. It follows, then, 
that further improvements to the SBERT-Canberra method 
should explore methods of better representing and 
accounting for these mathematical terms in student responses; 
similarly, though likely much more difficult, incorporating an 
aspect of image recognition could be another area worth 
exploring. 


In regard to the multi-level linear model, accounting for 
student, problem, and teacher identifiers each as random 
effects, we see that the inclusion of these level-2 factors ex- 
plains some of the impact of the fixed effects (also in Ta- 
ble 4). Here it is found that all but two of the fixed effects 
are statistically reliable. It is also found that the magnitude 
of the coefficients for the Equation Percent and Presence of 
Images is also reduced. This suggests that, perhaps, the 
student and/or problem identifiers partially explain these 
correlations (some problems may be more likely to have re- 
sponses with images or mathematical terms in them, or some 
students may be more inclined to use images or such terms 
more than others). What is worth noting, however, is that 
it was found that the level-2 variables account for 55.5% of 
the variance of the outcome. This suggests that a majority 
of the modeling error can be explained by these factors that 
are external to the student answers. 


Looking at the variance of the random effects, it can be seen 
that the problem level identifiers contribute most in terms 
of explaining the variance of the outcome. It is certainly the 
case that the SBERT-Canberra method is accounting for 
each individual problem in producing its estimates (e.g. it 
only observes historic answers within each unique problem), 
but it would seem that there are other problem-level factors 
that are not being accounted for within this approach. 


5. LIMITATIONS AND FUTURE WORK 


In regard to our approach as well as in light of our findings, 
there are several limitations and opportunities for future 
directions. While the SBERT-Canberra approach, utiliz- 
ing sentence-level embeddings, outperforms the previously- 
developed models in predicting scores for open responses, 
the difference in AUC is rather small; the fact that the 
method produces a classification (as opposed to a probabil- 
ity as is often the case with such models) likely impacts its 
AUC performance. The manner in which the method makes 
its prediction can be considered a greedy approach in that 
only the closest historic answer is used to predict the score. 
Instead, a weighted vote approach using all historic scores 
(or a subset of similar scores above an identified threshold) 
may improve the model by allowing for some degree of un- 
certainty. Similarly, the use of the word count model as 
a fallback may further be improved; while it was the case 
that there were very few instances of problems not having 
enough data within the cross validation, improving this fall- 
back method may help to improve the model when applied 
in practical settings where the “cold start” problem is more 
prevalent; as the method currently relies heavily on having 
a sufficiently-sized pool of human-scored historic answers, 
future research can focus on utilizing unlabeled student an- 
swers or exploring other unsupervised methods that may 
additionally support these methods in cases where labeled 
data is scarce. 
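
The weighted vote idea mentioned above could take many forms; the sketch below shows one hypothetical reading and is not a method evaluated in this work. It assumes inverse-distance weights over the `k` closest historic answers, reuses the standard Canberra distance definition, and returns a normalized weight per score rather than a single classification. The names `weighted_vote`, `k`, and `eps` are illustrative.

```python
from collections import defaultdict


def canberra_distance(u, v):
    """Standard Canberra distance; 0/0 terms contribute nothing."""
    total = 0.0
    for a, b in zip(u, v):
        denom = abs(a) + abs(b)
        if denom > 0:
            total += abs(a - b) / denom
    return total


def weighted_vote(new_vec, historic_vecs, historic_scores, k=3, eps=1e-9):
    """Distribute normalized inverse-distance weight over the k nearest scores."""
    ranked = sorted(
        (canberra_distance(h, new_vec), s)
        for h, s in zip(historic_vecs, historic_scores)
    )
    votes = defaultdict(float)
    for dist, score in ranked[:k]:
        votes[score] += 1.0 / (dist + eps)  # closer answers weigh more
    total = sum(votes.values())
    return {score: weight / total for score, weight in votes.items()}


# Hypothetical toy embeddings standing in for SBERT vectors.
historic_vecs = [[1.0, 2.0, 3.0], [1.1, 2.0, 3.0], [-1.0, 0.5, 2.0]]
historic_scores = [4, 3, 1]
distribution = weighted_vote([1.0, 2.0, 3.1], historic_vecs, historic_scores, k=2)
print(distribution)
```

A distribution of this kind would give a probability-like output for AUC calculations, in contrast to the single label produced by taking only the closest answer.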


While the SBERT-Canberra model performed arguably well, 
the error analysis revealed several areas where this approach, 
as well as others, may focus in future works. Most 
notably, as highlighted, the use of mathematical expressions 
and terms was found to be correlated with higher error; 
improving the representation of such elements can certainly 
be addressed in future work. A limitation of this analysis, however, is 
that both models left variance in the outcome unexplained. 
We chose to look at these factors based on hypotheses and 
anecdotal observations, but there may be other large factors 
that can explain more of the error that we are seeing. Sub- 
sequent works could conduct more thorough surveys of both 
answer-level and higher-level factors. Future works can also 
explore additional model structures and language features 
that may lead to improvements to performance. The anal- 
yses presented in this work, however, can act as a baseline 
to further evaluate if future iterations of our approach truly 
improve upon these identified areas. 


It is also the case that this work focuses only on models that 
predict numeric assessment scores, while we strongly believe 
that it will be equally, if not more important to additionally 
develop methods to suggest or generate directed feedback 
for these student answers; teachers use textual feedback 
messages to offer constructive guidance to students, but it 
is often a very time-consuming task to write these messages 
for each student’s answer. We believe that the SBERT-Canberra 
approach can be extended to support this task 
as well, where such a model may be able to recommend 


feedback to new student answers that has been previously 
given to an identified similar historic answer. Future work is 
intended to explore these methods further for such feedback- 
suggestion tasks. 


6. CONCLUSION 


In this paper, we have presented a novel approach to 
formulating and addressing the problem of automating the 
assessment of student open-ended work. We have illustrated that 
our SBERT-Canberra method outperformed a previously- 
established benchmark, but still exhibits areas where it may 
be able to improve. Through the conducted error analy- 
sis, we have identified areas where more advanced meth- 
ods of image processing and natural language processing (or 
math language processing), may lead to further improve- 
ments. With all of this, however, it was also identified that 
problem-level features appear to be most impactful in ex- 
plaining the variance of modeling error; this is particularly 
surprising as variations in teacher grading were previously 
hypothesized to be a larger factor in this context. 


With the findings from the study, our goal next is to use 
them to overcome the limitations mentioned above and guide 
our focus on improving the methods for assessment of open- 
ended questions in mathematics. It is the goal of this work 
to act as a step toward building better teacher supports for 
these types of open-ended problems, as well as provide others 
with guidance toward the same or similar goals. 
ABSTRACT 


With the emergence of MOOCs, it becomes crucial to automate the 
process of course design to accommodate the diverse learning 
demands of students. Modeling the relationships among educational 
topics is a fundamental first step for automating curriculum planning 
and course design. In this paper, we introduce Topic Transition Map 
(TTM), a general structure that models the content of MOOCs at 
the topic level. TTMs capture the various ways instructors organize 
topics in their courses by modeling the transitions between topics. We 
investigate and analyze four different methods that can be exploited 
to learn the Topic Transition Map: 1) Pairwise Constrained K-Means, 
2) Mixture of Unigram Language Model, 3) Hidden Markov Mixture 
Model, and 4) Structural Topic Model. To evaluate the effectiveness 
of these methods, we qualitatively compare the topic transition maps 
generated by each model and investigate how the Topic Transition 
Map can be used in three sequencing tasks: 1) determining the 
correct sequence, 2) predicting the next lecture, and 3) predicting the 
sequence of lectures. Our evaluation revealed that PCK-Means has 
the highest performance in the first task, HMMULM outperforms 
other methods in task 2, while there is no clear winner in task 3. 


Keywords 
Topic Transition Map, Topic Transition, Word Distribution, Mixture 
Model, Hidden Markov Model, Clusters, Sequencing Tasks. 


1. INTRODUCTION 


For many decades, the process of creating courses has been a manual 
task that needs to be carefully managed by instructors and experts. 


However, with the recent advances in technologies and the emergence 
of Massive Open Online Courses (MOOCs), it becomes critical 
to automate the process of course design to accommodate the 
heterogeneity of online students and their diverse needs. According 
to [32], learning on demand is considered one factor that causes 
the high dropout rate in MOOCs. Learners have different learning 
demands depending on their motivations and goals. For instance, 
learners may seek knowledge about an interdisciplinary domain 
and hence need to learn modules from courses in several areas. 
This problem requires adopting a model in which MOOCs are used 
as modularized resources, rather than a set of pre-designed static 
courses. A crucial first step toward developing such a model is the 
automation of course plan design by sequencing lectures among 
different courses. 


The main principle in designing the curriculum of any course is to 
organize course content according to some relations between topics. 
For instance, to help students to learn the materials, instructors 
carefully organize lectures as a sequence, based on the difficulty 
levels of topics [10, 27, 1] as well as the dependency relations 
between topics [11, 21, 23, 1]. The fundamental sequential structure 
of a course design is to place topics that are easy or prerequisite 
in earlier lectures while more advanced and dependent topics are 
taught in later lectures [1]. Consequently, modeling the relatedness 
among educational topics is a very crucial first step for automating 
curriculum planning and course design. 


Modeling the content structure of MOOCs has recently attracted 
much research. Most of the current research has focused on modeling 
the prerequisite relationships between courses [29, 15], between lec- 
tures or segments of lectures [6, 7], or between concepts discussed 
within or across courses [3, 14, 17, 29, 15]. Using concepts to model 
MOOCs’ content can be easily generalized to capture the relations 
in the concept space. However, because concepts are represented 
as keywords or phrases, it is hard to capture the different levels of 
granularity between lectures and courses. In addition, modeling pre- 
requisite relationships between concepts cannot capture the various 
learning paths accommodated by different courses. 


In this paper, we introduce the Topic Transition Map, a general 
structure that models the educational materials at the topic level. 
We model a course as a set of topics, and each topic is a set of 
concepts. Modeling content at the topic level is a more natural way 
to design custom course plans. We can think of a course as a path in 
the generalized Topic Transition Map. Thus, designing a new course 
becomes a task of identifying a path in the Topic Transition Map. 
Additionally, we investigate four methods that can be leveraged to 
construct the Topic Transition Map: Pairwise Constrained K-Means 
(PCK-Means) [2], Mixture of Unigram Language Model (MULM), 
Hidden Markov Mixture Model (HMMULM), and Structural Topic 
Model (strTM) [24]. We analyze and compare the Topic Transition 
Maps learned by these methods by studying how to exploit the 
Topic Transition Map in three sequencing tasks: 1) determining the 
correct sequence, 2) predicting the next lecture, and 3) predicting 
the sequence of lectures. To the best of our knowledge, this is the 
first work to introduce and investigate the use of Topic Transition 
Maps in modeling MOOC content and sequencing lectures. 


To evaluate the effectiveness of all methods, we use real MOOCs from 
three different domains: Python, Structured Query Language, and 
Machine Learning Clustering algorithms. Our evaluation revealed 
that while the PCK-Means has the highest performance on the task 
of finding the best sequence from a list of possible sequences, the 
HMMULM achieves the best performance on the task that predicts 
the next lecture in the sequence. Additionally, all methods perform 
similarly in the task of predicting the whole sequence, with MULM 
having the lowest performance as it sometimes cannot predict the whole 
sequence. In addition to comparing various models in sequencing 
tasks, we visualize the Topic Transition Maps generated by different 
methods to qualitatively compare the resulting topic transition maps. 
We found that PCK-Means has extracted more meaningful topics 
with the best word distributions that clearly explain each topic. 


The rest of the paper is organized as follows. In section 2, we present 
some of the related work. Section 3 defines the topics and topic 
transitions and states some applications of the topic transition maps. 
In section 4, we formally define our problem before describing the 
four different methods we exploit to construct the topic transition 
maps in section 5. Section 6 elaborates on our approach for the 
evaluation and the analysis of various models. Finally, we conclude 
our work in section 7. 


2. RELATED WORK 

Most of the work that models the content of MOOCs has focused 
on capturing the prerequisite relationships using different levels of 
granularity such as courses [29, 15], lectures or segments of lectures 
[6, 7], or concepts discussed within or across courses [3, 14, 17, 29, 
15]. Modeling the relations between courses, lectures, or segments of 
lectures is restricted to these units and cannot be generalized. While 
modeling dependency relations between concepts is considered a 
general structure that captures the required concepts before learning 
any concept, prerequisite relations cannot model the various learning 
paths accommodated by different courses. ALSaad and Alawini [2] 
have addressed this problem by proposing the precedence graph 
that captures the similarities and variations of learning paths among 
different courses. We build on their work and introduce the Topic 
Transition Map that maps each lecture to a topic and leverages the 
sequences of lectures among courses to capture the topic transition 
patterns and hence the likelihood of such transitions. The main 
difference between the Topic Transition Map and the precedence 
graph is that the Topic Transition Map models self transitions between a 
topic and itself and also captures how likely each topic is to be the first 
topic in courses. While ALSaad and Alawini [2] have investigated the 
use of PCK-Means in modeling the precedence graph, in this paper, 
we explore three more methods in addition to PCK-Means, namely 
MULM, HMMULM, and strTM, for modeling Topic Transition 
Maps. We also examine the impact of the learned topic transition 
maps on three different sequencing tasks. We believe that this is the 
first work that examines the use of topic transitions modeled from 
existing MOOCs to learn how to sequence new courses. 


Some research has investigated the use of prerequisite relations 
between concepts to construct and sequence learning units [1, 16]. 
Both studies [1, 16] have developed supervised approaches based 
on feature engineering that extracted features from some external 
knowledge such as Wikipedia [1] and DBpedia [16] to infer the 
prerequisite relations between concepts. Our work is different as 
instead of modeling the prerequisite relations between concepts 
using supervised approaches, we model the Topic Transition Map or 
the various paths between topics using unsupervised methods, where 
a topic is a set of concepts. In addition, our methods rely only on the 
content of MOOCs without using any external knowledge. While 
Agrawal et al. [1] used the concept dependency graph to organize 
concepts to construct learning units and then sequence the learning 
units, we use lectures from existing MOOCs and investigate the 
impact of the learned transitions between topics to sequence lectures. 


The most relevant research to our study is the work by Shen et al. [22]. 
Shen et al. [22] have proposed a method for linking similar courses 
to construct a map of lectures connected by two types of relations: 
similar and prerequisite. The constructed map only captures the 
similarity and prerequisite relations between certain units (lectures) 
and is not generalized to other lectures and thus cannot be used to 
predict the sequence of new lectures. In this paper, we map lectures 
to topics and construct the Topic Transition Map that depicts the 
precedence relations between topics and hence is not tied to any 
specific units. Having a generalized Topic Transition Map can help 
in finding the sequence of lectures or predicting the next lecture in the 
sequence, as we discuss in section 6.2. 


Another related line of research is the work on structural topic 
modeling by the natural language processing (NLP) community. 
In NLP, topic transitions have been used to model latent topical 
structures inside documents by assuming each sentence is generated 
from a topic, where topics satisfy the first-order Markov property 
[12, 25]. While Gruber et al. [12] only modeled the transition between 
topics as a binary relation (either remain on the current topic or 
shift to a new topic with a certain probability), Wang et al. [25] 
have developed a Structural Topic Model called strTM to explicitly 
model the topic transitions as probabilities that capture how likely 
one transits from a topic to another. Modeling transitions has been 
used in many NLP applications such as sentence ordering 
[25], topic segmentation [9], and multi-document summarization 
[28]. In this paper, we investigate the use of topic transitions for 
modeling the topical structures in MOOCs by assuming a lecture 
is generated by one topic, and we use the sequences of lectures to learn 
the transitions between topics. We also explore the impact of using 
the Topic Transition Map to sequence lectures in three different 
sequencing tasks. 


3. TOPIC TRANSITIONS 


Before defining the topic transitions, it is important to briefly explain 
our representation of topics used in this paper. Similar to the definition 
of topics in the literature of the topic modeling research [5, 13], 
we define a topic as a distribution of concepts where concepts 
with higher probabilities tend to explain or characterize the topic. 
Concepts can be represented as words or phrases of words [3, 18, 26]. 
Each lecture is a composition of concepts and hence can be mapped 
to some topics. Depending on the length of lectures, lectures can 
cover one or more topics. Longer lectures usually cover more topics 
than shorter lectures. For example, traditional university lectures 
tend to be more elaborate and longer than MOOC 
lectures, which are usually concise and short. Therefore, 
the number of topics per lecture discussed in MOOCs is smaller than 
that of traditional university lectures. In this paper, since our work 
focuses on learning the topic transitions from MOOCs, we assume 
that each lecture is mapped to only one topic. This assumption is 
reasonable as lectures in MOOCs are concise and short. 
It is also useful because it lets us leverage 
the sequences of lectures to learn the relations between topics. 


A topic transition captures the precedence relations between topics. 
In other words, it expresses how likely instructors are to move from 
one topic to another in the course delivery. It models the various 
ways in which instructors dynamically assemble concepts from the 
concept space in order to construct the study plan of their courses. 
For instance, some instructors decide to start their Python course by 
explaining the topics data types, conditional statements, loops, and 
then reading from and writing to files. Other instructors may choose a 
different order, such as conditional statements, loops, strings, and then 
lists. By leveraging the sequences of lectures from multiple courses, 
we can infer the latent topics of each lecture and hence model the 
common transition patterns shared by multiple courses as well as the 
variations of different transitions or paths. To determine the strength 
of a transition, or how common it is, each transition is assigned a 
score or a probability. For example, consider the Computer 
Science programming topics “Conditional Statements”, “Loops”, and 
“Arrays”. It is more likely that instructors will explain the topic 
“Conditional Statements” immediately before the topic “Loops”, and 
thus the topic transition score between them would be higher than 
the transition score between the topic “Conditional Statements” and 
the topic “Arrays”. 


Learning the topic transitions can be the initial block to facilitate 
several useful applications that can support modern learning. For 
instance, we can use the Topic Transition Map to extract the most 
common paths of topics in the field or explain the topic space in 
the current MOOC offerings. Learners can use transition maps to 
get more insights about the structure of topics in MOOC offerings. 
On the other hand, instructors can use these maps to improve their 
course offerings by examining the topic structure of related courses. 


One important application of the Topic Transition Map is to support 
automatic curriculum planning and course design. Since courses 
consist of topics, learning the relations between topics would be the 
initial step to understand how likely instructors transit from one topic 
to another. We can think of a course as a path in the generalized 
Topic Transition Map. Thus, designing a new course becomes a task 
of identifying a path in the Topic Transition Map. In this paper, we 
analyze how we can use the learned topic transitions to sequence 
new courses. 


4. PROBLEM FORMULATION 


In this section, we formally formulate our problem. We are given a set of 
courses $C = \{X_1, X_2, X_3, \ldots, X_N\}$ from a particular domain, 
where $N$ is the total number of courses. We assume that courses in 
$C$ are similar and hence have some content overlap between them 
and also have the same difficulty level (e.g., Beginner, Medium, or 
Advanced). A course $X_i$ is represented as an ordered list of lectures 
$X_i = (x_{i1}, x_{i2}, \ldots, x_{i|X_i|})$, where $|X_i|$ is the total number of 
lectures in the course $X_i$. Each lecture is a composition of concepts 
represented in some narrative way. In this paper, we treat a 
concept as a single word and hence lectures are represented using a 
bag-of-words representation. 
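
The representation described above can be illustrated with a minimal sketch. The tokenizer (lowercasing and whitespace splitting), the toy transcripts, and the stop-word list are illustrative assumptions, not the preprocessing used in the paper; the function name `bag_of_words` is hypothetical.

```python
from collections import Counter


def bag_of_words(transcript, stop_words=frozenset()):
    """Map a lecture transcript to word counts, ignoring word order."""
    tokens = [t for t in transcript.lower().split() if t not in stop_words]
    return Counter(tokens)


# A course is an ordered list of lectures; each lecture becomes a bag of words.
# The transcripts and stop words below are made up for illustration.
course = [
    "variables and data types in python",
    "conditional statements and boolean tests",
    "loops and loops over lists",
]
stop_words = frozenset({"and", "in", "over"})
course_bow = [bag_of_words(lecture, stop_words) for lecture in course]
```

The order of lectures inside `course` is preserved, which is what later allows transitions between topics to be learned, while the order of words inside each lecture is discarded.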


Given the number of topics $M$, our goal is to map each lecture to a 
topic and leverage the sequences of lectures to learn topic transitions 
and construct the Topic Transition Map. The Topic Transition Map 
is represented as a matrix $A \in \mathbb{R}^{M \times M}$. Each entry $a_{ij}$ of 
the matrix $A$ represents the likelihood of the transition from topic $i$ 
to topic $j$. It reflects how common the precedence relation from topic 
$i$ to topic $j$ is in the dataset courses. In addition to the Topic Transition Map, 
we also aim to learn the probability of each topic being an initial 
topic in courses. We denote the initial probability of each topic as 
a vector $\pi \in \mathbb{R}^{M}$. Along with the Topic Transition Map 
and the initial probability of each topic, it is important to model 
the word distribution of each topic, which is represented as a matrix 
$B \in \mathbb{R}^{M \times V}$, where $V$ is the vocabulary size. 


5. MODELING TOPIC TRANSITIONS 


In this section, we explain the four different models we exploit to 
capture topics and Topic Transition Maps. 


5.1 Pairwise Constrained K-Means 

PCK-Means clustering algorithm [4] is a variation of the standard 
K-Means algorithm. To cluster instances, PCK-Means incorporates 
distance between points as well as pairwise constraints to guide the 
clustering process. Since the purpose of clustering is to capture topic 
transition patterns across courses, using PCK-Means helps to restrict 
the clustering process to cluster lectures across courses instead 
of within courses [2]. To guide the clustering, PCK-Means uses 
two types of constraints: Must-Link and Cannot-Link. Must-Link 
constraint determines lecture pairs that need to be clustered together, 
while Cannot-Link constraint specifies pairs that should not be 
grouped into the same cluster. To find the clusters, PCK-Means uses 
an objective function that minimizes both: 1) the distance between 
points (lectures) and the cluster centroid, and 2) the penalty costs of 
violating the constraints. For more information about PCK-Means, 
please refer to [4]. 
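
To make the two terms of the objective concrete, the following is a minimal sketch of how such a cost could be evaluated for a given clustering. It assumes a single uniform penalty for every violated constraint and omits any scaling constants; the exact weighting used in [4] may differ, and the function names are hypothetical.

```python
def squared_distance(u, v):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((a - b) ** 2 for a, b in zip(u, v))


def pckmeans_objective(points, labels, centroids, must_link, cannot_link,
                       penalty=1.0):
    """Distance-to-centroid cost plus a penalty per violated constraint.

    must_link / cannot_link are lists of index pairs (i, j) into `points`.
    The uniform `penalty` is an assumption made for this sketch.
    """
    cost = sum(squared_distance(x, centroids[l]) for x, l in zip(points, labels))
    # A Must-Link constraint is violated when the pair lands in different clusters.
    cost += penalty * sum(1 for i, j in must_link if labels[i] != labels[j])
    # A Cannot-Link constraint is violated when the pair shares a cluster.
    cost += penalty * sum(1 for i, j in cannot_link if labels[i] == labels[j])
    return cost


points = [(0.0, 0.0), (0.0, 1.0), (5.0, 5.0)]
labels = [0, 0, 1]
centroids = [(0.0, 0.5), (5.0, 5.0)]
objective = pckmeans_objective(points, labels, centroids,
                               must_link=[(0, 2)], cannot_link=[(0, 1)])
```

In this toy example both constraints are violated, so the cost is the distance term plus two penalties.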


Similar to ALSaad and Alawini [2], we use PCK-Means to build the 
Topic Transition Map $A$. We first construct the lists of Must-Link and 
Cannot-Link constraints and cluster lectures based on their content 
similarity. We assume that each cluster forms a topic 
and hence we need to learn the word distribution of each topic 
along with the topic transitions. We link clusters by using the precedence 
relations between adjacent lectures and capture the strength of the 
transition by accumulating the frequency of transitions. To find 
the word distribution of each cluster or topic in the matrix $B$, we 
accumulate the vector representations of the lectures that belong to 
the same cluster. For more information, please see [2]. 


In order to estimate the initial probability $\pi$ for each topic, we simply 
count the number of times each topic is the first topic in the 
set of courses $C$. Then we normalize the counts to obtain probabilities. 
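
The counting procedure above can be sketched as follows, given courses that have already been mapped to sequences of topic indices. Row-normalizing the transition counts into probabilities is an assumption of this sketch (the text only states that frequencies are accumulated), and `estimate_transitions` is a hypothetical name.

```python
def estimate_transitions(topic_sequences, num_topics):
    """Estimate initial probabilities and a transition map from topic sequences.

    topic_sequences: one list of topic indices per course, in lecture order.
    Returns (pi, A) where A[i][j] is the normalized count of i -> j transitions.
    """
    counts = [[0.0] * num_topics for _ in range(num_topics)]
    initial = [0.0] * num_topics
    for seq in topic_sequences:
        if not seq:
            continue
        initial[seq[0]] += 1
        # Adjacent lectures give one observed transition each (self transitions included).
        for prev, nxt in zip(seq, seq[1:]):
            counts[prev][nxt] += 1
    total_init = sum(initial)
    pi = [c / total_init for c in initial] if total_init else initial
    A = []
    for row in counts:
        row_sum = sum(row)
        A.append([c / row_sum if row_sum else 0.0 for c in row])
    return pi, A


# Three toy courses over three topics.
pi, A = estimate_transitions([[0, 1, 2], [0, 1, 1, 2], [1, 0, 2]], num_topics=3)
```

Topics that never appear as the source of a transition keep an all-zero row here; a smoothed variant would be needed if such rows must remain valid distributions.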


5.2 Mixture of Unigram Language Models 

To capture topics, we use a mixture model of $M$ unigram language 
models (MULM) with a bag-of-words representation. The mixture 
model is a generative probabilistic model that has been used for 
document clustering. Thus, it helps in clustering lectures based 
on their topics, where each lecture belongs to only one cluster or one 
topic. In the mixture model, to generate a document, one first 
chooses the topic of the document according to the probability 
$P(\theta_i)$, $i = 1, \ldots, M$, where $M$ is the number of topics, and then generates all the 
words in the document using the probability $P(w \mid \theta_i)$. According 
to the model, the likelihoods of a document $x$ and the corpus $C$ are 
calculated as follows, where $c(w, x)$ is the count of word $w$ in document $x$: 

$$ P(x \mid \lambda) = \sum_{i=1}^{M} P(\theta_i) \prod_{w \in V} P(w \mid \theta_i)^{c(w, x)} \tag{1} $$

$$ P(C \mid \lambda) = \prod_{x_j \in C} P(x_j \mid \lambda) \tag{2} $$


To estimate the model parameters $\lambda = (\{\theta_i\}, P(\theta_i))$, where $\theta_i$ is the 
word distribution of topic $i$ from the matrix $B$, we use the Expectation-Maximization 
algorithm [8] to find the parameters that maximize 
the likelihood of the data: 

$$ \hat{\lambda} = \arg\max_{\lambda} P(C \mid \lambda) \tag{3} $$


After learning the parameters, we map each lecture to a cluster or 
topic by maximizing the following equation: 


$$ c = \arg\max_{z_i} P(x \mid z_i) \tag{4} $$


Using the mixture model can help in clustering lectures according 
to their topics; however, it does not capture the transition patterns 
between topics or the initial probability of each topic. Therefore, 
similar to the PCK-Means method, we leverage the sequences of 
lectures to calculate the score of topic transitions and construct the 
Topic Transition Map $A$. Likewise, we count the number of times 
courses start with each topic and normalize the results to model the 
initial probability $\pi$. 
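
The estimation in equations 1 to 4 can be sketched as a small Expectation-Maximization loop over bag-of-words documents. This is a minimal illustration, not the authors' implementation: the deterministic initialization from evenly spaced documents, the additive smoothing constant, and the fixed iteration count are all assumptions, and the function names are hypothetical.

```python
import math


def log_sum_exp(values):
    m = max(values)
    return m + math.log(sum(math.exp(v - m) for v in values))


def log_emission(doc, word_probs):
    """log P(x | topic) = sum over words of count * log P(w | topic)."""
    return sum(count * math.log(word_probs[w]) for w, count in doc.items())


def fit_mulm(docs, vocab_size, num_topics, iters=30, smoothing=0.01):
    """docs: list of {word_id: count}. Returns (prior, B)."""
    # Assumed initialization: seed each topic from an evenly spaced document.
    B = []
    for k in range(num_topics):
        seed_doc = docs[(k * len(docs)) // num_topics]
        row = [smoothing + seed_doc.get(w, 0) for w in range(vocab_size)]
        total = sum(row)
        B.append([v / total for v in row])
    prior = [1.0 / num_topics] * num_topics
    for _ in range(iters):
        # E-step: posterior probability of each topic for each document.
        resp = []
        for doc in docs:
            logp = [math.log(prior[k]) + log_emission(doc, B[k])
                    for k in range(num_topics)]
            norm = log_sum_exp(logp)
            resp.append([math.exp(v - norm) for v in logp])
        # M-step: re-estimate the topic prior and the word distributions.
        for k in range(num_topics):
            mass = sum(r[k] for r in resp)
            prior[k] = (mass + smoothing) / (len(docs) + num_topics * smoothing)
            row = [smoothing] * vocab_size
            for r, doc in zip(resp, docs):
                for w, count in doc.items():
                    row[w] += r[k] * count
            total = sum(row)
            B[k] = [v / total for v in row]
    return prior, B


def assign_topic(doc, B):
    """Equation 4: pick the topic under which the document is most likely."""
    scores = [log_emission(doc, B[k]) for k in range(len(B))]
    return max(range(len(B)), key=scores.__getitem__)


docs = [{0: 5, 1: 3}, {0: 4, 1: 4}, {2: 6, 3: 2}, {2: 3, 3: 5}]
prior, B = fit_mulm(docs, vocab_size=4, num_topics=2)
topics = [assign_topic(d, B) for d in docs]
```

Once every lecture has a topic, the transition map and initial probabilities can be obtained by counting over the resulting topic sequences, as described in the paragraph above.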


5.3 Hidden Markov Mixture Models 


Instead of separately clustering lectures and then learning the transitions 
between them, using a Hidden Markov Model allows us 
to jointly learn the word distributions of each topic ($B$), the transition 
probabilities between topics ($A$), and the initial probability of 
each topic ($\pi$). 


The Hidden Markov Model (HMM) is a probabilistic graphical model 
that describes the process of generating a sequence of observable 
events according to some hidden factors [20]. It simulates how 
real-world sequence data is generated from hidden states. In particular, 
it consists of two stochastic processes: 1) an invisible process, and 2) a 
visible process [30]. In an HMM, the invisible process consists of hidden 
states, whereas the visible process is the observed sequence of symbols that 
are drawn from the probability distributions of the hidden states. 
Figure 1 illustrates the HMM. As Figure 1 shows, 
each observable event in the sequence is generated from a hidden 
state, and observations are conditionally independent given the hidden 
state. The hidden states form a Markov chain 
where each hidden state depends only on the previous state; for example, 
$Z_{t+1}$ depends on $Z_t$. 


To control the process of generating the observed sequences from 
hidden states, an HMM has three parameters: $\pi$, $A$, and $B$. The first 
parameter, $\pi = (\pi_1, \pi_2, \ldots, \pi_M)$, is the initial probability distribution 
over the hidden states. The parameter $\pi$ determines the probability 
of the Markov chain starting at each state and hence controls which 
state can be chosen as the initial state for the observed sequence. 
The second parameter, $A \in \mathbb{R}^{M \times M}$, is the transition probability 
matrix that specifies how likely the model is to transit from one state 
to another, denoted by $P(Z_{t+1} \mid Z_t)$ in Figure 1. The third parameter, 
$B \in \mathbb{R}^{M \times V}$, is the emission probability matrix, where $V$ is the total 
number of distinct symbols. It determines the likelihood of each 
state producing each symbol, denoted by $P(X_{t+1} \mid Z_{t+1})$ in Figure 
1. For example, to generate a sentence, a sequence of words would 
be drawn from the HMM according to the three parameters 
$\pi$, $A$, and $B$. 


In MOOCs, we only observe courses, where courses are sequences of 
lectures, while the topics of lectures and the transitions between them 
are invisible or latent. Therefore, an HMM is a natural model to 
simulate the generation process of courses and hence infer the latent 
states that contribute to the evolution of these lectures. In a standard HMM, 
each hidden state generates only one symbol or word (see Figure 
1). As our goal is to capture topic transitions using sequences of 
lectures as observed data, we map each lecture to a topic and assume 
each hidden state generates a lecture instead of a word. Our revised 
HMM assumes that each hidden state produces one lecture, where 
each lecture is a bag-of-words. We ignore the sequence of words 
in lectures since the order of the words would not contribute to 
capturing the topic of each lecture. Figure 2 depicts the HMMULM 
utilized to capture the content of MOOCs. 


In order to capture both the lectures’ topics and the transitions between 
them, we combine the mixture model (MULM) with the HMM, 
and we call the new model the Hidden Markov Mixture of Unigram 
Language Models (HMMULM). To do that, we adopt the Markovian 
assumption between topics, where in the generation process 
the choice of the next topic depends only on the current topic. Even 
though the choice of the topic in the course delivery depends on the 
previous topics discussed so far, this simplified assumption makes 
sense due to the locality of reference property [1] of course design. 
Based on this property, when an instructor designs a course, a dependent 
lecture should appear as soon as possible after the prerequisite 
lecture to reduce students' comprehension burden. Therefore, assuming 
the dependency between adjacent lectures not only simplifies 
the model but also aids in capturing the transitions between highly 
related topics. By combining the HMM with the mixture model, the 
likelihood of generating a course is as follows: 


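$$
\begin{aligned}
P(X \mid \lambda) &= \sum_{\text{all } Z} P(Z \mid \lambda)\, P(X \mid Z, \lambda) \\
&= \sum_{\text{all } Z} P(z_1)\, P(x_1 \mid z_1) \prod_{t=2}^{T} P(z_t \mid z_{t-1})\, P(x_t \mid z_t) \\
&= \sum_{\text{all } Z} P(z_1) \prod_{w \in V} P(w \mid z_1)^{c(w, x_1)} \prod_{t=2}^{T} P(z_t \mid z_{t-1}) \prod_{w \in V} P(w \mid z_t)^{c(w, x_t)} \\
&= \sum_{i=1}^{M} \sum_{j=1}^{M} \pi(z_1 = s_i) \prod_{w \in V} B(z_1 = s_i, w)^{c(w, x_1)} \prod_{t=2}^{T} A(z_{t-1} = s_i, z_t = s_j) \prod_{w \in V} B(z_t = s_j, w)^{c(w, x_t)}
\end{aligned}
\tag{5}
$$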
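
Summing over all hidden topic sequences in equation 5 is usually done with the forward recursion rather than by enumeration. The following is a minimal log-space sketch under that standard technique, with each lecture given as a dictionary of word counts; the function names and the toy parameters are illustrative assumptions.

```python
import math


def safe_log(p):
    return math.log(p) if p > 0 else float("-inf")


def log_sum_exp(values):
    m = max(values)
    if m == float("-inf"):
        return m
    return m + math.log(sum(math.exp(v - m) for v in values))


def log_emission(doc, word_probs):
    """log P(x_t | z_t): each word contributes count * log B[z][w]."""
    return sum(count * safe_log(word_probs[w]) for w, count in doc.items())


def course_log_likelihood(course, pi, A, B):
    """log P(X | pi, A, B) for one course, via the forward algorithm."""
    M = len(pi)
    alpha = [safe_log(pi[i]) + log_emission(course[0], B[i]) for i in range(M)]
    for doc in course[1:]:
        alpha = [
            log_sum_exp([alpha[i] + safe_log(A[i][j]) for i in range(M)])
            + log_emission(doc, B[j])
            for j in range(M)
        ]
    return log_sum_exp(alpha)


# Toy model: two topics, two words, a course of two one-word lectures.
pi = [0.6, 0.4]
A = [[0.7, 0.3], [0.2, 0.8]]
B = [[0.9, 0.1], [0.2, 0.8]]
log_lik = course_log_likelihood([{0: 1}, {1: 1}], pi, A, B)
```

Working in log space avoids underflow, which matters here because a whole lecture's word counts enter each emission term.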


To estimate the HMMULM parameters $\lambda = (\pi, A, B)$, we use a 
modified version of the Baum-Welch algorithm in order to model the 
observation sequences as multidimensional categorical events. 
Following the work [19], we derived the equations of the E-step and 
M-step to train the model and infer the transition probability between 
topics. In the E-step, we use the equations: 
Figure 1: The graphical model of HMM. 

Figure 2: The graphical model of HMMULM used to 
model the content of MOOCs. [Diagram labels: transition probability, sequence of states, emission probability, observation, time.] 
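In the M-step, the following equations are used to choose the parameters 
that maximize the likelihood of the observed sequence of 
lectures, where $\gamma_t(i)$ and $\xi_t(i, j)$ are the state and transition 
posteriors computed in the E-step and $c(w, x_t)$ is the count of word $w$ in lecture $x_t$: 

$$ \pi(i) = \gamma_1(i) \tag{8} $$

$$ A_{ij} = \frac{\sum_{t=1}^{T-1} \xi_t(i, j)}{\sum_{t=1}^{T-1} \gamma_t(i)} \tag{9} $$

$$ B_i(w) = \frac{\sum_{t=1}^{T} \gamma_t(i)\, c(w, x_t)}{\sum_{v \in V} \sum_{t=1}^{T} \gamma_t(i)\, c(v, x_t)} \tag{10} $$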


For more information about the Baum-Welch algorithm, please see [20]. 
The E-step and M-step equations are very similar 
to those of the standard HMM, except that instead of emitting one symbol, 
HMMULM emits one lecture represented as a bag-of-words. 


5.4 Structural Topic Model 

The Structural Topic Model (strTM) [25] is another probabilistic 
graphical model that functions very similarly to HMMULM. It has 
been used to model the latent topical structures inside documents. 
Like HMMULM, it models topics and their transitions as hidden 
states that emit lectures as bags-of-words. Unlike HMMULM, strTM 
treats each lecture as a mixture of content topics and a functional 
topic. The functional topic, denoted by $z_B$, is used to filter out document-independent 
words that model the corpus background (or general 
terms) [31]. Each word in the lecture is generated either by one of 
the content topics or by the functional topic: 

$$ w \sim \theta\, P(w \mid \beta, z) + (1 - \theta)\, P(w \mid \beta, z_B) \tag{11} $$

where $\theta$ is the controlling parameter. According to strTM, the 
probability of lecture $x_j$ being generated by some topic $z_i$ is: 

$$ P(x_j \mid z_i) = \prod_{w \in V} \big(\theta\, P(w \mid \beta, z_i) + (1 - \theta)\, P(w \mid \beta, z_B)\big)^{c(w, x_j)} \tag{12} $$

Table 1: The dataset utilized in the experiment. 

Domain | # of Courses | # of Lectures | Avg # of Lectures 
Python | 21           | 460           | 22 
SQL    | 15           | 247           | 16 
ML     | 10           | 99            | 10 


Another difference between strTM and HMMULM is that strTM 
assumes the transition probabilities $A$ and the emission probabilities 
$B$ are drawn from Multinomial distributions and uses the conjugate 
Dirichlet distribution to impose a prior on the Multinomial 
distributions: 

$$ a_z \sim \mathrm{Dir}(\eta) \tag{13} $$

$$ \beta_z \sim \mathrm{Dir}(\gamma) \tag{14} $$

where $\eta$ and $\gamma$ are the concentration hyperparameters that control the 
sparsity of $a_z$ and $\beta_z$, respectively. 


To estimate the parameters of strTM, we use the expectation- 
maximization algorithm as described by [25]. For more information 
about strTM, please refer to [25]. 
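
As an illustration of the lecture-level emission in equation 12, the sketch below mixes a content-topic word distribution with the functional (background) distribution using the controlling parameter θ. The function name and the toy distributions are assumptions for illustration; this is not the inference procedure of [25].

```python
import math


def strtm_log_emission(doc, content_probs, background_probs, theta):
    """log P(x | z) when each word is drawn from
    theta * P(w | content topic) + (1 - theta) * P(w | functional topic).

    doc: {word_id: count}; both distributions are lists indexed by word_id.
    """
    return sum(
        count * math.log(theta * content_probs[w]
                         + (1.0 - theta) * background_probs[w])
        for w, count in doc.items()
    )


doc = {0: 2, 1: 1}
content = [0.8, 0.2]       # toy content-topic word distribution
background = [0.5, 0.5]    # toy functional-topic word distribution
log_p = strtm_log_emission(doc, content, background, theta=0.2)
```

With θ = 1 the expression reduces to the plain unigram emission used by HMMULM, which makes the relationship between the two models easy to see.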


6. EVALUATION 


In this section, we first describe our dataset and the parameter 
settings. Second, we compare different models by studying the impact 
of topic transitions learned from various models on three lecture 
sequencing tasks. Finally, we qualitatively evaluate the topics and 
their transitions. 


6.1 Dataset and Parameters Settings 

We collected our dataset from real online courses on various 
MOOC platforms in three different domains: Python, Structured 
Query Language (SQL), and Machine Learning clustering algorithms 
(ML). Table 1 presents the statistics of the dataset. We use 
75% of the data as a training set and 25% as a test set. We chose the 
number of topics in each domain by manually inspecting the dataset. 
The number of topics for Python, 
SQL, and ML was set to 13, 10, and 9, respectively. 


Each course in the dataset is represented as a sequence of lecture video 
transcripts. We preprocess lecture transcripts by eliminating stop 
words and some rare terms. After cleaning the data, we constructed 
the bag-of-words vector representations of all lectures. We only use 
lecture transcripts to represent lectures; therefore, we only need to 
set two thresholds ($K_1$ and $K_2$) of the PCK-Means method in order 
to select the lists of Must-Link and Cannot-Link constraints. Since 
we do not have labeled data, we chose the thresholds that maximize 
the Silhouette Coefficient clustering measure on the training data. 
We set $K_1 = 0.55$ and $K_2 = 0.004$ for Python, $K_1 = 0.8$ and 
$K_2 = 0.01$ for SQL, and $K_1 = 0.55$ and $K_2 = 0.01$ for ML. To 
set the hyperparameters of the strTM method, we used a grid search 
and chose the values that maximize the likelihood of the training 
data. We set $\theta = 0.2$, $\gamma = 0.3$, and $\eta = 0.6$ for Python, $\theta = 0.1$, 
$\gamma = 0.3$, and $\eta = 0.1$ for SQL, and $\theta = 0.1$, $\gamma = 0.1$, and $\eta = 0.6$ 
for ML. 

Table 2: The performance of Task 1: Finding the correct sequence using the permutation method. It is clear that PCK-Means 
achieves the highest performance. 

Dataset | Measure        | Cosine | PCK-Means | MULM | HMMULM | strTM 
Python  | Kendall's τ(σ) | 0.60   | 0.73      | 0.49 | 0.66   | 0.54 
Python  | Dir-P          | 0.37   | 0.50      | 0.43 | 0.28   | 0.31 
Python  | Undir-P        | 0.52   | 0.56      | 0.50 | 0.44   | 0.35 
Python  | Lev-Sim        | 0.60   | 0.63      | 0.45 | 0.49   | 0.50 
SQL     | Kendall's τ(σ) | 0.58   | 0.59      | 0.58 | 0.55   | 0.41 
SQL     | Dir-P          | 0.46   | 0.43      | 0.40 | 0.36   | [illegible] 
SQL     | Undir-P        | 0.67   | 0.58      | 0.49 | 0.44   | 0.41 
SQL     | Lev-Sim        | 0.53   | 0.57      | 0.54 | 0.49   | 0.37 
ML      | Kendall's τ(σ) | 0.68   | 0.75      | 0.63 | 0.58   | 0.64 
ML      | Dir-P          | 0.34   | 0.34      | 0.44 | 0.34   | 0.43 
ML      | Undir-P        | 0.52   | 0.52      | 0.57 | 0.41   | 0.53 
ML      | Lev-Sim        | 0.61   | 0.65      | 0.55 | 0.43   | 0.53 


6.2 Sequencing Tasks 

In this experiment, our goal is to compare the topic transitions 
modeled by different methods in three tasks: 1) finding the correct 
sequence of lectures, 2) predicting the next lecture given a sequence 
of lectures, and 3) predicting the sequence of a list of lectures 
where the first lecture in the sequence is given. An example of a real 
application for task 1 and task 3 is designing a new course plan by 
sequencing lectures before delivering them to students. However, 
task 1 and task 3 exploit two different techniques to find the sequence. 
In contrast, task 2 can be applied to recommend the next lecture to 
learners to customize their learning based on the history of lectures 
they have already watched. In the evaluation, the purpose of each task is to 
compare different methods and evaluate the ability of the parameters 
($A$, $B$, and $\pi$) of each model to find the correct sequence in the three 
different tasks. 


6.2.1 Evaluation Measures 

To compare different models, we use the sequences of lectures 
from courses in the test set as the ground truth sequences and 
exploit different measures to do the evaluation. First, we follow 
Wang et al. [25] and use Kendall's $\tau(\sigma)$, an 
information retrieval measure that captures the correlation between 
two ranked lists. It indicates how the predicted order differs from 
the ground truth, where 1 means a perfect match, -1 means a total 
mismatch, and 0 indicates that the two orders are independent. 
Second, we use the Levenshtein normalized similarity, which is the 
complement of the Levenshtein normalized distance that measures the 
minimum number of edits (insertions, deletions, or substitutions) 
required to transform the predicted sequence into the ground truth 
sequence. The goal is to find the sequence whose Levenshtein 
normalized similarity is close to 1, which indicates that the number 
of edits required is minimal. Third, we utilize the directed bigram 
precision (see equation 15), which captures the correctness of the order 
between adjacent lectures. The intuition behind using this measure is 
to evaluate whether the transition maps learned by different models 
are able to capture the correct direction between topics 
and adjacent lectures. Finally, we use the undirected bigram precision 
shown in equation 16 to measure whether the transition map of each 
model can recognize adjacent lectures even when it incorrectly captures the 
direction between topics. 


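$$ P_{\text{dir-bigram}} = \frac{\#\ \text{of correct}\ (a \rightarrow b)\ \text{in estimated sequence}}{\#\ \text{of correct}\ (a \rightarrow b)\ \text{in ground truth}} \tag{15} $$

$$ P_{\text{undir-bigram}} = \frac{\#\ \text{of correct}\ \{a, b\}\ \text{in estimated sequence}}{\#\ \text{of correct}\ \{a, b\}\ \text{in ground truth}} \tag{16} $$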
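
The four measures can be sketched as follows for sequences of distinct lecture identifiers. Two choices here are assumptions rather than details stated in the text: the Levenshtein distance is normalized by the length of the longer sequence, and the bigram precisions divide the number of adjacent pairs shared with the ground truth by the number of adjacent pairs in the ground truth.

```python
def kendall_tau(predicted, truth):
    """Kendall's tau between two orderings of the same distinct items."""
    pos = {item: i for i, item in enumerate(predicted)}
    n = len(truth)
    concordant = discordant = 0
    for i in range(n):
        for j in range(i + 1, n):
            if pos[truth[i]] < pos[truth[j]]:
                concordant += 1
            else:
                discordant += 1
    pairs = n * (n - 1) / 2
    return (concordant - discordant) / pairs if pairs else 1.0


def levenshtein_similarity(predicted, truth):
    """1 - edit distance / length of the longer sequence (assumed normalization)."""
    prev = list(range(len(truth) + 1))
    for i, p in enumerate(predicted, start=1):
        cur = [i]
        for j, t in enumerate(truth, start=1):
            cur.append(min(prev[j] + 1,              # deletion
                           cur[j - 1] + 1,           # insertion
                           prev[j - 1] + (p != t)))  # substitution or match
        prev = cur
    longest = max(len(predicted), len(truth))
    return 1.0 - prev[-1] / longest if longest else 1.0


def bigram_precision(predicted, truth, directed=True):
    """Share of ground-truth adjacent pairs that also appear in the prediction."""
    def bigrams(seq):
        return {(a, b) if directed else frozenset((a, b))
                for a, b in zip(seq, seq[1:])}
    truth_pairs = bigrams(truth)
    if not truth_pairs:
        return 0.0
    return len(bigrams(predicted) & truth_pairs) / len(truth_pairs)


truth = ["a", "b", "c", "d"]
predicted = ["a", "c", "b", "d"]
tau = kendall_tau(predicted, truth)
lev = levenshtein_similarity(predicted, truth)
dir_p = bigram_precision(predicted, truth, directed=True)
undir_p = bigram_precision(predicted, truth, directed=False)
```

In this toy case the prediction swaps two neighbouring lectures: the rank correlation stays fairly high, no directed adjacent pair survives, and only the swapped pair counts for the undirected precision.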


6.2.2 Task 1: Finding The Correct Sequence 

To find the correct sequence of lectures, we follow the permutation 
method utilized by [25]. For courses that have a large number of 
lectures, it is infeasible to enumerate all the orderings of lectures. Therefore, 
when the number of permutations exceeds 500, we randomly 
sample 500 possible orderings of lectures as candidates. We ran 
the experiment 20 times for each method and recorded the average 
results. 
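
The candidate generation step can be sketched as below. Whether the sampled orderings are required to be distinct is not stated in the text; this sketch assumes they are, and the function name and seed parameter are hypothetical.

```python
import itertools
import math
import random


def candidate_orderings(lectures, max_candidates=500, seed=0):
    """Return all orderings if there are at most `max_candidates`,
    otherwise a random sample of distinct orderings."""
    n = len(lectures)
    if math.factorial(n) <= max_candidates:
        return [list(p) for p in itertools.permutations(lectures)]
    rng = random.Random(seed)
    seen = set()
    while len(seen) < max_candidates:
        # rng.sample with k == n returns a random permutation of the lectures.
        seen.add(tuple(rng.sample(lectures, n)))
    return [list(p) for p in seen]


small = candidate_orderings(list(range(4)))    # 4! = 24 orderings, all kept
large = candidate_orderings(list(range(10)))   # 10! orderings, 500 sampled
```

Each model then scores every candidate ordering and keeps the best one, so the quality of the result is bounded by whether a good ordering happens to be among the sampled candidates.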


In order to select the optimal sequence from the list of permutations 
in strTM and HMMULM, we follow Wang et al. [25] and choose the 
sequence that has the highest generation probability, calculated as: 

$$ \hat{\sigma}(m) = \arg\max_{\sigma(m)} \sum_{Z} P(x_{\sigma[0]}, x_{\sigma[1]}, \ldots, x_{\sigma[m]}, Z \mid \lambda) \tag{17} $$


To choose the best sequence for MULM, we first find the best topic 
$c$ that generates each lecture in the test set according to equation 
4. After that, we select the sequence that has the highest likelihood 
based on the equation: 

$$ \hat{\sigma}(m) = \arg\max_{\sigma(m)} P(C)\, P(X \mid C) = \arg\max_{\sigma(m)} P(c_1)\, P(x_1 \mid c_1) \prod_{i=2}^{|\sigma(m)|} P(c_i \mid c_{i-1})\, P(x_i \mid c_i) \tag{18} $$

Figure 3: The performance of different methods in Task 3: Predicting the whole sequence. All methods have comparable performance. 


Since PCK-Means is a clustering method that minimizes the distance 
between lectures and clusters’ centroids, we assign each lecture $x$ of the 
test set to the closest cluster $z_i$ using the Euclidean distance, as shown 
in equation 19. Then, we select the sequence that maximizes the topic 
transitions between lectures in the sequence and minimizes the 
distance between adjacent lectures (see equation 20). The intuition 
behind this is to ensure topic coherence between adjacent lectures 
and to reduce the gaps by minimizing the distance between them. 

$$ c(x) = \arg\min_{z_i} \| x - \mu_{z_i} \| \tag{19} $$

$$ \hat{\sigma}(m) = \arg\max_{\sigma(m)} \; \pi(c(x_1)) \sum_{j=2}^{|\sigma(m)|} A(x_{j-1}, x_j) - \| x_j - x_{j-1} \|^2 \tag{20} $$


As a baseline, we accumulate the cosine similarity between adjacent 
lectures in the sequence and select the sequence in the permutations 
that has the highest similarity score to be the optimal sequence. 
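
The baseline can be sketched as follows, with each lecture given as a dictionary of word counts. The function names and the toy lectures are illustrative assumptions.

```python
import math


def cosine(u, v):
    """Cosine similarity between two sparse bag-of-words vectors."""
    dot = sum(count * v.get(w, 0) for w, count in u.items())
    norm_u = math.sqrt(sum(c * c for c in u.values()))
    norm_v = math.sqrt(sum(c * c for c in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def best_by_cosine(candidates):
    """Pick the ordering with the largest summed similarity of adjacent lectures."""
    def score(seq):
        return sum(cosine(a, b) for a, b in zip(seq, seq[1:]))
    return max(candidates, key=score)


a, b, c = {"x": 1}, {"x": 1, "y": 1}, {"y": 1}
best = best_by_cosine([[a, c, b], [a, b, c], [b, a, c]])
```

Because cosine similarity is symmetric, this baseline cannot distinguish a sequence from its reverse, which is one reason it serves only as a reference point for the transition-based models.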


Table 2 summarizes the results of Task 1 for each method. 
PCK-Means has the highest score in Kendall's $\tau(\sigma)$ and 
Levenshtein normalized similarity in all datasets, which indicates that 
PCK-Means has chosen the sequences that are highly correlated with the 
ground truth sequences and need the fewest edits to be transformed 
into the ground truth sequences. However, PCK-Means only outperforms 
other models in the directed and undirected bigram precision in the 
Python dataset, indicating that it is sometimes not able to capture the 
order of adjacent lectures. 

Table 3: The performance of Task 2: Predicting the next 
lecture. It is clear that HMMULM achieves the highest 
performance. 

Method            | Accuracy (Python) | Accuracy (SQL) | Accuracy (ML) 
Cosine-Similarity | 0.46              | 0.56           | 0.42 
PCK-Means         | 0.45              | 0.49           | 0.47 
MULM              | 0.41              | 0.34           | 0.37 
HMMULM (Viterbi)  | 0.52              | 0.56           | 0.60 
strTM (Viterbi)   | 0.39              | 0.27           | 0.43 


In general, PCK-Means achieves the highest performance 
in most measures and in almost all the datasets. We think that 
combining the topic transitions with the Euclidean distance helps 
PCK-Means in finding the best sequence from the list of possible 
sequences. 


6.2.3 Task 2: Predicting The Next Lecture 

In task 2, each model predicts the next lecture given a sequence of 
lectures. We varied the length of the given sequence starting from 
one. As strTM and HMMULM are based on HMM, we utilized the 
Viterbi algorithm [20] to find the most probable sequence of hidden 
states or topics that generated the lectures in the given sequence. 
Then we greedily choose the most probable next lecture in the sequence 
according to the equation: 

$$ \hat{x} = \arg\max_{x} P(z_i \mid z_{i-1})\, P(x \mid z_i) \tag{21} $$

Figure 4: Qualitative Analysis of Sequencing Task 3. (a) Examples of preferring self transition behaviour when selecting next lecture 
in the sequence, (b) Examples of the problem of sequencing adjacent lectures that cover the same topics. 


Similar to task 1, for the MULM and PCK-Means models we assign 
lectures to the best clusters using equations 4 and 19, respectively. 
After that, MULM greedily chooses the next lecture that maximizes 
equation 21. On the other hand, the PCK-Means model selects the 
next lecture that maximizes the topic transition and minimizes the 
distance to the last lecture in the given sequence. For the baseline, 
we use the cosine similarity, choosing the next lecture that 
has the highest similarity score with the last lecture in the given 
sequence. 
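
For the HMM-based models, the two steps (decoding the given history with Viterbi, then greedily scoring each candidate lecture with equation 21) can be sketched as below. Maximizing jointly over the next topic and the candidate lecture is an assumption about how equation 21 is applied, and the function names and toy parameters are illustrative.

```python
import math


def safe_log(p):
    return math.log(p) if p > 0 else float("-inf")


def log_emission(doc, word_probs):
    return sum(count * safe_log(word_probs[w]) for w, count in doc.items())


def viterbi(course, pi, A, B):
    """Most probable topic sequence for a list of bag-of-words lectures."""
    M = len(pi)
    score = [safe_log(pi[i]) + log_emission(course[0], B[i]) for i in range(M)]
    back = []
    for doc in course[1:]:
        step, new_score = [], []
        for j in range(M):
            best_i = max(range(M), key=lambda i: score[i] + safe_log(A[i][j]))
            step.append(best_i)
            new_score.append(score[best_i] + safe_log(A[best_i][j])
                             + log_emission(doc, B[j]))
        back.append(step)
        score = new_score
    state = max(range(M), key=score.__getitem__)
    path = [state]
    for step in reversed(back):
        state = step[state]
        path.append(state)
    return path[::-1]


def next_lecture(history, candidates, pi, A, B):
    """Index of the candidate maximizing P(z_i | z_{i-1}) * P(x | z_i)."""
    last_topic = viterbi(history, pi, A, B)[-1]

    def score(doc):
        return max(safe_log(A[last_topic][j]) + log_emission(doc, B[j])
                   for j in range(len(pi)))

    return max(range(len(candidates)), key=lambda k: score(candidates[k]))


pi = [0.6, 0.4]
A = [[0.1, 0.9], [0.9, 0.1]]   # toy model in which topics tend to alternate
B = [[0.9, 0.1], [0.1, 0.9]]
path = viterbi([{0: 3}, {1: 3}, {0: 3}], pi, A, B)
choice = next_lecture([{0: 3}], [{0: 2}, {1: 2}], pi, A, B)
```

In the toy model the history ends in topic 0, transitions favour moving to topic 1, and so the candidate dominated by word 1 is selected.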


Table 3 summarizes the results of Task 2 for each method. 
HMMULM achieves the highest accuracy in all 
datasets, tying with the cosine baseline on SQL. Using the Viterbi algorithm along with the learned topic 
transitions helps in capturing the most probable hidden states or 
topics that generate the given sequence of lectures. In addition, 
the topic transitions learned by HMMULM help in greedily picking 
the next lecture in the sequence. While strTM also uses the Viterbi 
algorithm, its accuracy scores were far lower 
than those of HMMULM. We think the main reason is the 
quality of the learned topic transitions, as we explain in section 
6.3. 


6.2.4 Task 3: Predicting The Sequence 


Task 3 is very similar to task 2 except that each method needs to find 
the whole sequence of given lectures where the first lecture in the 
sequence is given. Figure 3 depicts the results of Task 3. 


As this task is the most challenging of the three, there is
no clear winning method. However, from the upper-left graph in
Figure 3, which reports Kendall's tau, we can see that HMMULM
achieved a tau score > 0.50 in four courses, PCK-Means
achieved the same score in only two courses, MULM and strTM
in only one course, and the Cosine method in no courses. For the
normalized Levenshtein similarity, all methods have
comparable results. For the directed and undirected bigram precision,
all methods also have comparable results except MULM. The reason
is that MULM sometimes cannot complete the whole sequence:
it relies only on the greedy method, which cannot complete the
sequence when the topic transitions required
to sequence courses in the test set are absent. The other methods
always find the whole sequence, either because of the Viterbi
algorithm used by HMMULM and strTM or because of the similarity or
distance measures used by the PCK-Means and Cosine methods.
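To make two of the reported metrics concrete, the following sketch computes Kendall's tau and bigram precision for a predicted lecture order. The exact definitions used in the evaluation are not given in this section, so the code assumes the common tau-a form (no ties) and defines bigram precision as the share of adjacent predicted pairs that are also adjacent in the true order. All names are hypothetical.

```python
from itertools import combinations

def kendall_tau(true_order, pred_order):
    """Kendall's tau (tau-a) between two orderings of the same items."""
    pos = {item: i for i, item in enumerate(pred_order)}
    concordant = discordant = 0
    for a, b in combinations(true_order, 2):  # a precedes b in the true order
        if pos[a] < pos[b]:
            concordant += 1
        else:
            discordant += 1
    n_pairs = concordant + discordant
    return (concordant - discordant) / n_pairs if n_pairs else 1.0

def bigram_precision(true_order, pred_order, directed=True):
    """Share of adjacent predicted pairs that are adjacent in the true order."""
    true_bigrams = set(zip(true_order, true_order[1:]))
    if not directed:
        # Undirected variant: a pair counts in either direction.
        true_bigrams |= {(b, a) for a, b in true_bigrams}
    pred_bigrams = list(zip(pred_order, pred_order[1:]))
    if not pred_bigrams:
        return 0.0
    hits = sum(1 for bg in pred_bigrams if bg in true_bigrams)
    return hits / len(pred_bigrams)

truth = ["l1", "l2", "l3", "l4"]
pred = ["l1", "l3", "l2", "l4"]  # two neighbouring lectures swapped
tau = kendall_tau(truth, pred)
directed = bigram_precision(truth, pred, directed=True)
undirected = bigram_precision(truth, pred, directed=False)
```

A single swap of neighbouring lectures leaves tau fairly high while the directed bigram precision drops to zero, which illustrates why the two metrics can rank methods differently.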


In addition to quantitatively comparing the methods, we
qualitatively evaluate the results by examining the sequences generated by
each method. In general, we found two common behaviours shared
by all methods.


First, in most cases almost all the methods prefer self transition when 
they pick the next lecture in the sequence. For example, as shown in 
Figure 4 (a), MULM, HMMULM, and PCK-Means select the next 
lecture that has the same topic as the current lecture. 


Second, no method can sequence lectures that belong to the same
topic. In MOOCs, due to the short length of lectures, instructors
sometimes explain the same topic using multiple lectures. As a result,
it is hard to find the correct sequence of lectures that cover the same
topic. For example, as shown in Figure 4 (b), the last four lectures
of the course explain the “Principal Component Analysis algorithm”
and hence strTM, HMMULM, and PCK-Means cannot predict the
correct sequence of these lectures. In this case, we need to use other
techniques to predict the sequence. One naive solution is to treat
all adjacent lectures that belong to the same topic as one atomic
unit, so that we only need to sequence lectures that have different topics.


Further investigation of solving the sequencing problem of lectures
that belong to the same topic is left for future work.


Figure 5: Python topics and their transitions using PCK-Means
method. The table represents the top terms in each topic.
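The naive solution mentioned above, treating adjacent lectures of the same topic as one atomic unit, could be sketched as follows. The lecture ids and topic labels are invented for illustration, and the topic assignment is assumed to be given by one of the models.

```python
from itertools import groupby

def merge_same_topic(lectures, topics):
    """Collapse runs of adjacent lectures sharing a topic into one unit.

    Returns a list of (topic, [lecture ids]) pairs in course order.
    """
    paired = zip(lectures, topics)
    return [
        (topic, [lec for lec, _ in run])
        for topic, run in groupby(paired, key=lambda p: p[1])
    ]

# Hypothetical course: three consecutive lectures cover the same topic.
lectures = ["intro", "pca_1", "pca_2", "pca_3", "svm"]
topics = ["basics", "pca", "pca", "pca", "svm"]
units = merge_same_topic(lectures, topics)
```

After merging, only three units remain to be ordered, and the internal order of the merged lectures is no longer part of the sequencing problem.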


6.3 Topic Transitions Examples 

In this section, we present examples of topics and topic transitions
learned by the different methods. Due to space constraints, we present
examples using the Python dataset only. We analyze the words with the
highest probabilities in the word distributions of the topics learned by
each method and manually map them to topic words or phrases.


For instance, if the word distribution has the words list, range,
items, index, and append, then it is clear that this word distribution
captures the topic “List”. The word distributions with the topic phrases
of each topic learned by the PCK-Means, MULM, HMMULM, and
strTM methods on the Python dataset are depicted in Figures 5, 6, 7, and
8, respectively. Since we have 13 topics in the Python dataset, we
only visualize the topic transitions of a subset of these topics and
depict only the transitions that have scores > 0.05.
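As an illustration of the thresholding step, the sketch below estimates topic transition scores by counting adjacent topic pairs over course sequences and keeps only the edges above a threshold. Counting is an assumption made for brevity; the four methods learn their transitions in their own ways. The topic labels are toy data.

```python
from collections import Counter, defaultdict

def transition_probs(topic_sequences):
    """Row-normalized counts of topic -> next-topic over all courses."""
    counts = defaultdict(Counter)
    for seq in topic_sequences:
        for src, dst in zip(seq, seq[1:]):
            counts[src][dst] += 1
    return {
        src: {dst: n / sum(nxt.values()) for dst, n in nxt.items()}
        for src, nxt in counts.items()
    }

def edges_above(probs, threshold=0.05):
    """Keep only transitions whose score exceeds the threshold."""
    return sorted(
        (src, dst, p)
        for src, nxt in probs.items()
        for dst, p in nxt.items()
        if p > threshold
    )

# Two toy courses, each given as the topic of its lectures in order.
courses = [
    ["String", "String", "List", "Dictionary"],
    ["String", "List", "List", "Dictionary"],
]
probs = transition_probs(courses)
strong_edges = [(s, d) for s, d, _ in edges_above(probs, threshold=0.5)]
```

Self-transitions such as "String" to "String" appear naturally in this representation, since several consecutive lectures may cover the same topic.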


It is clear from the figures that all models extract some useful topics
whose top terms clearly explain the topic. However,
PCK-Means has the best word distributions,
followed by HMMULM and then MULM, while strTM has the
lowest performance. We also notice from the figures that PCK-Means
has extracted 11 useful topics, with two topics that have unclear
word distributions and cannot be mapped to any useful topic. In
contrast, MULM has modeled 10 meaningful topics, with three topics
that form noise and cannot be mapped to any topic. HMMULM and
strTM, on the other hand, capture 9 topics, with four unclear topics that
cannot be mapped to any phrase. In general, this finding indicates
that PCK-Means has the best performance in modeling the topics
of the courses in the Python dataset, as it models more useful topics
with clear word distributions. The results also indicate that strTM
achieves the lowest performance: even though it captures the
same number of meaningful topics as HMMULM, its word
distributions are the least clear.
Topics | Terms
Loop | smallest, break, statement, blah, true, iteration.
Data Types | float, expression, integer, floating, type, int, types, floats, expressions, convert, decimal.
Conditional Statements | true, false, block, else, condition, equal, expression, greater, statement, semame, elif, boolean.
Dictionary | dictionary, key, keys, dictionaries, pairs, counts, keyvalue, list, word, tuple, values.
List Sort | sorted, random, sorting, sort, list, programming, accumulation, language, loop, lost, lesson.
List & String | list, index, object, string, method, methods, class, item, position, strings, items.
Installing Python | file, install, directory, py, command, tool, windows, files, window, click, folder, ipython.
Prog. Language | programming, language, learn, concepts, tasks, scripts, practice, learning, videos, script, feel.
Not Clear | turtle, floor, api, random, comments, calculations, module, program, errors, stack, machine.
Not Clear | accumulator, list, guess, state, root, function, square, cube, dictionary, count, total.
Not Clear | list, tuple, function, element, range, loop, global, recursive, tuples, sequence, parameter.


Figure 6: Python topics and their transitions using MULM
method. The table represents the top terms in each topic.
Topics | Terms
String | string, index, character, strings, quotes, [illegible], substring, characters, preview, location, slice.
Function | function, float, type, expression, return, parameter, parameters, integer, int, parentheses, string.
File | file, csv, files, phones, row, format, open, column, read, table, directory.
Loop & Condition | loop, true, false, while, condition, loops, equal, largest, block, infinite, else.
Dictionary | dictionary, list, accumulator, key, item, [illegible], count, counts, loop, state, num.
List | list, object, append, lists, items, position, [illegible], reference, strings, mutate, bound.
Dict, List, Tuple & Sort | dictionary, list, tuple, key, sorted, function, keys, sort, expression, item, sorting.
Classes | class, attributes, methods, instance, dictionary, method, turtle, classes, module, objects, concepts.
Prog. Language | language, programming, learn, script, data, learning, syntax, tasks, concepts, languages, scripts.
Not Clear | comments, calculations, floor, input, algorithms, computers, warm, web, city, hot, fast.
Not Clear | seconds, block, traceback, hours, blah, turtle, [illegible], code, runs, times, execute.
Not Clear | teaching, eric, code, coll, grade, install, button, messages, errors, error, month.
Not Clear | kelvin, temperature, hours, fahrenheit, equal, program, step, error, loop, celsius, indent.


Figure 7: Python topics and their transitions using HMMULM
method. The table represents the top terms in each topic.


Topics | Terms
String | list, string, if, position, index, python, strings, quotes, character, number, single, type.
Function | function, return, code, parameter, expression, square, inside, functions, print, parameters, statement, arguments.
File | file, csv, files, open, line, read, list, row, lines, data, quotes, table.
Loop & Condition | if, loop, code, while, run, else, statement, block, equal, condition, true.
Dictionary | dictionary, key, list, function, keys, dictionaries, sorted, lists, sort, tuple, values.
List | list, lists, element, item, index, loop, function, if, count, strings, file.
Conditional Expression | true, if, false, else, equal, expression, [illegible], code, boolean, block, greater, expressions, operators.
List & Dict | list, character, dictionary, item, loop, accumulator, variable, number, count, index.
Classes | list, class, tuple, function, method, set, data, [illegible], index, methods, object.
Not Clear | python, variable, string, type, if, integer, point, variables, code, floating, kind.
Not Clear | if, loop, guess, root, machine, start, simply, number, answer, times, cube.
Not Clear | loop, if, true, python, statement, largest, code, false, run, smallest, variable.
Not Clear | data, dictionary, language, kelvin, keys, programming, table, row, column, dictionaries, columns.


Figure 8: Python topics and their transitions using strTM
method. The table represents the top terms in each topic.
As shown in Figures 5, 6, 7, and 8, the meaningful topics extracted
by all methods are very similar, with some variations. For example,
while PCK-Means and MULM separate “Files” and “CSV Files”,
HMMULM and strTM combine them into one topic. In addition,
while PCK-Means and HMMULM combine “Loops & Condition”,
MULM differentiates between them. strTM, on the other hand, has
both “Loops & Condition” and “Conditional Statements.”


Regarding the topic transitions, it is clear that all models capture
self-transitions, i.e., transitions from a topic to itself. This indicates that, in MOOCs,
instructors use multiple lectures to explain the same topic. However,
HMMULM gives higher probability to self-transitions than the
other methods. We can notice from the figures that there is some
consensus between all methods on some transitions between topics,
such as “List” → “Dictionary” and “String” → “File.” There are also
some variations in topic transitions between different models. For
instance, while PCK-Means, HMMULM, and strTM have a transition
“String” → “List”, MULM combines these two topics into
one topic or cluster. Another variation is that PCK-Means, strTM, and
MULM have a transition “Loop & Condition” → “String”, whereas
HMMULM misses this transition.


In general, all methods capture useful topics with clear word
distributions. Regarding the topic transitions, all methods capture
self-transitions and also reach some consensus on some transitions. There
are also some variations between methods, and these differences are due
to how each method identifies the topics of each lecture. Improving the
modeling of topics and the mapping between lectures and topics
would clearly improve the quality of the topic transition maps.


7. CONCLUSION 


In this paper, we introduce the Topic Transition Map, a
general structure that models the content of MOOCs as topics, where
each lecture is mapped to a topic, and captures the transitions between
topics. It models the various ways in which instructors organize topics
in order to construct the study plan of their courses. We investigate
four different methods to construct the Topic Transition Map:
PCK-Means, MULM, HMMULM, and strTM. PCK-Means and MULM
separately cluster lectures into topics and then learn the transitions
between topics by leveraging the sequences of lectures in different
courses. In contrast, HMMULM and strTM assume a first-order Markov
property among latent topics and hence jointly learn topics and their
transitions. While three of the models, MULM, HMMULM, and strTM,
are probabilistic models, PCK-Means is a distance-based clustering
algorithm that incorporates some constraints to guide the clustering
process.


We evaluated the topic transitions generated by the various methods
using three different tasks: 1) determining the correct sequence,
2) predicting the next lecture, and 3) predicting the sequence of
lectures. Our evaluation revealed that PCK-Means achieves the
highest performance in determining the correct sequence, while
HMMULM outperforms the other methods in the task of predicting
the next lecture. Since the task of predicting the whole sequence
of lectures is the most challenging task, there was no
winning method and all methods have comparable performance, with
MULM having the lowest performance as it sometimes fails to predict
the whole sequence. We also visualized the Topic Transition Maps
generated by the different methods to qualitatively evaluate the resulting
maps. We found that PCK-Means extracted more meaningful
topics with the best word distributions that clearly explain each topic.


In the future, we plan to explore combining the Topic Transition
Map with concept dependency relations and examine whether this can
solve the problem of sequencing lectures that belong to the same
topic. Further, we aim to combine different methods, such as
PCK-Means and HMMULM, in order to improve the accuracy of the
Topic Transition Map and hence improve the performance on
the sequencing tasks. Finally, we plan to apply our work to other
domains, such as traditional university courses or educational books.
To do that, we need to investigate how to divide long lectures or book
sections into segments where each segment is mapped to one topic.
ABSTRACT


Early predictors of student success are becoming a key tool 
in flipped and online courses to ensure that no student is left 
behind along course activities. However, with an increased 
interest in this area, it has become hard to keep track of what 
the state of the art in early success prediction is. Moreover, 
prior work on early success prediction based on clickstreams 
has mostly focused on implementing features and models for 
a specific online course (e.g., a MOOC). It remains there- 
fore under-explored how different features and models enable 
early predictions, based on the domain, structure, and edu- 
cational setting of a given course. In this paper, we report 
the results of a systematic analysis of early success predic- 
tors for both flipped and online courses. In the first part, we 
focus on a specific flipped course. Specifically, we investigate 
eight feature sets, presented at top-level educational venues 
over the last few years, and a novel feature set proposed in 
this paper and tailored to this setting. We benchmark the 
performance of these feature sets using a RF classifier, and 
we provide and discuss an ensemble feature set optimized for 
the target flipped course. In the second part, we extend our 
analysis to courses with different educational settings (i.e., 
MOOCs), domains, and structure. Our results show that 
(i) the ensemble of optimal features varies depending on the 
course setting and structure, and (ii) the predictive perfor- 
mance of the optimal ensemble feature set highly depends 
on the course activities. 


Keywords 
Flipped Classroom, MOOC, Success Prediction, Early Warn- 
ing, Clickstream, At-Risk Students, Learning Analytics. 


1. INTRODUCTION 


An increasing number of universities are now running blended
courses that combine traditional lectures with online instruction,
providing educational models tailored to the needs of
our society [20]. A popular instructional strategy to enable
blended learning is the flipped classroom, where
students complete pre-class activities before attending
face-to-face lessons [18]. Recent studies have shown the positive
impact and dependency of this strategy on student-centered
variables such as self-efficacy and self-regulation [6, 19].
Pre-class activities usually consist of watching videos
and completing quizzes that are part of Massive Open Online Courses
(MOOCs) used as supplementary material [27]. Each week,
students are asked to perform these pre-class activities and
then to complete exercises and hold discussions in class.
Pre-class activities are fundamental for the success of flipped
courses [28]. However, students often lack the skills,
time, and motivation to regulate their pre-class activity; as
a consequence, they may experience difficulties in class and
end up failing the course [14]. To ensure that no learner
is left behind, Early Success Predictors (ESPs) are becoming
crucial to support instructors in identifying risk factors for
failing a course and acting upon them in a timely manner.


So far, there are few studies on analyzing student success in
flipped courses based on pre-class activities. For instance,
Jovanovic et al. [8] clustered interaction sequences in
pre-class clickstreams to identify learning strategies, showing
how strategy-based student profiles differ in course grades.
Beatty et al. found that frequency counts of video usage
are often correlated with course grades in flipped classrooms.
In blended, but not flipped, settings, Akpinar et
al. showed that students' strategy counts, with strategies
modelled as clickstream event n-grams, are indicative
of course homework grades. Wan et al. trained gradient
boosting classifiers on an extensive set of clickstream-based
features to identify at-risk students in a small private
online course delivered in hybrid mode. They also analyzed
the importance of the features, finding that the time spent
in online activities and the stability of the time distribution
across weeks have the highest importance in that course. To
the best of our knowledge, no prior work on flipped courses
has specifically focused on ESPs.


Conversely, there is a large body of research on success
prediction for fully online courses (e.g., MOOCs). A multitude
of feature sets have been extracted from clickstreams for
this purpose. Recent work proposed video-counting (e.g.,
number of videos viewed per week, rewinds, fast-forwards,
pauses, and plays, and the fractional and total amount of
time played and paused for videos) and session-based (e.g.,
number of sessions, mean and standard deviation of the time
for all sessions and between sessions) features [4, 23]. These
features were fed into different commonly used classifiers
(e.g., Logistic Regression, Naive Bayes, Decision Tree, RF,
and Neural Networks) to predict success in weekly assignments
or in the entire MOOC. In [3], several features
that measure intra-course, intra-week, and intra-day
regularity in video watching were proposed, and their correlation
with the course grade was shown. Other researchers
leveraged attendance rates, usage rates, and watching ratios
[15]. Specifically, they explored how the difference
in these indicators affects academic performance, showing
that students whose indicators are high are more likely to
graduate on schedule. More fine-grained features on video 
usage (e.g., total video views, mean and standard deviation 
of the proportion of videos watched, re-watched, and inter- 
rupted per week, and the frequency and total number of all 
video actions and of each type of video action) were pro- 
posed in fia]. The authors clustered students according to 
their watching behavior and found that such a behavior is 
representative of course performance. Similarly, Mubarak 
et al. extracted implicit features from video-clickstream 
data, and investigated the extent to which neural networks 
fed with those features can predict weekly students’ perfor- 
mance. For an extensive discussion on success prediction in 
MOOCs, we recommend this survey [5]. 


The above features and classifiers, however, are designed
for fully online learning contexts, such as MOOCs. Despite
clear connections, there are essential aspects that distinguish
flipped courses from MOOCs. First of all, flipped
course data includes relatively few students. A large part of
the learning activity happens offline and cannot be tracked,
leading to data only on course segments. Flipped courses
also generally involve intensive instructor guidance, and
performance in them has a direct impact on the academic portfolio.
As a motivating example, we consider a flipped course
on Linear Algebra, described later in this paper, and the
regularity features proposed for MOOCs in [3]. These features
quantify students' time regularity by considering their activities
over the course (e.g., studying on the same days of
the week). Boroujeni's study revealed that the final grade
in the MOOC is correlated with two intra-week regularity
measures and the periodicity of day hour and week hour
(.46 < c < .7, p < 0.001). Conversely, the same features
showed no correlation with the final grade in the
above flipped course (.0 < c < .1, p < 0.001). Therefore,
it remains unexplored whether existing features and classifiers
for MOOCs generalize to different educational settings
(e.g., flipped classrooms), and to what extent the feature
importance varies according to the topic, structure, and
educational setting of the course.


The contribution of this paper is two-fold: we tackle the
problem of ESPs in flipped classroom settings, and we provide
an extensive analysis and benchmark of classifiers and
features for early success prediction across different types of
courses, namely MOOCs and flipped courses. A schematic
overview of our analysis in this paper is shown in Figure 1.


In a first step, we propose a novel feature set for early suc- 
cess prediction in flipped courses. Our feature set mea- 
sures students’ alignment, anticipation, and strength in quiz 
and video usage. We benchmark our new feature set using 
a Random Forest (RF) classifier against eight feature sets
presented in previous work on success prediction in online 
courses. We retrieved these feature sets by systematically 
scanning the recent papers published at major educational 
venues (e.g., EDM, AIED, etc.) and reproducing the fea- 
tures based on the relevant papers. Our results on data 
of 214 students enrolled in a linear algebra flipped course 
show that the novel feature set outperforms all previously 
suggested feature sets. We also show that predictive per- 
formance can be increased by selecting the optimal features 
from the ensemble of all feature sets. 


In a second step, we extend our analysis to further courses 
along three dimensions: domain, structure, and educational 
setting. We compute the early predictive performance again 
using a RF classifier for three additional courses: a flipped 
course on functional programming (where pre-class activities 
include videos only), a MOOC on linear algebra (including 
video and quiz activities), and a MOOC on functional pro- 
gramming (including video activities only). For each course, 
we select the optimal features from the ensemble of feature 
sets (eight feature sets from prior work and one novel feature 
set from this paper) as input features for the RF classifier. 
Our results show that the structure of the course signifi- 
cantly influences performance. Predictive performance for 
courses including quizzes is much higher than for courses in- 
cluding only videos. Furthermore, we also show that while 
there is some overlap between the optimal features across 
courses, the importance of the features highly depends on 
the setting and structure of the course. 


2. EARLY PREDICTION FORMULATION 


The problem addressed in this paper can be framed into a 
time series classification task that relies on clickstreams to 
predict student success in a course. For clarity and repro- 
ducibility, we present and formalize the addressed problem. 


Course. Early success predictions are provided in the context
of a course (e.g., a MOOC or a course run in a flipped
classroom setting). In what follows, we hence mathematically
define fundamental concepts, such as the course schedule,
the learning objects, and their properties. Specifically,
we consider a set of students $U$ who are enrolled in a course
$c$ that is part of the online educational offering $C$. Each course
$c \in C$ has a pre-defined schedule $S_c$, consisting of $N = |S_c|$
online activities, such that $S_c = \{s_1, \dots, s_N\}$. We assume
that each online activity $s_i$ included in the course schedule
is represented by a tuple $(o_i, t_i)$, consisting of a learning
object $o_i \in O$ and its corresponding completion deadline for
students $t_i \in \mathbb{R}^+$, modelled as a timestamp. Each learning
object $o \in O$ is characterized by descriptive properties
denoted with an $M$-dimensional vector $f_o = (f_1, \dots, f_M)$
over a set of features $F = \{F_1, \dots, F_M\}$ that vary according
to the type of the learning object (e.g., the duration for a
video or the maximum grade for a quiz). Specifically, each
feature $F_j \in F$ can be envisioned as a set of discrete or
continuous values describing an attribute of a learning object
$o$, with $f_{o,j} \in F_j$ for $j = 1, \dots, M$. Our study in this paper
assumes that learning objects can be either videos or quizzes,
but the notation can be easily extended to other types (e.g.,
forum posts or readings). The type of a learning object
$o \in O$ is returned by a function $type : O \to \{video, quiz\}$.
Figure 1: Our Framework. We first analyze a flipped course with videos and quizzes, then investigate differences between courses
in flipped and MOOC settings and with videos only and videos plus quizzes. Eight state-of-the-art feature sets, a novel feature 
set, and feature ensembles are computed for each student and each week of the course. Weekly features are averaged and a 
success label is attached, according to the course type. Classification is performed using a Random Forest. Observations and 
recommendations on the predictive power of features are provided for each course setting, highlighting open challenges. 


Based on common log data collected by educational platforms,
we assume that learning objects of type video, denoted
as $O^{video} = \{o \in O \mid type(o) = video\}$, are described
by properties associated to the video duration in seconds as
$F^{video} = (duration \in \mathbb{R}^+)$. Learning objects of type quiz,
denoted as $O^{quiz} = \{o \in O \mid type(o) = quiz\}$, are characterized
by descriptive properties that model the maximum grade
students can achieve in that quiz as $F^{quiz} = (maxgrade \in \mathbb{R}^+)$.
For convenience, we use superscripts to denote a descriptive
property of a learning object. For instance, the
duration of a video $o \in O^{video}$ can be referred to as $o^{duration}$.
The same notation applies to other descriptive properties.
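One possible way to represent the course schedule and learning objects in code is sketched below. The class and field names are hypothetical and chosen only to mirror the notation (schedule entries as object and deadline pairs, a type function, and per-type descriptive properties).

```python
from dataclasses import dataclass, field
from typing import Dict, List

@dataclass
class LearningObject:
    object_id: str
    kind: str  # "video" or "quiz", i.e. the value returned by type(o)
    properties: Dict[str, float] = field(default_factory=dict)

@dataclass
class Activity:
    obj: LearningObject   # learning object o_i
    deadline: float       # completion deadline t_i, as a timestamp

def object_type(o: LearningObject) -> str:
    """Counterpart of the function type : O -> {video, quiz}."""
    return o.kind

# A toy schedule S_c with N = 2 activities (values are invented).
schedule: List[Activity] = [
    Activity(LearningObject("v1", "video", {"duration": 540.0}), deadline=1000.0),
    Activity(LearningObject("q1", "quiz", {"maxgrade": 10.0}), deadline=2000.0),
]
videos = [a.obj for a in schedule if object_type(a.obj) == "video"]
quizzes = [a.obj for a in schedule if object_type(a.obj) == "quiz"]
```

Keeping the descriptive properties in a dictionary keyed by property name makes it straightforward to add further object types, such as forum posts or readings, without changing the schedule structure.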


Interaction. Students enrolled in an online course interact
with the learning objects included in the course schedule,
generating a time-wise clickstream. We denote a clickstream
in a course $c \in C$ for a student $u \in U$ as a time series $I_u$, such
that $I_u = \{i_1, \dots, i_K\}$, with $K \in \mathbb{N}$ (e.g., a sequence of video
plays and pauses, quiz submissions, and so on). We leave
these definitions very general on purpose, in particular allowing
the length of each time series to differ, since our models
are inherently capable of handling this. Likewise, we neither
enforce nor expect all time series to be synchronized, i.e.,
sampled at the same time, but rather we are fully agnostic
to non-synchronized observations. This configuration
is common in educational time series. We assume that each
interaction $i_j$ is represented as a tuple $(t_j, a_j, o_j, d_j)$, consisting
of a timestamp $t_j \in \mathbb{R}^+$, the action $a_j \in A$ performed by
the student (e.g., play or pause), the learning object $o_j \in O$
involved in the action $a_j$ (e.g., a certain video or quiz),
and an $L$-dimensional descriptive vector $d_j = (d_1, \dots, d_L)$
over a set of features $D = \{D_1, \dots, D_L\}$. These descriptive
vectors are used to append relevant information to an
action $a_j$ performed at time $t_j$, such as the current video
time when the action occurred or the grade received by
the student on a quiz. Based on the type of the learning
object $o \in O$, the student can perform different actions
$A$. We assume that video interactions, denoted by
$\{i_j = (t_j, a_j, o_j, d_j) \in I_u \mid type(o_j) = video\}$, are limited to actions
$a_j \in A^{video} = \{Load, Play, Pause, Stop, SpeedChange, Seek\}$.
These actions are derived from those commonly allowed to
students in online educational platforms. Conversely, quiz
interactions, denoted by $\{i_j = (t_j, a_j, o_j, d_j) \in I_u \mid type(o_j) = quiz\}$,
include actions $a_j \in A^{quiz} = \{Submit\}$.


In online educational platforms, clickstream interactions include
a payload with additional information beyond the timestamp,
the action, and the involved learning object. For instance,
if a student submits a quiz, the resulting interaction
also includes the grade assigned by the system to the student's
quiz. Our notation models each dimension $D_l \in D$
of a clickstream interaction as a set of discrete or continuous
values describing the interaction $i_j \in I_u$, with $d_{j,l} \in D_l$
for $l = 1, \dots, L$. Specifically, we assume that interactions
involving base video actions, denoted as
$\{i_j = (t_j, a_j, o_j, d_j) \in I_u \mid a_j \in \{Load, Play, Pause, Stop\}\}$,
include descriptive properties associated to
the current video time at which the interaction occurred,
i.e., $D^{base} = (currenttime \in \mathbb{R}^+)$. Interactions involving a
speed change in a video, denoted as
$\{i_j = (t_j, a_j, o_j, d_j) \in I_u \mid a_j \in \{SpeedChange\}\}$,
are characterized by descriptive
properties associated to both the old and the new speed at which
the video has been and will be watched, i.e.,
$D^{SpeedChange} = (oldspeed \in \mathbb{R}^+, newspeed \in \mathbb{R}^+)$.
Interactions generated
by students while seeking the video backward or forward,
denoted as $\{i_j = (t_j, a_j, o_j, d_j) \in I_u \mid a_j \in \{Seek\}\}$, are
modelled by descriptive properties related to the previous
and current video time the student moved to, i.e.,
$D^{Seek} = (oldtime \in \mathbb{R}^+, newtime \in \mathbb{R}^+)$. Finally, submit interactions
generated in quiz activities, denoted as
$\{i_j = (t_j, a_j, o_j, d_j) \in I_u \mid a_j \in \{Submit\}\}$,
include descriptive properties on the
grade assigned to the quiz answer and the progressive number
of the attempt made on that quiz, i.e.,
$D^{Submit} = (grade \in \mathbb{R}^+, subnum \in \mathbb{R}^+)$, with $grade \in [0, 1]$.
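The interaction tuple and its action-dependent payload could be represented as in the following sketch. The class name, the payload keys, and the validity check are illustrative assumptions; only the action names and the kinds of payload fields come from the description above.

```python
from dataclasses import dataclass, field
from typing import Dict

# Allowed actions per learning-object type, as listed in the text.
ALLOWED_ACTIONS = {
    "video": {"Load", "Play", "Pause", "Stop", "SpeedChange", "Seek"},
    "quiz": {"Submit"},
}

@dataclass
class Interaction:
    timestamp: float            # t_j
    action: str                 # a_j
    object_id: str              # identifies the learning object o_j
    object_type: str            # "video" or "quiz"
    payload: Dict[str, float] = field(default_factory=dict)  # d_j

    def is_valid(self) -> bool:
        """Check that the action is allowed for the object's type."""
        return self.action in ALLOWED_ACTIONS.get(self.object_type, set())

# A toy clickstream I_u (all values invented for illustration).
clickstream = [
    Interaction(10.0, "Play", "v1", "video", {"current_time": 0.0}),
    Interaction(42.0, "Seek", "v1", "video", {"old_time": 30.0, "new_time": 12.0}),
    Interaction(95.0, "Submit", "q1", "quiz", {"grade": 0.8, "subnum": 1.0}),
    Interaction(120.0, "Submit", "v1", "video"),  # Submit is not a video action
]
validity = [i.is_valid() for i in clickstream]
```

Because the payload is a free-form mapping, clickstreams of different lengths and with different action mixes can share one representation, in line with the deliberately general definition of the time series.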


For convenience, we denote as $I_u^{t}$ the clickstream including the
interactions $i_j \in I_u$ such that $t_j < t$ for all $i_j \in I_u^{t}$, namely those
that occurred before time $t$. Similarly, since online activities in
MOOCs and flipped courses are organized on a weekly basis,
$t_w$ identifies the time $t$ at which course week $w$ ends. For
instance, the clickstream of user $u$ generated until the end of
the second week can be denoted as $I_u^{t_2}$.
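Restricting a clickstream to the interactions that occurred before the end of a given week could look as follows. Interactions are simplified to (timestamp, action, object id) tuples, weeks are assumed to last exactly seven days from a known course start, and both helper names are hypothetical.

```python
WEEK_SECONDS = 7 * 24 * 3600

def week_end(course_start, week):
    """Timestamp at which course week `week` (1-based) ends."""
    return course_start + week * WEEK_SECONDS

def clickstream_until(interactions, t):
    """Interactions with a timestamp strictly before t, in time order."""
    return sorted((i for i in interactions if i[0] < t), key=lambda i: i[0])

# Toy clickstream: (timestamp, action, object id), values invented.
stream = [
    (700_000.0, "Pause", "v2"),
    (100.0, "Play", "v1"),
    (1_300_000.0, "Submit", "q1"),  # falls in the third week
]
t2 = week_end(course_start=0.0, week=2)
early = clickstream_until(stream, t2)
```

Only the first two weeks of activity remain in the truncated clickstream, which is the input an early predictor would see at that point in the course.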


Success Label. Once interactions are modelled, we need to
associate a success label according to the final grade the
corresponding student has received for that course. We consider
a dataset $G$ to consist of tuples, i.e., $G = \{(I_{u_i}, y_{u_i})\}$, where
$I_{u_i}$ denotes the interactions of student $u_i$ and $y_{u_i} \in \{0, 1\}$
the pass-fail label or the above-below average grade label.


Feature Extraction. Machine-learning models rarely receive
raw interaction sequences, so we abstract such interactions
through a feature extraction step. Given the interactions
$I_u^{t_w} \subseteq I_u \in \mathbb{I}$ generated by student $u$ until course
week $w \in \mathbb{N}$, we produce fixed-length representations in
$\mathbb{H} \subset \mathbb{R}^{w \times H}$, where $H \in \mathbb{N}$ is the dimensionality of the
feature set. Therefore, we assume that $H$-dimensional vectors
are extracted for each week. For instance, if the feature set
includes “number of sessions” and “number of clicks”, feature
vectors of size $H = 2$ are extracted each week. Formally, the
extraction process is denoted as $\mathcal{H} : \mathbb{I} \to \mathbb{H}$, from
interactions to features.
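The example with “number of sessions” and “number of clicks” can be sketched as a weekly extraction that returns one 2-dimensional vector per week. The session definition used here (a new session starts after 30 minutes of inactivity) is an assumption for illustration and is not taken from the text; weeks are assumed to last seven days.

```python
WEEK_SECONDS = 7 * 24 * 3600

def weekly_features(timestamps, n_weeks, course_start=0.0, session_gap=1800.0):
    """Return an n_weeks x 2 matrix: [number of sessions, number of clicks]."""
    rows = []
    for w in range(n_weeks):
        lo = course_start + w * WEEK_SECONDS
        hi = course_start + (w + 1) * WEEK_SECONDS
        clicks = sorted(t for t in timestamps if lo <= t < hi)
        sessions = 0
        prev = None
        for t in clicks:
            # Assumed rule: a gap longer than `session_gap` opens a new session.
            if prev is None or t - prev > session_gap:
                sessions += 1
            prev = t
        rows.append([sessions, len(clicks)])
    return rows

# Toy click timestamps in seconds since the course start.
clicks = [10.0, 20.0, 5000.0, WEEK_SECONDS + 5.0, WEEK_SECONDS + 10.0]
features = weekly_features(clicks, n_weeks=2)
```

The result has one row per week and H = 2 columns, matching the fixed-length weekly representation described above.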


Model. Given the dataset $G$ with interaction-success label
pairs, an early success predictor $\mathcal{E}$ aims to predict the success
label $y_{u_i}$ associated to the interactions $I_{u_i}$. Formally, this
operation can be abstracted as a function $\hat{y}_{u_i} = \mathcal{E}(I_{u_i} \mid \theta)$,
where $\hat{y}_{u_i}$ denotes the predicted label, $\theta$ denotes the model
parameters, and $\mathcal{E}$ denotes the predictive function that maps
interactions $I_{u_i}$ to the predicted label $\hat{y}_{u_i}$ according to $\theta$.


Objective Function. Hence, training an early success predictor
$\mathcal{E}$ with interaction-success label pairs until course week $w$
becomes an optimization problem, aimed at finding the model
parameters $\theta$ that maximize the expectation on the following
objective function (i.e., predicting the correct success label,
given the interactions) on a dataset $G$:


$$\hat{\theta} = \arg\max_{\theta}\; \mathbb{E}_{(I_u, y_u) \in G} \left[\, y_u = \mathcal{E}(\mathcal{H}(I_u^{t_w}) \mid \theta) \,\right] \qquad (1)$$


In this paper, we focus on feature extraction, which formally 
results in the operationalization of the function H. 


3. REPRESENTATIVE FEATURE SETS 


To make sure that our work is not only based on individual 
examples of published research, we systematically scanned 
the proceedings of conferences and journals for relevant pa- 
pers in a manual process. In our analysis, we considered 
papers that appeared in the last years in the top educa- 
tional technology conferences (e.g., LAK, EC-TEL, AIED, 
and EDM) and journals (e.g., IEEE TLT, Springer EIT, 
Journal of Learning Analytics). We considered a paper to 
be relevant if it (a) proposed a novel feature set for course 
success analysis, and (b) focused on the context of online 
courses or courses with online activities. Papers on other 
tasks, e.g., prediction of affective state or conceptual un- 
derstanding, or other educational contexts, e.g., interactive 
simulations or games, were not considered. Moreover, pa- 
pers with highly overlapping feature sets were filtered, and 
the paper with the most extensive set was used as represen- 
tative. Finally, eight papers were included in our study. 


In a next step, we reproduced the feature sets described 
in the above papers. Our approach was to rely as much 
as possible on the artifacts provided by the authors them- 
selves, i.e., their source code and the descriptions included 
into the papers. In theory, it should be possible to repro- 
duce published results using only the technical descriptions 
in the papers. In reality, there are many tiny implemen- 
tation details with an impact on experiments. Overall, we 
could reproduce with reasonable assumptions all eight fea- 
ture sets based on the relevant papers. In what follows, we 
give a description of each feature set included in our study. 


AkpinarEtAl. This feature set consists of consecutive
sub-sequences of n clicks extracted from the session clickstreams
of a blended course [1]. In addition to sub-sequences, the
authors considered four features related to the number of
clicks, the number of session clickstreams, and attendance
information. Note that, in comparison to the original paper,
we extract sequences from a different set of raw events,
namely only videos and quizzes (e.g., no events on forums).
Hence, in our case the feature set has a size of $|A^{video} \cup A^{quiz}|^{n}$
features per student. Since we expect short patterns to be
un-interpretable and particularly long patterns to be rare,
we choose n = 3 for our analyses.
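Counting consecutive sub-sequences of n clicks can be sketched as below. This is a minimal illustration, assuming each session is already available as a list of action labels; the labels and the function name `ngram_counts` are hypothetical choices of this sketch.

```python
from collections import Counter


def ngram_counts(sessions, n=3):
    """Count every run of n consecutive actions, session by session.

    sessions: list of sessions, each a list of action labels in time order.
    Patterns never span two sessions.
    """
    counts = Counter()
    for actions in sessions:
        for k in range(len(actions) - n + 1):
            counts["-".join(actions[k:k + n])] += 1
    return counts


sessions = [
    ["VLoad", "VPlay", "VPause", "VLoad"],
    ["VPlay", "VPause", "VLoad"],
    ["QCheck"],                      # too short to contain a 3-click pattern
]
counts = ngram_counts(sessions, n=3)
```

With an alphabet of k distinct actions, there are k to the power n possible patterns, which is why short values of n keep the feature set manageable.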


BoroujeniEtAl. This feature set was originally used to mea- 
sure to what extent MOOC students are regular in their 
study patterns [3]. Specifically, it is considered whether stu- 
dents study on certain hours of the day, day(s) of the week or 
similar weekdays. Other features monitor whether students 
have the same distribution of study time among weekdays 
over weeks, particular amount of study time on each week- 
day, and finally to what extent a student follows the sched- 
ule of the course. This set includes 9 features per student. 
Other papers proposed similar regularity features [8],
[9], [2]. We limit our analysis to the feature set listed in [3],
as in our first experiments it exhibited the best predictive 
power (among papers focusing on regularity features). 


ChenCui. The feature set presented in this paper [4] includes
click counts from a mandatory undergraduate course run
through Moodle. Features include the number of total clicks 
and of clicks on campus, the ratio of on-campus to off- 
campus clicks, the number of online sessions (with average 
and standard deviation), standard deviation of time between 
online sessions, number of clicks during weekdays or week- 
ends, ratio of weekend to weekday clicks, and the number of 
clicks for each type of module (e.g., assignment, forum, and 
quiz). To accommodate the scenario presented in Section 2,
our study does not cover the features not easily generaliz- 
able to different types of online courses: the number of clicks 
on campus, the ratio of on-campus to off-campus clicks, the 
number of clicks for modules file, forum, report system. We 
therefore obtain a feature set of size 13 for each student. 


LalleConati. This paper focuses again on MOOCs. The
presented feature set is composed of video interaction
features at two levels of granularity. Features on video views
include the total number of video views (both watches and
rewatches), in addition to the average and standard
deviation of the proportion of videos watched, re-watched, and
interrupted per week. On the other hand, features on actions
performed within the videos include the frequency and total
number of all performed video actions, the frequency of video
actions for each type of video action, and the average and
standard deviation of the duration of video pauses, seek lengths,
and so on. This feature set has a size of 22 per student.


LemayDoleck. The next paper is also focused on MOOCs. 
Presented features include the number of videos watched 
per week, the average time fraction paused, played or spent 
watching, the average and standard deviation of the play- 
back rate, and the total number of rewinds, pauses, and 
fast-forwards. Note that this feature set includes only video- 
related measures, resulting in vectors of size 10 per student. 


MbouzaoEtAl. In this MOOC paper [15], the authors
introduce three novel features, namely attendance rate,
utilization rate, and watching ratio. The attendance rate of a
student on a given week is the number of videos the
student played over the total number of videos in that week.
The utilization rate is the proportion of video play time ac- 
tivity of the student over the sum of video lengths for all 
videos on that week. Finally, the watching ratio is defined 
as the product between the two former features. This 3- 
sized feature set has been tested in MOOCs, extending an 
already-existing feature set [7]. 
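The three rates described above can be written down directly. The sketch below assumes per-week inputs (the set of videos played, the play time per video, and the length of each scheduled video) and returns 0.0 for a week without videos; these input formats, the zero-week convention, and the function names are assumptions of this sketch.

```python
def attendance_rate(videos_played, video_lengths):
    """Share of the week's videos that the student played at least once."""
    if not video_lengths:
        return 0.0
    return len(set(videos_played) & set(video_lengths)) / len(video_lengths)


def utilization_rate(play_time, video_lengths):
    """Play time of the student over the summed length of the week's videos."""
    total_length = sum(video_lengths.values())
    if total_length == 0:
        return 0.0
    return sum(play_time.get(v, 0.0) for v in video_lengths) / total_length


def watching_ratio(videos_played, play_time, video_lengths):
    """Product of attendance rate and utilization rate."""
    return (attendance_rate(videos_played, video_lengths)
            * utilization_rate(play_time, video_lengths))


week_videos = {"v1": 600.0, "v2": 400.0}     # video id -> length in seconds
play_time = {"v1": 300.0}                    # video id -> seconds played
ar = attendance_rate(play_time.keys(), week_videos)
ur = utilization_rate(play_time, week_videos)
wr = watching_ratio(play_time.keys(), play_time, week_videos)
```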


MubarakEtAl. This paper is primarily focused on
implicit features about video-usage behavior in MOOCs.
Composed of 13 features, this set covers fine-grained
characteristics, such as the percentage of the video the learner watched
not counting repeated segments, the amount of real time the 
learner spent watching the video (i.e. when playing or paus- 
ing) compared to the video duration, and the sum of times 
a learner viewed a video in its entirety. 


WanEtAl. This set was designed for a small private online
course [26]. Features measure, for example, the online learning
time, the strength of the learner’s engagement in forums and
weekly assignments, and the extent to which students attempt
the homework early. Tables 1 and 2 in the
cited paper provide further details. Given that we do not
cover forum interactions, our study does not consider forum
features. Finally, this set includes 14 features per student.


4. EARLY PREDICTORS IN FLIPPED 
COURSE SETTINGS 


In this section, we first present a novel feature set for flipped 
courses, based on alignment, anticipation, and strength in 
content usage. We then describe the experimental setup 
and results aimed to assess to what extent the feature sets 
(including ours) are predictive of student success. 


4.1 Our Feature Set 


The feature sets presented so far mainly tackle video-related 
features and/or consider only low-level features, with only a 
few of them including features related to quizzes or assign- 
ments. Considering that predicting the success of a student 
based on clickstream data only is a challenging task per se,
we believe that limiting features to those extracted from 
videos may result in inferior predictive performance. We 
therefore suggest a number of additional features assessing 
students’ knowledge and alignment with the course schedule. 


Competency Strength is defined as the average of the in- 
verse number of submissions for a quiz, weighted by the 
highest grade achieved by the student on that quiz. Given 
the inverse term, the value of this feature decreases when 
the student attempts the quiz multiple times and if the 
grade achieved by the student on the last attempt is not 
the highest-possible one. Hence, good-performing students 
may use few attempts and reach the maximum quiz grade 
fast (value close to 1). Students struggling with the material 
may attempt the quiz many times and not reach the max- 
imum grade (value close to 0). Given a student u and the 
week w of the course, this feature is computed as: 


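$$\frac{1}{|Q_u|} \sum_{q \in Q_u} \frac{1}{A_q^{u}} \cdot \max(G_q^{u}) \qquad (2)$$


where:


• $Q_u = \{o_j \mid i_j = (t_j, a_j, o_j, d_j) \in I_u \wedge type(o_j) = quiz \wedge t_j < t_w\}$
are the quizzes taken by student u till week w.


• $A_q^{u} = |\{i_j \mid i_j = (t_j, a_j, o_j, d_j) \in I_u \wedge o_j = q \wedge t_j < t_w\}|$
is the number of attempts a student had on quiz q.


• $G_q^{u} = \{g_j \mid i_j = (t_j, a_j, o_j, d_j) \in I_u \wedge o_j = q \wedge t_j < t_w\}$
is the set of grades $g_j$ a student got on quiz q.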
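A small sketch of Competency Strength, following the prose definition: for each quiz, the highest grade is divided by the number of submissions, and the result is averaged over the quizzes taken. The input format, the grade normalisation by `max_grade`, and the value 0.0 for a student without quiz attempts are assumptions of this sketch.

```python
def competency_strength(grades_by_quiz, max_grade=1.0):
    """Average over quizzes of (highest normalised grade) / (number of attempts).

    grades_by_quiz: dict mapping a quiz id to the list of grades obtained,
    one entry per submission made before the end of the considered week.
    """
    if not grades_by_quiz:
        return 0.0
    total = 0.0
    for grades in grades_by_quiz.values():
        total += (max(grades) / max_grade) / len(grades)
    return total / len(grades_by_quiz)


# One quiz solved at the first attempt, one quiz attempted four times
# without reaching the maximum grade.
strength = competency_strength({"q1": [1.0], "q2": [0.0, 0.5, 0.5, 0.5]})
```

In the example, the first quiz contributes 1.0 and the second 0.5 / 4 = 0.125, so the feature value is their mean.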


Competency Alignment is defined as the number of quizzes 
the student received the maximum grade until week w, di- 
vided by the total number of quizzes scheduled for the period 
of consideration. Good-performing students may receive the 
maximum grade in all quizzes for the period of consideration 
(value close to 1); low-performing students may be behind 
the schedule and pass fewer quizzes than those proposed 
(value close to 0). Given a student u and the week w of the 
course, this feature is computed as: 


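$$\frac{|Q_u^{max} \cap S_{req}(t_w)|}{|S_{req}(t_w)|} \qquad (3)$$

where:

• $Q_u^{max} = \{o_j \mid i_j = (t_j, a_j, o_j, d_j) \in I_u \wedge type(o_j) = quiz \wedge \text{the grade of } i_j \text{ is the maximum grade}\}$
is the set of quizzes for which the student u received the
maximum grade until week w.


• $S_{req}(t_w) = \{o_j \in O \mid (o_j, t_j) \in S_c \wedge type(o_j) = quiz \wedge t_j < t_w\}$
is the set of quizzes to complete by week w.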


Competency Anticipation is defined as the number of quizzes 
attempted by the student among those in subsequent weeks 
of the current week of study. This feature can be seen as a 
proxy of the learning propensity of a student. For instance, 
if a quiz is scheduled to be solved in subsequent weeks, we 
expect that good-performing students try them earlier, an- 
ticipating the deadline stated in the platform (value close to 
1). Low-performing students may delay the consumption of 
quizzes across weeks or even towards the end of the course 
(value close to 0). Given a student u and the week w of the 
course, this feature is computed as: 


$$\frac{|Q_u \cap S_{ant}(t_w)|}{|S_{ant}(t_w)|} \qquad (4)$$


where $Q_u$ is the set of quizzes taken by student u until week
w as defined in Eq. 2 and:


• $S_{ant}(t_w) = \{o_j \in O \mid (o_j, t_j) \in S_c \wedge type(o_j) = quiz \wedge t_j > t_w\}$
is the set of quizzes to complete after week w.
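Competency Alignment and Competency Anticipation are both overlap ratios between what the student did and what the schedule contains. The sketch below assumes the schedule is a list of (quiz id, due time) pairs and returns 0.0 when the scheduled set is empty; the helper names and the empty-set convention are assumptions of this sketch.

```python
def split_schedule(schedule, t_w):
    """Split scheduled objects into those due before t_w and those due after."""
    required = {o for o, t in schedule if t < t_w}
    upcoming = {o for o, t in schedule if t > t_w}
    return required, upcoming


def overlap_ratio(done, scheduled):
    """Fraction of the scheduled objects that appear in the student's set."""
    scheduled = set(scheduled)
    if not scheduled:
        return 0.0
    return len(set(done) & scheduled) / len(scheduled)


schedule = [("q1", 1), ("q2", 2), ("q3", 3), ("q4", 4)]   # (quiz, due week)
required, upcoming = split_schedule(schedule, t_w=2.5)

max_graded = {"q1"}                 # quizzes with the maximum grade so far
attempted = {"q1", "q2", "q3"}      # quizzes attempted so far

alignment = overlap_ratio(max_graded, required)      # Competency Alignment
anticipation = overlap_ratio(attempted, upcoming)    # Competency Anticipation
```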


Content Alignment is defined as the number of videos watched 
by the student until week w, divided by the total number 
of videos scheduled for the period of consideration. Good- 
performing students are expected to complete all videos for 
the period of consideration (value close to 1), while low-
performing students may complete fewer videos than those
proposed (value close to 0). Given a student u and the week 
w of the course, this feature is computed as: 


$$\frac{|V_u \cap S_{req}(t_w)|}{|S_{req}(t_w)|} \qquad (5)$$


where:


• $V_u = \{o_j \mid i_j = (t_j, a_j, o_j, d_j) \in I_u \wedge type(o_j) = video\}$
is the set of videos watched by student u until week w.

• $S_{req}(t_w) = \{o_j \in O \mid (o_j, t_j) \in S_c \wedge type(o_j) = video \wedge t_j < t_w\}$
is the set of videos to watch by week w.


Content Anticipation is defined as the number of videos com- 
pleted by the student among those in subsequent weeks of 
the current week of study. For instance, if a video is due 
the next week, we expect that good-performing students 
might watch them earlier, anticipating the deadline stated 
in the platform (value close to 1). On the other hand, low- 
performing students may tend to delay the completion of 
videos (value close to 0). Given a student u and the week w 
of the course, this feature is computed as: 


$$\frac{|V_u \cap S_{ant}(t_w)|}{|S_{ant}(t_w)|} \qquad (6)$$

where $V_u$ is the set of videos watched by student u until
week w as defined in Eq. 5 and:


• $S_{ant}(t_w) = \{o_j \in O \mid (o_j, t_j) \in S_c \wedge type(o_j) = video \wedge t_j > t_w\}$
is the set of videos to watch after week w.

Student Shape is defined as the student’s tendency of re- 
ceiving the maximum grade in a quiz at the first attempt 
in a row. Good-performing students are expected to con- 
secutively receive the maximum grade in quizzes at the first 
attempt (value close to 1); students experiencing difficulties 
may require multiple attempts on each quiz, before getting 
the maximum grade (value close to 0). Given a student u
and the week w of the course, this feature is computed as: 


: Bech (7) 


DV (witser Pi eee l{pil(pi,li) EP A 4 = 1h] +e 


where $P = \{(p_0, l_0), \ldots, (p_n, l_n)\}$ represents a series
counting how many quizzes the student consecutively receives the
maximum grade ($l_i = 1$) or failed ($l_i = 0$) at the first
attempt in a row. For instance, if a student gets the maximum
grade for the first five quizzes at the first attempt in a row,
then is wrong in two quizzes at the first attempt, and then
receives the maximum grade for ten quizzes at the first
attempt in a row, P would be equal to {(5, 1), (2, 0), (10, 1)}.
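The series P is a run-length encoding of the first-attempt outcomes. A minimal sketch, assuming the outcomes are given as a chronological list of 1 (maximum grade at the first attempt) and 0 (otherwise); the function name `first_attempt_runs` is hypothetical.

```python
from itertools import groupby


def first_attempt_runs(outcomes):
    """Run-length encode first-attempt outcomes into (run length, label) pairs."""
    return [(len(list(group)), label) for label, group in groupby(outcomes)]


# Five successes, two failures, then ten successes, as in the example above.
outcomes = [1] * 5 + [0] * 2 + [1] * 10
P = first_attempt_runs(outcomes)
```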


Student Speed is defined as the average time passed between 
two consecutive attempts for the same quiz, among those 
taken by the student. This feature captures intrinsic behav- 
ior of students who take the quiz, spending less time or more 
time to attempt it, on average. Given a student u and the
week w of the course, this feature is computed as: 


ltq| \t3 -—t || 
p>)? aa as 
“" q€Qu 2 
where $Q_u$ is the set of quizzes taken by student u until week
w as defined in Eq. 2 and:
• $t_q = [t_j \mid (t_j, a_j, o_j, d_j) \in I_u \wedge o_j = q \wedge t_j > t_{j-1}]$ are
timings between trials for u on q, chronologically.
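The prose definition of Student Speed can be sketched as below. The exact normalisation is an assumption of this sketch: gaps between consecutive attempts are averaged per quiz, quizzes with a single attempt are skipped, and the per-quiz averages are then averaged; `student_speed` is a hypothetical name.

```python
def student_speed(attempt_times_by_quiz):
    """Mean time between consecutive attempts on the same quiz.

    attempt_times_by_quiz: dict mapping a quiz id to the attempt timestamps.
    Assumption: quizzes with fewer than two attempts contribute nothing.
    """
    per_quiz = []
    for times in attempt_times_by_quiz.values():
        ordered = sorted(times)
        if len(ordered) < 2:
            continue
        gaps = [later - earlier for earlier, later in zip(ordered, ordered[1:])]
        per_quiz.append(sum(gaps) / len(gaps))
    return sum(per_quiz) / len(per_quiz) if per_quiz else 0.0


speed = student_speed({"q1": [0, 10, 30], "q2": [5], "q3": [100, 140]})
```

In the example, the first quiz has gaps of 10 and 20 (mean 15), the third quiz has a single gap of 40, and the second quiz is skipped.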


In the rest of the paper, we will refer to our set by Ours. 


4.2 Experimental Evaluation 

In this section, we benchmark our new feature set against 
the eight feature sets presented in prior work (see Section 
3), on early success prediction in flipped courses. For con- 
venience, we will use author-based labels to identify feature 
sets throughout the paper, but we will be more interested in 
contrasting the impact of features in those papers based on 
what they implicitly measure (not based on the authors). 


4.2.1 Experimental Setup 


Protocol. For each dataset, we applied a train-test eval- 
uation, i.e. parameters were fit on the training data set 
and the performance of the models was evaluated on the 
test data set. We performed all experiments using Random 
Forest (RF) classifiers, known to achieve a good trade-off 
between prediction accuracy and interpretability. Perfor- 
mance of all models was computed using a nested student- 
stratified (i.e. dividing the folds by students) 10-fold cross 
validation. The same folds were used for all experiments, 
across feature sets. We optimized the hyper-parameters 
of RFs via Grid Search in Scikit-Learn. Specifically, we 
tuned the following hyper-parameters: number of estimators 
(25, 50, 100, 200, 300, 500), the maximum number of features 
(sqrt, None, log2), and the splitting criterion (gini, entropy). 
More extensive grids were run, but they did not show any 
substantial improvement. To be precise, we determined 
the set of optimal hyper-parameters as follows: within each 
iteration, we ran an inner student-stratified 10-fold cross- 
validation on the training set in that iteration, and selected 
the combination of hyper-parameter values yielding the high- 
est accuracy on the inner cross-validation. Note that we 
trained RFs by weeks: the RF for week w of a given course 
was trained on data collected up to week w. To obtain the 
input features for RF for week w, we computed the weekly 
features for the selected feature set and averaged them. 
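The size of the search grid and the nested, student-level fold structure described above can be sketched with the standard library alone. The parameter names follow the Scikit-Learn Random Forest arguments; the round-robin fold assignment is a simplification made for this sketch (it does not balance the label distribution), and the helper names are hypothetical.

```python
from itertools import product

# Hyper-parameter grid listed in the protocol.
PARAM_GRID = {
    "n_estimators": [25, 50, 100, 200, 300, 500],
    "max_features": ["sqrt", None, "log2"],
    "criterion": ["gini", "entropy"],
}


def grid_combinations(grid):
    """Enumerate every combination of hyper-parameter values."""
    keys = list(grid)
    return [dict(zip(keys, values)) for values in product(*(grid[k] for k in keys))]


def student_folds(student_ids, k=10):
    """Assign each distinct student to exactly one of k folds (round-robin)."""
    unique = sorted(set(student_ids))
    return [unique[i::k] for i in range(k)]


def nested_splits(student_ids, k=10):
    """Yield (outer test students, inner folds built on the outer training set)."""
    outer = student_folds(student_ids, k)
    for i, test in enumerate(outer):
        train = [s for j, fold in enumerate(outer) if j != i for s in fold]
        yield test, student_folds(train, k)


combos = grid_combinations(PARAM_GRID)
splits = list(nested_splits(range(214)))   # 214 students, as in LA-Flip
```

Each outer iteration would evaluate all 36 combinations on the inner folds, keep the best one, and score it once on the held-out outer fold.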


Data Set: LA-Flip. We consider a Linear Algebra course 
for undergraduate students taught in a flipped format for 
10 weeks at EPFL. Typical pre-class work included a list 
of video lectures and online quizzes from a Linear Algebra 
MOOC. The final exam grade, lying between 0 and 6, with 4 
as passing threshold, is considered as a measure for students’ 
performance. The repeating students were filtered out, given 
that their repeated exposure to the material might add a 
bias to our findings. The final dataset consists of clickstream 
data from 214 students, with 41% of them failing the course. 
The study was approved by the university’s ethics committee 
(HREC No. 058-2020/10.09.2020). 


4.2.2 Observations 

We evaluated the predictive accuracy of RF classifiers trained 
on the different feature sets extracted from LA-Flip under a 
binary classification that aims to identify passing and fail- 
ing students early, as described in Section We further 
also trained RF classifiers only on the most important fea- 
tures selected from all features (denoted as EnsembleAll) 
and from all features except ours (denoted as EnsembleButOurs).
Figure 2 reports the balanced accuracy, the area
under the ROC curve (AUC), and the individual percentage 
of passing and failing students correctly identified (recall) 
for each feature set over all weeks and folds. 
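For reference, balanced accuracy is the mean of the two per-class recalls reported in the figure. The sketch below assumes that label 1 denotes passing students and label 0 failing students; this coding and the function names are assumptions of this sketch.

```python
def recall(y_true, y_pred, label):
    """Share of the students with the given true label that were predicted as such."""
    predictions = [p for t, p in zip(y_true, y_pred) if t == label]
    if not predictions:
        return 0.0
    return sum(p == label for p in predictions) / len(predictions)


def balanced_accuracy(y_true, y_pred):
    """Mean of the recall on passing (1) and on failing (0) students."""
    return (recall(y_true, y_pred, 1) + recall(y_true, y_pred, 0)) / 2


y_true = [1, 1, 1, 1, 0, 0]
y_pred = [1, 1, 1, 0, 0, 1]
succeed_recall = recall(y_true, y_pred, 1)
non_succeed_recall = recall(y_true, y_pred, 0)
score = balanced_accuracy(y_true, y_pred)
```

Unlike plain accuracy, this score does not reward a classifier for predicting only the majority class.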


The lowest-performing feature sets appear to be those monitoring
students’ regularity (orange) and attendance and utilization 
rates (blue). Hence, a first conclusion we can draw is: 


Highlight #1. Regularity and attendance/utilization
features, powerful in MOOCs, do not allow distinguishing
passing from failing students in the considered flipped course.


The feature sets mostly related to video-clicking behavior,
such as those from Lemay & Doleck, do not lead to substantial
differences from each other and all achieved a balanced
accuracy between 55% and 59% (similarly for AUC). This
finding might reveal an intrinsic limit for video features in
predicting student success from pre-class activities. Our
results also raise the question on how and why a certain type
of video features should be preferred compared to others.


Figure 2: [LA-Flip]. Effectiveness of a RF classifier trained on
separate feature sets and on ensembles. Our feature set is
essential to increase the effectiveness of the classifier, especially
in terms of Non-Succeed (failing students) Recall.


Figure 3: [LA-Flip]. AUC for the best six feature sets (legend:
AkpinarEtAl, LalleConati, Ours, ChenCui, EnsembleButOurs,
EnsembleAll; x-axis: Week). The ensemble of all features (grey)
leads to an increase in effectiveness, with respect to considering
feature sets separately.


Highlight #2. In this flipped course, there are minimal 
differences in performance among video-usage features; an 
intrinsic predictive limit for video-usage features exists. 


This motivates investigation on the impact of features
targeting quiz usage. In this direction, the features proposed
by Wan et al. cover a range of raw counting and timing
measures that target quizzes. Figure 2 shows that this feature
set performs even worse than just using video features. Conversely,
by measuring more complex patterns in quiz consumption,
our feature set led to a balanced accuracy of 67% (similarly
for AUC). To identify where our features make the
difference, we considered the percentage of passing and
failing students correctly classified, as shown in the two
bottom plots in Figure 2. While there are no substantial
differences between our feature set and the other ones in identifying
passing students (Succeed Recall), a clear improvement is
obtained in the detection of failing students (Non-Succeed
Recall), fundamental to ensure fewer students are left
behind. The impact of our features can also be appreciated
across weeks in Figure 3. Our features allowed the ensemble
to be effective in the first weeks, while both ours and other
features jointly led to an improvement in the second part of
the course. Given our results and the characteristics of our
features, we can observe that:


Highlight #3. Extracting fine-grained features that model 
alignment, anticipation and strength of video/quiz usage 
results in higher predictive power on failing students. 


Though considering the feature sets separately allowed us
to perform a fine-grained assessment and have an estimation
of their predictive power, it remains unclear how the
effectiveness of early predictors can be improved by training
models with an ensemble of all features and to what extent
the importance of the considered features varies. Hence, on
the right side of the plots in Figure 2, we present the results
achieved by a RF classifier trained only with the most important
features selected from all features and from all features
except ours. It can be observed that the optimal ensemble
of features without ours results in lower performance,
compared to the optimal ensemble that also uses our features.
The optimal ensemble of all features has an AUC score
consistently higher than 0.70. To inspect what drives success
prediction, we computed the feature importance over weeks
and folds, and reported in Figure 4 the importance of the
features (short description in Table 1) selected by RF. Looking
at importance scores in Figure 4(a), we observe that:


Highlight #4. The extent to which students anticipate con- 
tent consumption, the tendency of learning during week- 
ends, the proportion of watched videos, and the strength of 
their performance in quizzes, had the highest importance. 


Figure 4(b) shows that the difference in importance across
features is more evident in the first weeks. This finding 
emphasizes the fact that selecting appropriate features is 
more crucial when interested in very early predictions. 


5. EARLY PREDICTORS OVER COURSES 


Our exploratory analysis revealed interesting patterns on
the predictive power and importance of a range of features.
However, the extent to which the patterns identified in that
flipped course also hold in courses with other structures and
educational settings remains under-explored. To this end,
we extended our analysis to a flipped course in a different
domain (Functional Programming, only video data in pre-class
activities), a MOOC in the same domain (Linear Algebra,
both videos and quizzes), and a MOOC from a different
domain (Functional Programming, only video interactions).


(b) Feature importance over weeks.


Figure 4: [LA-Flip]. Importance of the best nineteen features selected by a RF classifier from the ensemble of all feature sets.
Four features of our set have been selected as important. Table 1 lists the Feature IDs and the short description of each feature.


Table 1: [LA-Flip]. Description of the most important nineteen features selected by a RF classifier from the ensemble of all
feature sets, shown in decreasing order of importance. Four features of our set have been selected among the top eleven.


ID | Set | Name | Short Description
f0 | Ours | CompetencyAnticipation | The extent to which the student approaches soon a quiz provided in subsequent weeks.
f1 | LalleConati | WeeklyPropWatched-Avg | The proportion of videos the student watched, counting repeating segments.
f2 | ChenCui | RatioClicksWeekendDay | The ratio between clicks happened during weekend and weekdays.
f3 | Ours | ContentAnticipation | The extent to which the student approaches soon a video provided in subsequent weeks.
f4 | Ours | CompetencyStrength | The extent to which a student passes a quiz getting the maximum grade with a low number of trials.
f5 | BoroujeniEtAl | RegWeeklySim-M2 | The extent to which the student has a similar distribution of workload among weekdays across weeks.
f6 | LalleConati | WeeklyPropInter-Std | The standard deviation of the time the student spent while interrupting a video, across videos.
f7 | WanEtAl | NumSubmissionCor… | The average number of quizzes attempted and correct.
f8 | WanEtAl | NumSubmissions-Avg | The number of submissions required to pass a quiz, on average.
f9 | LalleConati | WeeklyPropInter-Avg | The average time the student spent while interrupting a video, across videos.
f10 | Ours | StudentShape | The extent to which the student receives the maximum grade in quizzes at the first attempt in a row.
f11 | WanEtAl | NumSubmissionPerCorrect | Percentage of the correct quiz submissions with respect to the total submissions.
f12 | LalleConati | WeeklyPropReplayed-Avg | The proportion of videos the student re-watched, not counting repeating segments.
f13 | LalleConati | FrequencyEvent-VideoPlay | The frequency of the video play action in the students’ online sessions.
f14 | AkpinarEtAl | QCheck-QCheck-VLoad | The amount of times the student checks twice a given quiz and then go to load a video.
f15 | AkpinarEtAl | VPlay-VPause-VLoad | The amount of times the student plays a video, pause and then load the next one.
f16 | BoroujeniEtAl | RegPeriodicity-M3 | The extent to which the daily study pattern is repeating over weeks (e.g., same days of the week).
f17 | BoroujeniEtAl | RegWeeklySim-M1 | The extent to which the student works on the same weekdays.
f18 | AkpinarEtAl | VStop-PCheck-VLoad | The amount of times the student stops a video, attempts a quiz and then load the next video.


5.1 Experimental Setup 


Protocol. In this experiment, we followed the steps described 
in Section with few exceptions. Specifically, for each 
data set, we considered only classifiers trained with the opti- 
mal ensemble of all features proposed in prior work plus the 
ones proposed in this paper. To obtain the input features 
for the RF classifier on week w, we computed the weekly 
features for all feature sets; then averaged features of the 
same week, and finally averaged across weeks till week w. 
For each course, we computed the most important features 
from the ensemble (eight existing sets and ours) based on the 
average importance of the features across folds and weeks. 
The study was approved by the university’s ethics commit- 
tee (HREC No. 096-2020/09.04.2020). 


Data Set: FP-Flip. We consider one stream of a Functional 
Programming course taught to EPFL Master’s students in a 
flipped manner for 10 weeks. The preparatory work included
a list of videos from a Functional Programming MOOC.
Repeating students were filtered out. Being a Master’s course
with a failing percentage of only 5%, we considered whether 
a student’s final course grade (lying between 0 and 6) was 
above the average grade over all students as a success label. 
The dataset consists of clickstreams from 218 students, with 
38% of them being below average. 
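The two labelling rules used for the flipped courses (a fixed passing threshold of 4 for LA-Flip and the above-average rule for FP-Flip) can be sketched as follows. Treating a grade equal to the threshold as a pass and a grade equal to the mean as not above average are assumptions of this sketch, as are the function names.

```python
def pass_fail_labels(grades, threshold=4.0):
    """Label 1 if the final grade reaches the passing threshold, else 0."""
    return {student: int(grade >= threshold) for student, grade in grades.items()}


def above_average_labels(grades):
    """Label 1 if the final grade is strictly above the class average, else 0."""
    mean = sum(grades.values()) / len(grades)
    return {student: int(grade > mean) for student, grade in grades.items()}


grades = {"a": 5.0, "b": 4.0, "c": 3.0}       # final grades on the 0-6 scale
pass_fail = pass_fail_labels(grades)
above_avg = above_average_labels(grades)
```

The above-average rule keeps the two classes reasonably balanced even when almost every student passes, which is the situation described for the Master's course.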


Data Set: LA-MOOC. The content used in pre-class activi- 
ties within LA-Flip was also provided by EPFL instructors 
on an external MOOC platform in form of three separate 
MOOCs, with the first MOOC being equivalent to the first
4 weeks of the flipped course, the second MOOC equivalent 
to week 5 to week 8, and the third MOOC equivalent to 
the last 3 weeks. Given that the first 4 weeks of LA-Flip 
were delivered in a traditional manner, we excluded the first 
MOOC from our study. We also excluded the third MOOC, 
given that the number of enrolled students was very small.
To sum up, our study in this paper considers only the sec- 
ond MOOC that covers the second part (weeks 5 to 8) of the 
flipped course. To pass the course, it is mostly necessary to 
obtain at least 60% of the total points for each assignment. 
Hence, we used this rule as a way to measure success in our 
study. The final data set consists of clickstream data from 
170 students, with 33% of them failing the course.

Table 2: Features selected as important by RF classifiers for the ensemble of features for each course.


Set | Name | Short Description | Number of courses (among LA-Flip, FP-Flip, LA-MOOC, FP-MOOC) in which the feature is selected
AkpinarEtAl | QCheck-QCheck-QCheck | The amount of times the student checks three times the same quiz. | 1
AkpinarEtAl | QCheck-QCheck-VLoad | The amount of times the student checks twice the quiz and then go to load a video. | 1
AkpinarEtAl | VPlay-VPlay-VPlay | The amount of times the student clicks for three consecutive times on play for three different videos. | 1
AkpinarEtAl | VPlay-VPause-VLoad | The amount of times the student plays a video, pause and then load another one. | 1
AkpinarEtAl | VPlay-QCheck-QCheck | The amount of times the student plays a video, then checks twice a quiz. | 1
AkpinarEtAl | VPlay-VStop-VPlay | The amount of times the student plays a video, stops, and plays another one. | 1
AkpinarEtAl | VPause-VSpeedChange-VPlay | The amount of times the student pauses a video, changes the speed, and re-plays it. | 1
AkpinarEtAl | VStop-VPlay-VSeek | The amount of times the student stops a video, then re-play it and seek to a given part. | 1
AkpinarEtAl | VStop-VCheck-VLoad | The amount of times the student stops a video, checks a quiz and then load another video. | 1
BoroujeniEtAl | DelayLecture | The average delay in viewing video lectures, as soon as they are released. | 2
BoroujeniEtAl | RegWeeklySim-M1 | The extent to which the student works on the same weekdays across weeks. | 2
BoroujeniEtAl | RegWeeklySim-M2 | The extent to which a student has a similar distribution of workload among weekdays across weeks. | 2
BoroujeniEtAl | RegWeeklySim-M3 | The extent to which the time spent on each day of the week is similar for different weeks of the course. | 1
BoroujeniEtAl | RegPeriodicity-M1 | The extent to which the hourly pattern of student’s activities is repeating over days. | 1
BoroujeniEtAl | RegPeriodicity-M3 | If the daily study pattern is repeating over weeks (e.g. is active on Monday and Tuesday in every week). | 2
BoroujeniEtAl | RegPeakTime-M1 | The extent to which students’ activities are centered around a particular hour of the day. | 1
BoroujeniEtAl | RegPeakTime-M2 | The extent to which students’ activities are centered around a particular day of the week. | 1
ChenCui | RatioClicksWeekendWeekdays | The ratio between clicks in weekdays and weekends. | 4
ChenCui | TimeSession-Avg | The average amount of time spent from a login to the end of the session. | 1
ChenCui | TimeSession-Std | The standard deviation of time spent from a login to the end of the session. | 1
ChenCui | TimeBetweenSessions | The average amount of time passed between two sessions for a student. | 1
ChenCui | TotalClicks-Weekdays | The number of clicks performed by a student over weekdays. | 1
LalleConati | PauseDuration-Avg | The average amount of time spent in pause while interacting with a video. | 1
LalleConati | SeekLength-Std | The extent to which the seek length varies across videos. | 1
LalleConati | PauseDuration-Std | The extent to which the pause duration varies across videos. | 1
LalleConati | TimeSpeedingUp-Avg | The average amount of time spent with higher than 1x speed while playing a video. | 1
LalleConati | TimeSpeedingUp-Std | The extent to which the time spent speeding up higher than 1x the videos varies. | 1
LalleConati | WeeklyPropWatched-Avg | The proportion of videos the student watched, counting repeating segments. | 1
LalleConati | WeeklyPropInter-Avg | The average time the student spent in interrupting a video. | 1
LalleConati | WeeklyPropInter-Std | The deviation of the time the student spent in interrupting a video. | 1
LalleConati | WeeklyPropReplayed-Avg | The proportion of videos the student re-watched, counting repeating segments. | 1
LalleConati | WeeklyPropReplayed-Std | The deviation of the proportion of videos the student re-watched, counting repeating segments. | 1
LalleConati | FrequencyEvent-VideoPlay | The frequency of the play event in the students’ sessions. | 2
MubarakEtAl | SpeedPlayBack-mean | The average speed the student used to play back a video. | 2
WanEtAl | NumSubm… | The number of quizzes attempted and correct. | 1
WanEtAl | NumSubm… | The number of submissions performed for a quiz, on average. | 2
WanEtAl | NumSubmissionPerCorrect | The percentage of the correct quiz submissions with respect to the total submissions. | 2
WanEtAl | NumSubmissionDistinct | The total number of distinct problems attempted by the student. | 1
Ours | CompetencyAnticipation | The extent to which the student approaches soon a quiz provided in subsequent weeks. | 1
Ours | ContentAnticipation | The extent to which the student approaches soon a video provided in subsequent weeks. | 1
Ours | CompetencyStrength | The extent to which a student passes a quiz getting the maximum grade with a low number of trials. | 2
Ours | StudentShape | The extent to which the student receives the maximum grade in quizzes at the first attempt in a row. | 2
Ours | StudentSpeed | The average amount of time passed between two submissions for the attempted quizzes. | 1
Figure 5: AUC scores per week for RF classifiers trained on
feature ensembles. Flipped courses (*-Flip) last 10 weeks;
LA-MOOC (FP-MOOC) last 4 (6) weeks.


Data Set: FP-MOOC. The content delivered in pre-class
activities in FP-Flip was also provided by EPFL instructors
on an external MOOC platform in the form of two separate
MOOCs, with the first MOOC being equivalent to the
first 6 weeks of the flipped course and the second MOOC to
the subsequent weeks. No data was available on the second
MOOC, so we limited our study to only the first MOOC
(week 1 to 6 of the flipped period of FP-Flip). To pass this
MOOC, 80% of the total points for each of the five graded
assignments are mostly needed. Hence, we used this rule to
measure success in our study. The dataset consists of clickstreams
from 3,565 students, with 52% failing the course.


5.2 Observations

We evaluated the predictive performance of an RF classifier
across weeks for each course for the best ensemble feature
set for that course. Figure 5 illustrates the predictive performance
across weeks for all four courses. Considering the
same course across different settings (flipped or MOOC), it
can be observed that RFs trained on flipped course data
achieved higher AUC scores than their MOOC counterparts.
This difference can be due to multiple reasons, for example
the different educational setting or the way the passing
rule for the course is set up. Considering courses in the same
setting (LA-Flip VS FP-Flip or LA-MOOC VS FP-MOOC),
the results show that including quizzes in the LA-Flip course
increases the predictive power of the considered classifiers,
compared to FP-Flip, which has no quizzes. This can
be attributed to the fact that quizzes are a good source of
information for grasping the students’ performance. The same
observation is, however, less strong on the MOOC counterpart
of the same two courses, highlighting again the high
dependency on the educational setting.


In the second part of this experiment, we analyzed the average
importance across weeks of the features selected by
RFs across courses. Table 2 shows, for each feature set and
course, whether a given feature has been selected by the
corresponding RF classifier. It should be noted that this table
includes only features picked by at least one RF classifier across
courses. In general, we show that while there is some overlap
between the optimal features across courses, the importance
of the features highly depends on the setting and structure of
the course. The ratio of clicks between weekends and weekdays
(ChenCui - RatioClicksWeekendWeekdays) is selected
by all classifiers in all settings. Other features with a good
level of generalizability are those measuring
regularity (BoroujeniEtAl). The other features were picked
according to the setting or the structure of the course. In
particular, RFs trained on LA-Flip and LA-MOOC assigned
a higher importance to features that measure behavior in
quizzes (e.g., Ours or WanEtAl). Hence, we can conclude
that when available, features on quizzes are frequently
selected, regardless of the setting. For courses with no quizzes,
namely FP-Flip and FP-MOOC, the predictive power of RFs
is mainly based on regularity and fine-grained video usage
(e.g., features on time spent in a video, such as those from LalleConati).


Highlight #5. When quizzes are included in the schedule, 
quiz-related features are frequently selected as important. 
This is stronger in flipped than MOOC settings. When only 
videos are available, the predictive power mainly derives 
from regularity and fine-grained video-related features. 


For the same course in different settings, namely LA-Flip VS
LA-MOOC and FP-Flip VS FP-MOOC, the optimal feature
set changed considerably. In LA courses, quiz-related features
were more important in the flipped context, while
session-based features were more important in MOOCs (e.g., those
from ChenCui). The latter finding holds for FP courses as
well. Specifically, RFs trained on the MOOC version
consistently selected features related to the students’ session.
Another observation for FP is that in the flipped version,
tri-grams (ApkinarEtAl) and fine-grained video usage features
(LalleConati) were picked; in the MOOC, regularity
and session-based features were more important. To sum
up, according to Table 3:


Highlight #6. Predictors in flipped settings often rely on 
features based on tri-grams and fine-grained video consump- 
tion. Conversely, predictors in MOOCs consider regularity 
and session-based features as important. Quiz-related features
are picked in both settings, when quizzes are available.


6. DISCUSSION 


In this section, we connect the main findings coming from 
the individual experiments and present the implications and 
limitations of our study in the early success prediction task. 


Course-Related Observations. A challenge, as our work shows,
lies in the generalizability of feature predictive power across
courses. The variability of the results when repeating the 
exact same experiment with data from different courses (or 
slightly different settings) is very high. It is therefore chal- 
lenging to understand when, why, and how a feature tested 
on a given course could be re-used for other courses. 


Highlight #7. The predictive power of features does not of- 
ten generalize across courses with different structures and 
educational settings. This observation is stronger with respect
to the course structure than between flipped and
MOOC settings.


This observation affects the scalability of early predictors.
Because predictive power is so course-dependent, identifying
and enabling features predictive of student success for a given
course can take hours or days, given that the intellectual and
experimental work needs to be replicated course by course.


Highlight #8. The lack of generalizability of feature predictive
power calls into question the extent to which a feature can be
scaled across courses with the same structure/setting.


Our experiments also showed that including quizzes in pre- 
class activities leads to substantial improvements in effec- 
tiveness. Hence, success prediction is driven by complex re- 
lationships between students’ characteristics and the course 
domain, structure, and educational setting. 


Data-Related Observations. Research in the area of early 
success prediction is often conducted on data extracted from
online activities only. Even in our case study (for LA-Flip),
we could not rely on data collected in class, missing an im- 
portant segment of learning. Moreover, clickstreams in this 
study do not cover other relevant interactions such as those 
in forums. In flipped courses, most (non-digitalized) dis- 
cussions happen in class, and the forum is mainly used by 
teachers for announcements. 


Highlight #9. Early success prediction in flipped courses 
would benefit from including data coming from offline ac- 
tivities (e.g., in class). 


Workflow-Related Observations. To establish reproducibil- 
ity, the description of the proposed features should go be- 
yond plain-text only. Our formulation in this paper can be 
re-used to define features as formulas, making it easier to 
replicate them, especially when no source code is provided. 


Highlight #10. Feature descriptions can be accompanied 
by their mathematical formulation to ease reproducibility. 
When possible, sharing the code can facilitate their re-use. 


Though we validated the current features on RFs, other 
classifiers were not presented. However, RFs often provide 
the best trade-off between effectiveness and interpretability 
(the latter was fundamental for our study) and our frame- 
work makes it easy to run this analysis on other classifiers. 
Given that other classifiers (e.g., Support Vector Machines) 
gave worse (or comparable) results in the preliminary exper- 
iments we ran, our results depict a valid picture of feature 
predictive power. 


7. CONCLUSIONS AND FUTURE WORK 


In this paper, we analyzed recent features for early success 
prediction in flipped and online courses. First, we inves- 
tigated the predictive power of eight existing feature sets 
and a novel feature set proposed in this paper on a flipped 
course. We benchmarked the predictive power of features 
using a RF classifier, and discussed the ensemble feature set 
optimal for that course. We then extended our analysis to 
courses with other settings (MOOCs), domains, and struc- 
tures, showing that the optimal ensemble and its predictive 
power vary. Our work calls for generalizable early predictors 
across courses with different characteristics. To promote re- 
search in this field, we also publicly release the source code 
developed during our study (see the footnote in Section 1). 


In future work, we plan to extend our analysis to other fea- 
tures (e.g., based on in-class data), and types of student 
success tasks (e.g., grade prediction). We also plan to ana- 
lyze more advanced classifiers and to devise robust classifiers 
across courses before testing them in the real world.


ABSTRACT


Interactive simulations allow students to independently ex- 
plore scientific phenomena and ideally infer the underlying 
principles through their exploration. Effectively using such
environments is challenging for many students and, therefore,
adaptive guidance has the potential to improve student learning.
Providing effective support is, however, also a challenge
because it is not clear what effective inquiry in such
environments looks like. Previous research in this area has
mostly focused on grouping students with similar strategies 
or identifying learning strategies through sequence mining. 
In this paper, we investigate features and models for an early 
prediction of conceptual understanding based on clickstream 
data of students using an interactive Physics simulation. To 
this end, we measure students’ conceptual understanding 
through a task they need to solve through their exploration. 
Then, we propose a novel pipeline to transform clickstream 
data into predictive features, using latent feature represen- 
tations and interaction frequency vectors for different com- 
ponents of the environment. Our results on interaction data 
from 192 undergraduate students show that the proposed 
approach is able to detect struggling students early on. 


Keywords 
skip-grams, early classification, interactive simulations, con- 
ceptual understanding 


1. INTRODUCTION 


Over the last few years, interactive simulations have been increasingly
used for science education (e.g., the PhET simulations
alone are used over 45M times a year [1]). Interactive
simulations allow students to engage in inquiry-based learn- 
ing: they can design experiments, take measurements, and 
test their hypotheses. Ideally, students discover the prin- 
ciples and models of the underlying domain through their 
own exploration [2], but students often struggle to effec- 
tively learn in such environments [3, 4, 5]. A possible reason 
for this is that interactive simulations are usually complex
and unstructured environments allowing students to choose
their own action path [6]. Providing adaptive guidance to
students therefore has the potential to improve learning outcomes.


Jade Cock, Mirko Marras, Christian Giang and Tanja Käser “Early
Prediction of Conceptual Understanding in Interactive Simulations”. 2021.
In: Proceedings of The 14th International Conference on Educational Data
Mining (EDM21). International Educational Data Mining Society, 161-171.
https://educationaldatamining.org/edm2021/

EDM ’21 June 29 - July 02 2021, Paris, France

Christian Giang, EPFL, christian.giang@epfl.ch
Tanja Käser, EPFL, tanja.kaeser@epfl.ch


Implementing effective support in interactive learning environments
is a challenge in itself: the complexity of the
environment makes it difficult to define a priori what successful
student behaviour looks like. Previous research has
focused on leveraging sequence mining and clustering tech- 
niques to identify the key features of successful interactions. 
For example, [7] have used an information theoretic sequence 
mining approach to detect differences in the interaction se- 
quences of students with high and low prior knowledge, 
while [8] investigated the effects of prior knowledge activa- 
tion. Other work [9] focused on detecting behaviours leading 
to the design of a correct causal explanation. [10] identified 
key factors for successful inquiry: focusing on an unknown 
component and building contrastive cases. Similarly, [11] 
found that the identification of the dependent variable and 
its isolated manipulations lead to a better quantitative un- 
derstanding of the phenomena at hand. Another technique 
is to manually categorise students’ log data and use the 
tags as ground truth for a classifier of successful inquiry 
behaviour [12]. [13] developed a dashboard displaying in- 
formation about the mined sequences to guide teachers in 
building their lessons. 


More work has focused on analysing and predicting students’ 
strategies in different types of open ended learning environ- 
ments (OELEs), such as educational games. Prior research 
in that domain has, for example, investigated students’ prob- 
lem solving behaviour [14], analysed the effect of scaffolding 
on students’ motivation [15], extracted strategic moves from 
video learning games [16], detected different types of confu- 
sion [17], or identified students’ exploration strategies [18]. 


Most of the previous work on OELEs has performed a pos- 
teriori analyses. However, in order to provide students with 
support in real-time, we need to be able to detect strug- 
gling students early on. Due to the lack of clearly defined 
student trajectories and underlying skills, building a model 
of students’ learning in OELEs is challenging. A promis- 
ing approach for early prediction in OELEs is the use of a 
clustering-classification framework [19]: in the (first) offline 
step, students are clustered based on their interaction data 
and the clustering solution is interpreted. The second step 
is online: students are assigned to clusters in real-time. This
framework has been successfully applied to analyse and predict
students’ trajectories in mathematics learning [20], to
differentiate between ‘high’ and ‘low’ learners [21], to build
student models for interactive simulations [22], or to predict
students’ exploration strategies in an educational game [23].


Figure 1: User interface of the PhET Capacitor Lab with different plate parameters for a closed (A) and an open circuit (B).
The two initial closed circuit configurations and the resulting four open circuit configurations presented to the participants in
the capacitor ranking task (C). (Simulation image by PhET Interactive Simulations, University of Colorado Boulder, licensed
under CC-BY 4.0, https://phet.colorado.edu).


In this paper, we aim to predict conceptual understanding
early, based on students’ log data from an interactive
Physics simulation. All our analyses are based on data collected
from 192 undergraduate Physics students interacting
with a PhET simulation. We propose a novel pipeline for 
transforming clickstream data into predictive features using 
latent feature representations and frequency vectors. Then, 
we extensively evaluate and compare various combinations 
of predictive algorithms and features on different classifica- 
tion tasks. In contrast to previous work using unsupervised 
clustering to obtain student profiles [21, 20, 23, 22], our 
learning activity with the simulation includes a task specifi- 
cally designed to assess students’ conceptual understanding. 
With our analyses, we address three research questions: 1) 
Can students’ interaction with the data be associated with 
the gained conceptual understanding? 2) Can conceptual 
understanding be inferred through sequence mining methods
with embeddings? 3) Can the proposed methods be
used to predict students’ conceptual understanding early,
based on partial sequences of interaction data?


Our results show that all tested models are able to predict 
students’ conceptual knowledge with a high AUC when ob- 
serving students’ full sequences (offline). The best models 
are also able to detect struggling students early on and to 
provide a more fine-grained prediction of students’ concep- 
tual knowledge later during interaction. 


2. CONTEXT AND DATA 


All experiments and evaluations of this paper were con- 
ducted using data from students exploring an interactive 
simulation. In the following, we describe the learning activ- 
ity, the data collection, and the categorisation of students’ 
conceptual understanding at the end of the learning activity. 


Learning Activity. The data for this work was collected in 
a user study where participants were asked to engage in 
an inquiry-based learning activity with the PhET Capacitor 
Lab simulation¹. The Capacitor Lab is an interactive simulation
with a simple and intuitive interface allowing users
to explore the principles behind a plate capacitor (Fig. 1A 
and B). Specifically, students can load the capacitor by ad- 
justing the battery and observe how the capacitance and 
the stored energy of the capacitor change when adjusting 
the voltage, the area of the capacitor plates or the distance 
between them. After loading the capacitor, the circuit can 
be opened through a switch, and students can again observe 
how manipulation of the different components influences ca- 
pacitance and stored energy. Moreover, the simulation pro- 
vides a voltmeter, while check boxes in the interface allow 
users to enable or disable visualisations of specific measures. 


Based on this simulation, a learning activity was designed 
in which participants had to explore the relationships be- 
tween the different components of the circuit and rank four 
different capacitor configurations by the amount of stored 
energy. The configurations were generated based on two ini- 
tial setups (I and II, respectively) representing capacitors in 
a closed circuit with different settings for battery voltage,
plate area and separation (Fig. 1C). For each initial setup,
two open circuit configurations were generated (i.e. configurations
1 & 2 from setup I and 3 & 4 from setup II) by
opening the switch and then changing the values for plate
area and separation. To complete this ranking task, participants
were allowed to use the simulation for as much time as
they needed. It should be noted that the values in the ranking
task were chosen outside the ranges of the adjustable
values in the simulation such that students could not simply
reproduce the four configurations in the simulation, but had
to solve the task by figuring out the relationships between
the different components and the stored energy.


¹ https://phet.colorado.edu/en/simulation/capacitor-lab-basics


Data Collection. Data was collected from 214 first-year un- 
dergraduate Physics students who completed the capacitor 
ranking task as part of a homework assignment. While 
working with the simulation, students’ interaction traces 
(i.e. clicks on check boxes, dragging of components, moving 
of sliders) were automatically logged by the environment. 
Moreover, it recorded students’ final answers in the rank- 
ing task (i.e. the ranking of the four configurations). All 
data was collected in a completely anonymous way and the 
study was approved by the responsible institutional review 
board prior to the data collection (HREC number: 050- 
2020/05.08.2020). After a first screening of the data, several 
log files (9) were excluded because of inconsistencies in the 
data, and another 13 because they had barely any interac- 
tion (less than 10 clicks) with the environment. Removing 
these data points resulted in a data set of 192 students used 
for our analyses. 


Categorisation of conceptual understanding. The design of
the ranking task allows us to relate students’ responses to their
conceptual understanding of a capacitor. For this purpose,
we analysed the 16 (out of 24 possible) rankings submitted
by the students with regard to conceptual understanding
and grouped them accordingly. To this end, three concepts
of understanding associated with the functioning of
capacitors were evaluated in a top-down approach: answers 
were first separated by those representing an understand- 
ing of both the open and closed circuit (label both), and 
those only representing an understanding of the closed cir- 
cuit (label closed). For those answers representing an under- 
standing of both the open and closed circuit, two cases were 
distinguished. It was assumed that students who chose the 
only correct ranking of the configurations (“4213”) gained an 
exhaustive understanding of the underlying concepts (label 
correct). Students who instead chose one of the other rank- 
ings were assumed to know how plate area and separation 
influence the stored energy in both the open and closed cir- 
cuits, but failed to discover the influence of voltage on stored 
energy (label areasep). Within those answers that only rep- 
resented an understanding of the capacitor’s functioning in 
the closed circuit, we also distinguished between two cases. 
The first case represents the answer that would be consid- 
ered correct if the task was to order the four configurations 
by capacitance instead of energy (“1324”, label capacitance). 
Interestingly, 47 students (i.e. 24% of all students) submit- 
ted this ranking as an answer. The second case represents all 
other possible answers (label other) that could be submitted 
if (a part of) the closed circuit was understood. 


Based on these three underlying concepts, we generated a
decision tree with four leaves (each representing a group
with similar conceptual understanding) and mapped all 16
rankings submitted by the students to the leaves (Fig. 2).
These generated class labels will serve as ground truth labels
for the classification task presented in the following sections.


Circuit understanding
  Open and closed (BOTH): exhaustive understanding?
    Yes (CORRECT): 4213 (38)
    No (AREASEP): 4231 (25), 2431 (2), 4321 (10), 2413 (20), 2143 (3)
  Closed only (CLOSED): ranking performed by
    Capacitance (CAPACITANCE): 1324 (47)
    Other (OTHER): 1243 (5), 1342 (3), 3412 (4), 3124 (3), 1234 (27), 4123 (1), 4132 (2), 1432 (1), 134 (1)

Figure 2: Tree used to map the 16 different rankings submitted
by the students to class labels associated with conceptual
understanding of a capacitor. The different class labels are
indicated in capitalised letters. The numbers in parentheses
indicate the number of submissions for each ranking.
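
As a minimal sketch, the mapping of Fig. 2 can be written as a lookup table. The variable names are illustrative and not taken from the authors' code; the rankings and counts are copied from the figure, including one ranking that appears truncated ("134") in the extracted figure and is kept as-is.

```python
# Submissions per ranking, grouped by class label (numbers from Fig. 2).
SUBMISSIONS = {
    "correct": {"4213": 38},
    "areasep": {"4231": 25, "2431": 2, "4321": 10, "2413": 20, "2143": 3},
    "capacitance": {"1324": 47},
    "other": {"1243": 5, "1342": 3, "3412": 4, "3124": 3, "1234": 27,
              "4123": 1, "4132": 2, "1432": 1, "134": 1},
}

# Ground-truth label for each submitted ranking.
RANKING_TO_LABEL = {
    ranking: label
    for label, rankings in SUBMISSIONS.items()
    for ranking in rankings
}

# Number of students per class label.
class_sizes = {label: sum(r.values()) for label, r in SUBMISSIONS.items()}
total_students = sum(class_sizes.values())
```

Summing the counts gives a quick consistency check against the 192 students retained after data cleaning.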


3. METHOD 


Using our proposed approach, we are interested in predicting 
the conceptual understanding students gain from interact- 
ing with the simulation. Therefore, we are solving a super- 
vised classification problem, i.e. we aim at predicting the 
class labels (representing students’ conceptual understand- 
ing) based on the observed student interactions. Our model 
building process to solve this classification problem consists 
of four steps (Fig. 3). We first extract the raw clickstream 
events from the logs and process them into action sequences. 
We then compute three different types of features for each 
action sequence and feed them into our classifiers. 


Event Logs. From the simulation logs, we extract the click- 
stream data of each student s as follows: anything between 
a mouse click/press and a mouse release qualifies as an event, 
while anything between a mouse release and mouse click/press 
is called a break. Each event is then labelled by the component
the user was interacting with at the mouse click, and
chronologically arranged with the breaks into a sequence. 
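
The segmentation into events and breaks can be sketched as follows. The log format used here, a list of `(timestamp, kind, component)` tuples with `kind` being `"press"` or `"release"`, is an assumption made for illustration; the actual simulation log schema is not described in this paper.

```python
def segment(log):
    """Split a chronologically ordered mouse log into events and breaks.

    `log` is a list of (timestamp_seconds, kind, component) tuples, where
    kind is "press" or "release" (a hypothetical format for illustration).
    Returns a list of (type, component, duration_seconds) tuples.
    """
    sequence = []
    press_time, press_component, last_release = None, None, None
    for timestamp, kind, component in log:
        if kind == "press":
            if last_release is not None:
                # Time between a release and the next press is a break.
                sequence.append(("break", None, timestamp - last_release))
            press_time, press_component = timestamp, component
        elif kind == "release" and press_time is not None:
            # Time between a press and its release is an event, labelled
            # by the component under the mouse at the press.
            sequence.append(("event", press_component, timestamp - press_time))
            last_release, press_time = timestamp, None
    return sequence


log = [(0.0, "press", "voltage"), (1.5, "release", "voltage"),
       (4.0, "press", "plate_area"), (4.5, "release", "plate_area")]
sequence = segment(log)
```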


Action Processing. We distinguish three main components 
on the platform, whose values can be changed: 1) the voltage, 
2) the separation between the plates, and 3) the area of the 
plates. An action on these components can be conducted in 
a) an open circuit or b) a closed circuit, with the stored
energy information display i) on or ii) off. We categorise 
each event involving these main actions by the combination 
of: the action on the component {1), 2), 3)}, the circuit 
state {a), b)}, and the stored energy display {i), ii)}. Any 
other event is categorised as 4) other. 
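
To make the count of categories concrete, the sketch below enumerates the combinations described above. The string labels are hypothetical names chosen for readability; only the structure (3 components x 2 circuit states x 2 display states, plus "other") comes from the text.

```python
from itertools import product

COMPONENTS = ("voltage", "separation", "area")
CIRCUIT_STATES = ("open", "closed")
ENERGY_DISPLAY = ("on", "off")

# 3 components x 2 circuit states x 2 display states = 12, plus "other".
CATEGORIES = list(product(COMPONENTS, CIRCUIT_STATES, ENERGY_DISPLAY))
CATEGORIES.append(("other",))


def categorise(component, circuit_state, energy_display):
    """Return the category index of an event; unknown components map to 'other'."""
    key = (component, circuit_state, energy_display)
    if key in CATEGORIES:
        return CATEGORIES.index(key)
    return CATEGORIES.index(("other",))
```

For instance, `categorise("voltage", "open", "on")` returns the first index, while an interaction with any component outside the three main ones falls into the last ("other") category.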


The sequence of each student s is now composed of chronologically
ordered events (divided into the 13 different categories
listed above) separated by breaks. The breaks may
be caused by the student being inactive due to observing
the progression of a value, reflecting on an observation,
or taking notes. Due to its definition as the period between
a mouse release and a mouse click/press, a break may also
appear for logistical reasons, such as moving the mouse
from one component in the simulation to another.
Indeed, the students’ event sequences consist of many
short breaks (Fig. 4). Our assumption is that these very short
breaks, like stop words in natural texts, do not contain much
information despite being very frequent in our sequences.
In fact, our classification of students’ understanding may
be impaired if those noisy states are not removed, as is
the case for sentiment analysis when stop words are not
deleted [24]. To determine the threshold below which breaks
are removed, we plot the distribution of inactivity periods
and cut at the elbow of the curve for each student, which
corresponds to a delimitation at 60%, i.e. for each student
we keep the top 40% of breaks. We then categorise each of
our remaining breaks similarly to our main action events: by
component 5) break, circuit state {a), b)}, and stored energy
display {i), ii)}, resulting in four different break categories.


Figure 4: Distribution of break lengths in seconds, across all
students. The bold gray line denotes the 60% threshold.
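
The per-student break filtering can be sketched as below. The function name and the tie-breaking rule are assumptions for illustration; only the 60% cut (keeping the longest 40% of each student's breaks) comes from the text.

```python
def keep_long_breaks(break_lengths, drop_percent=60):
    """Return the indices (in chronological order) of the breaks to keep.

    The shortest `drop_percent` percent of one student's breaks are
    discarded; the remaining, longest breaks are kept.
    """
    n_drop = len(break_lengths) * drop_percent // 100
    n_keep = len(break_lengths) - n_drop
    # Rank breaks from longest to shortest; ties are broken by position.
    ranked = sorted(range(len(break_lengths)),
                    key=lambda i: (-break_lengths[i], i))
    return sorted(ranked[:n_keep])


# Break durations (seconds) of one hypothetical student.
breaks = [0.2, 5.0, 0.4, 0.1, 12.0, 0.3, 3.0, 0.5, 0.2, 8.0]
kept = keep_long_breaks(breaks)
```

With ten breaks, six are dropped and the four longest ones (5.0, 12.0, 3.0 and 8.0 seconds) remain in the sequence.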


The resulting sequence $r_s$ for student $s$ is the chronological
timeline of the student’s events and breaks, divided into 17
categories. We refer to this timeline $r_s$ as the raw sequence
of interactions for the rest of the paper. We denote the
length of $r_s$ (corresponding to the total number of interactions
of student $s$) with $N_s$, i.e. $|r_s| = N_s$. On average, these
sequences have a length of $N_s = 67.86 \pm 42.56$. In terms of
seconds, the sequences $r_s$ lasted on average $512.18 \pm 435.57$
seconds. We also introduce the notion of time, which we define
in relation to a student’s interactions: at time $t$ the length
of the raw sequence of interactions for student $s$ is $t$, i.e.
$|r_{s,t}| = t$. We denote the maximum time of student $s$ with
$T_s$, corresponding to the full sequence $r_s$. Figure 5 visualises
the timelines for an exemplary student of each class label.
It can be observed that for these examples, certain aspects
of conceptual understanding could be inferred by visual
inspection (e.g. the capacitance student never activated the
check box to visualise the stored energy). However, other
differences in conceptual understanding are more difficult to
detect by humans (e.g. the differences between the correct
and areasep students).


Figure 5: Timeline of an exemplary student for each class label
displaying the chronological sequence of interactions with
the three main components of the simulation. The green and
orange bars indicate whether the student displayed the capacitance
and stored energy. The background indicates whether
the interactions were conducted in a closed (grey) or open
(white) circuit.


Feature Creation. Next, we transform the interactions in 
each sequence to obtain three different types of features: 
Action Counts, Action Span, and Pairwise Embeddings. 


To obtain the Action Counts features $F_{AC,s}$ for a student $s$,
we first transform each interaction within the raw sequence
$r_s$ into a one-hot encoded vector, resulting in a 17-dimensional
vector $h_{s,i}$ for each interaction $i$ and hence, a sequence of
vectors $H_s = \{h_{s,i}\}$ with $i = 1,\dots,N_s$. To compute $f_{AC,s,t}$
for student $s$ at time step $t$, we compute the average over
$h_{s,i}$ with $i = 1,\dots,t$: $f_{AC,s,t} = \frac{1}{t}\sum_{i=1}^{t} h_{s,i}$. By using this
aggregation technique, our features are translated to the (averaged)
number of times each student has interacted in each
of our categories. Therefore, for each student $s$, we end up
with a feature set $F_{AC,s} = \{f_{AC,s,t}\}$ with $t = 1,\dots,T_s$.
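
A minimal sketch of the Action Counts computation, assuming the raw sequence is given as a list of category indices in 0..16 (the function name and input encoding are illustrative):

```python
def action_counts(raw_sequence, n_categories=17):
    """Return [f_AC(1), ..., f_AC(N_s)] for one student.

    Each feature vector is the mean of the one-hot vectors of the first
    t interactions, computed incrementally with a running sum.
    """
    features, running_sum = [], [0.0] * n_categories
    for t, category in enumerate(raw_sequence, start=1):
        running_sum[category] += 1.0  # adding the one-hot vector h_{s,i}
        features.append([value / t for value in running_sum])
    return features


raw_sequence = [0, 3, 0, 16]
features_ac = action_counts(raw_sequence)
```

Here the last feature vector has 0.5 in category 0 and 0.25 in categories 3 and 16, i.e. the share of interactions falling into each category so far.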


The computation of the Action Span features $F_{AS,s}$ for a student
$s$ is very similar. Rather than looking into the number
of times a student $s$ has interacted with each of our components
in a particular state, we look into the amount of time
(in seconds) $s$ has spent in each of the categories. We first
transform every interaction $i$ within $r_s$ into a 17-dimensional
vector $h_{s,i}$: this vector is 0 for all dimensions but the
dimension $d$ corresponding to the category the $i$-th interaction
belongs to. This representation is similar to a one-hot
encoded vector, but instead of filling in a 1 at dimension $d$,
we fill in the duration (in seconds) of interaction $i$. This
results in a sequence of vectors $H_s = \{h_{s,i}\}$ with $i = 1,\dots,N_s$.
We then compute the feature vector $f_{AS,s,t}$ for student $s$ at
time $t$ in two steps. In a first step, we again average over all
vectors $h_{s,i}$ up to time $t$, leading to $\tilde{f}_{AS,s,t} = \frac{1}{t}\sum_{i=1}^{t} h_{s,i}$.
We then normalise $\tilde{f}_{AS,s,t}$ to obtain $f_{AS,s,t}$. By using this
aggregation technique, our feature vector $f_{AS,s,t}$ represents
the relative amount of time student $s$ has spent in each
category up to time $t$. For each student $s$, we end up with
a feature set $F_{AS,s} = \{f_{AS,s,t}\}$ with $t = 1,\dots,T_s$.
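
The Action Span features can be sketched in the same way. The normalisation used below (dividing by the sum of the entries, so that each vector sums to 1) is an assumption; the text only states that the averaged vector is normalised to represent relative time.

```python
def action_span(raw_sequence, n_categories=17):
    """Return [f_AS(1), ..., f_AS(N_s)] for one student.

    `raw_sequence` is a list of (category_index, duration_seconds) pairs.
    """
    features, running_sum = [], [0.0] * n_categories
    for t, (category, duration) in enumerate(raw_sequence, start=1):
        running_sum[category] += duration  # duration-weighted "one-hot"
        averaged = [value / t for value in running_sum]
        total = sum(averaged)
        # Assumed normalisation: share of the total time per category.
        features.append([v / total if total > 0 else 0.0 for v in averaged])
    return features


raw_sequence = [(0, 2.0), (3, 1.0), (0, 1.0)]
features_as = action_span(raw_sequence)
```

In this toy sequence the student spends 3 of 4 seconds in category 0, so the last feature vector holds 0.75 for category 0 and 0.25 for category 3.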


The third feature, Pairwise Embeddings, is fundamentally
different from the two other features: we replace each
interaction in the raw sequence by an embedding vector which
we obtain by training a pairwise skip-gram [25]. The
architecture of such a network consists of two dense layers: an
embedding layer followed by a classification layer. Usually
applied in natural language processing (NLP), its primary
goal is to predict the context of a word. Here, our pairwise
skip-gram attempts to predict the behavior of a student in
the simulation before and after performing a specific
interaction. The skip-gram model can be formulated as:


$$p = \mathrm{softmax}(W_2 \cdot (W_1 \cdot a)) \qquad (1)$$


It takes $a$, an interaction we wish to predict the context of,
as an input, and outputs $p$, a probability vector which
contains the likelihood of $a$ being surrounded by each possible
interaction. $W_1$ and $W_2$ represent the weight matrices
(embeddings). For each interaction $a$, we feed $2 \cdot w$ pairs into
the network, where $w$ is the so-called window size of the
model (context). The first element of each pair is $a$. The
second element of each pair, the ground truth label, is one
of the $w$ interactions preceding or following $a$. For example,
a window size of $w = 2$ would yield the following pairs for
action $a$: $(a, a_{-2})$, $(a, a_{-1})$, $(a, a_{+1})$, $(a, a_{+2})$.
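
The pair construction described above can be sketched in a few lines. This is an illustration only: the function name is hypothetical, and dropping neighbours that fall outside the sequence is an assumption, since the text does not say how sequence boundaries are handled.

```python
def skipgram_pairs(sequence, w=2):
    """Build (input, context) pairs for a window size w.

    Assumption: neighbours outside the sequence are skipped, so
    interactions near the start or end yield fewer than 2 * w pairs.
    """
    pairs = []
    for i, action in enumerate(sequence):
        for j in range(-w, w + 1):
            if j == 0:
                continue  # an interaction is not paired with itself
            k = i + j
            if 0 <= k < len(sequence):
                pairs.append((action, sequence[k]))
    return pairs


seq = ["a1", "a2", "a3", "a4", "a5"]
pairs = skipgram_pairs(seq, w=2)
middle = [p for p in pairs if p[0] == "a3"]
```

For the middle interaction `a3`, the sketch yields exactly the four pairs of the example with $w = 2$.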


In our case, to obtain the set of pairs $I_s$ for a student $s$,
we first again transform each interaction within the raw
sequence $r_s$ into a one-hot encoded vector, resulting in a
17-dimensional vector $h_{s,i}$ for each interaction $i$ and, hence,
a sequence of vectors $H_s = \{h_{s,i}\}$ with $i = 1, \ldots, N_s$. We
then build the input pairs for each interaction $i$, i.e.
$I_{i,s} = \{(h_{s,i}, h_{s,i+j})\}$ with $j \in \{-w, \ldots, w\} \setminus \{0\}$.
We obtain the set of pairs for all students as $I = \{I_s\}$ with
$s = 1, \ldots, S$, where $S$ is the total number of students.

Figure 6: Skip-gram architecture. After training of the
skip-gram, the corresponding row of the weight matrix $W_1$
represents a structure preserving embedding of action $a$.


After training the skip-gram model on $I$, we build the
features $F_{PW,s}$ for each student $s$. The weights $W_1$ of the hidden
layer represent a structure preserving embedding of the
interactions. In our case, $W_1$ has dimensions $17 \times D$ ($D$ is the
embedding dimension). For each interaction $i$ of student $s$,
we get the corresponding row $r$ in $W_1$, i.e. $r_{s,i} = W_1 \cdot h_{s,i}$.
This again results in a sequence of vectors $R_s = \{r_{s,i}\}$ with
$i = 1, \ldots, N_s$. To compute $f_{PW,s,t}$ for student $s$ at time
step $t$, we compute the average over $r_{s,i}$ with $i = 1, \ldots, t$:
$f_{PW,s,t} = \frac{1}{t} \sum_{i=1}^{t} r_{s,i}$. Therefore, for each
student $s$, we end up with a feature set
$F_{PW,s} = \{f_{PW,s,t}\}$ with $t = 1, \ldots, T_s$.
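
As a minimal sketch of this averaging step, the snippet below looks up the embedding row of each interaction and keeps a running mean. The matrix `W1_toy` is a made-up 3 x 2 example, not trained weights, and the function name is hypothetical.

```python
def pairwise_embedding_features(categories, W1):
    """Return the running mean of embedding rows for t = 1..N.

    categories: list of category indices, one per interaction.
    W1: embedding matrix as a list of rows (one row per category).
    Multiplying W1 with a one-hot vector amounts to selecting one row.
    """
    dim = len(W1[0])
    running_sum = [0.0] * dim
    features = []
    for t, category in enumerate(categories, start=1):
        row = W1[category]
        running_sum = [a + b for a, b in zip(running_sum, row)]
        features.append([x / t for x in running_sum])
    return features


W1_toy = [[1.0, 0.0], [0.0, 1.0], [2.0, 2.0]]  # illustrative values only
pw_feats = pairwise_embedding_features([0, 1, 2], W1_toy)
```

Because only a running sum is stored, the feature for every time step is obtained in a single pass over the sequence.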


Classification. To perform the classification task, we explore 
two different approaches: Random Forests and Fully Con- 
nected Deep Neural Networks. 


Random Forests (RFs) are simple, yet powerful machine 
learning algorithms. They consist of an ensemble of decision 
trees, each trained on a different subset of samples and a dif- 
ferent subset of features. The decisions of each tree are then 
aggregated to determine the final prediction of a sample. 
The strength of this method is that overfitting is prevented 
through the randomisation of training samples and features 
during the training of each tree and that the strengths of 
several good classifiers are exploited. While RF classifiers 
are well tested and efficient to train, they require the input 
features to have the same dimension for every sample. We 
therefore train separate RF models for each time step $t$. The
input features for the RF model for time step $t$ are $\{f_{M,s,t}\}$,
with $M \in \{AC, AS, PW\}$ and $s = 1, \ldots, S$, where $S$ denotes
the number of students. The output of the RF is a vector
$p_{RF,M,s,t}$ of dimension $C$ (with $C$ denoting the number of
classes) for each student $s$, which represents the probability
of each class.


Neural Networks (NNs) were built with the idea of emulating
neurons firing in our brain: their nodes play the role of
neurons and their edges that of axons. The advantage
of deep networks is that they are able to model non-linear
decision boundaries. However, the backpropagation
calculations make them relatively slow to train. In this work,
we use a Fully Connected Deep Neural Network consisting
of $d$ hidden dense layers and one classification layer with a
softmax activation. Similar to RFs, our NN model requires
features to have the same dimension for all the samples. We
therefore also train the NN models for fixed points in time.
The input features for the NN model for time step $t$ are
$\{f_{M,s,t}\}$, with $M \in \{AC, AS, PW\}$ and $s = 1, \ldots, S$, where $S$
denotes the number of students. Due to the softmax activation,
the output of the NN is a vector $p_{NN,M,s,t}$ of dimension
$C$ (with $C$ denoting the number of classes) for each student
$s$, which represents the probability of each class.


4. EXPERIMENTAL EVALUATION 


We evaluated the predictive performance of our classifica- 
tion pipeline on the collected data set (see Section 2). We 
conducted experiments to compare the performance of dif- 
ferent model and feature combinations for the students’ full 
data sequences as well as for early classification using partial 
sequences of the data, to answer our research questions. 


Experimental Setup. We applied a train-test setting for all 
the experiments, i.e. parameters were fitted on a training 
data set and performance of the methods was evaluated on 
a test data set. Predictive performance was evaluated using 
the macro-averaged area under the ROC curve (AUC). We 
used the AUC as a performance measure as it is robust to 
class imbalance. 
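
For readers unfamiliar with the metric, the following sketch computes a macro-averaged AUC from class probabilities. It assumes a one-vs-rest decomposition with an unweighted mean over classes, which is one common reading of "macro-averaged"; the text does not state which variant was used, and the probabilities below are invented.

```python
def binary_auc(scores, labels):
    """AUC as the fraction of (positive, negative) pairs ranked correctly.

    Ties count as half a correct pair.
    """
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


def macro_auc(prob_rows, true_classes, num_classes):
    """Unweighted mean of one-vs-rest AUCs (assumed averaging scheme)."""
    aucs = []
    for c in range(num_classes):
        scores = [row[c] for row in prob_rows]
        labels = [1 if y == c else 0 for y in true_classes]
        aucs.append(binary_auc(scores, labels))
    return sum(aucs) / num_classes


# Invented class probabilities for four students and three classes.
probs = [[0.7, 0.2, 0.1], [0.1, 0.6, 0.3], [0.2, 0.3, 0.5], [0.5, 0.2, 0.3]]
truth = [0, 1, 2, 1]
score = macro_auc(probs, truth, num_classes=3)
```

Since every class contributes equally to the mean regardless of its size, small classes are not dominated by large ones.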


We performed all our experiments using different levels of 
detail for the classification task. As ground truth, we used 
the class labels presented in Fig. 2. Given the hierarchical 
nature of the decision tree separating the students in classes 
based on their conceptual knowledge, we performed the clas- 
sification task focusing on three different levels of detail: 


• 2-class case: starting at the root of the decision tree
(see Fig. 2), we divide students into two classes based
on their understanding of the circuits: both (98 students)
and closed (94 students).


• 3-class case: going one step down in the hierarchy of
the tree, we further divide the left branch of the tree
(see Fig. 2) based on whether the students have completely
understood all concepts (leading to a correct
answer in our ranking task) or not. We therefore obtain
three different classes: correct (38 students), areasep
(60 students), and closed (94 students).


• 4-class case: here, we also split the right branch of
the tree (see Fig. 2) and divide the students into two
groups based on whether they ranked the configurations
in the task based on capacitance, resulting in four
classes: correct (38 students), areasep (60 students),
capacitance (47 students), and other (47 students).
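
The hierarchy of the three cases can be checked with a small sketch that collapses the 4-class labels into the coarser cases. The class names and counts come from the list above; the dictionary and function names are hypothetical.

```python
# Mapping from fine-grained to coarser labels, following the decision tree.
TO_3_CLASS = {
    "correct": "correct",
    "areasep": "areasep",
    "capacitance": "closed",
    "other": "closed",
}
TO_2_CLASS = {"correct": "both", "areasep": "both", "closed": "closed"}

COUNTS_4 = {"correct": 38, "areasep": 60, "capacitance": 47, "other": 47}


def coarsen(counts, mapping):
    """Sum the student counts of all labels that map to the same class."""
    merged = {}
    for label, n in counts.items():
        target = mapping[label]
        merged[target] = merged.get(target, 0) + n
    return merged


counts_3 = coarsen(COUNTS_4, TO_3_CLASS)
counts_2 = coarsen(counts_3, TO_2_CLASS)
```

The merged counts reproduce the class sizes reported for the 3-class and 2-class cases.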


For each of those three cases, we trained two types of classi- 
fiers (RF and NN) on our three different feature types (Ac- 
tion Counts - AC, Action Span - AS, Pairwise Embeddings 
- PW), using a stratified 10-fold nested cross validation. We 
kept the folds invariant across all experiments and strati- 
fied over the classes (according to the class labels of the 
4-class case). Because of class imbalance, we used random 
oversampling for the training sets. We used a nested cross 
validation to avoid potential bias introduced by estimating 
model performance during hyperparameter tuning. This al- 
lowed us to tune the hyperparameters within the training 
folds (by further splitting them) and hold out the test sets 
for performance evaluation alone.

Figure 7: AUC for 4-class, 3-class and 2-class cases using
different model and feature combinations. Predictions are made
at the end of the interaction with the simulation, i.e. based
on the complete sequential interaction data of the students.
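
The outer loop of the cross validation described above (stratified folds, with random oversampling applied to the training part only) can be sketched as follows. This is a simplified illustration, not the authors' code: the fold assignment scheme, seeds, and function names are assumptions, and the inner hyperparameter loop is omitted.

```python
import random
from collections import defaultdict


def stratified_folds(labels, k=10, seed=0):
    """Distribute indices over k folds so each class is spread evenly."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for idx, y in enumerate(labels):
        by_class[y].append(idx)
    folds = [[] for _ in range(k)]
    offset = 0
    for y in sorted(by_class):
        idxs = by_class[y]
        rng.shuffle(idxs)
        for pos, idx in enumerate(idxs):
            folds[(offset + pos) % k].append(idx)
        offset += len(idxs)
    return folds


def random_oversample(indices, labels, seed=0):
    """Duplicate random minority samples until all classes match the largest."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for idx in indices:
        by_class[labels[idx]].append(idx)
    target = max(len(v) for v in by_class.values())
    out = []
    for y in sorted(by_class):
        idxs = by_class[y]
        out.extend(idxs)
        out.extend(rng.choice(idxs) for _ in range(target - len(idxs)))
    return out


# Class sizes of the 4-class case.
labels = ["correct"] * 38 + ["areasep"] * 60 + ["capacitance"] * 47 + ["other"] * 47
folds = stratified_folds(labels, k=10)
test_idx = folds[0]
train_idx = [i for fold in folds[1:] for i in fold]
balanced = random_oversample(train_idx, labels)
balanced_counts = {y: sum(1 for i in balanced if labels[i] == y) for y in set(labels)}
```

Oversampling only draws from the training indices, so no test student is duplicated into the training set.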


For RF models, we tuned the following hyperparameters 
using a grid search: number of trees [5, 7, 9], number of 
features used at each decision level [‘auto’, ‘all’], number 
of samples [bootstrap resampling of training size samples 
and balanced subsamples]. NN models were implemented 
using the scikit-learn library, trained for 300 epochs, and 
optimised for the log-loss function with the following hy- 
perparameters: learning rate [‘adaptive’, ‘invscaling’], initial 
learning rate [0.01, 0.001], solver [‘adam’, ‘sgd’], hidden layer 
sizes and number [(32, 16), (64, 32), (64, 32, 16), (128, 64, 
32, 16)], and activation function [‘relu’, ‘tanh’, ‘identity’]. 
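
To give a sense of the size of the search space, the grids listed above can be enumerated as follows. The dictionary keys are our own descriptive labels rather than library parameter names, and the two sampling options are abbreviated.

```python
from itertools import product

RF_GRID = {
    "n_trees": [5, 7, 9],
    "features_per_split": ["auto", "all"],
    "sampling": ["bootstrap", "balanced_subsample"],
}
NN_GRID = {
    "learning_rate": ["adaptive", "invscaling"],
    "initial_learning_rate": [0.01, 0.001],
    "solver": ["adam", "sgd"],
    "hidden_layer_sizes": [(32, 16), (64, 32), (64, 32, 16), (128, 64, 32, 16)],
    "activation": ["relu", "tanh", "identity"],
}


def expand_grid(grid):
    """Return every combination of the listed values as a dictionary."""
    keys = list(grid)
    return [dict(zip(keys, values)) for values in product(*(grid[k] for k in keys))]


rf_configs = expand_grid(RF_GRID)
nn_configs = expand_grid(NN_GRID)
```

Under this reading, the grid search covers 12 RF configurations and 96 NN configurations per inner fold.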


The skip-gram model providing our pairwise embedding fea- 
tures was implemented using the TensorFlow package. We 
trained the model for 150 epochs, with a window size of 
w = 2, a batch size of 16, and an embedding dimension of 
15. We used categorical cross-entropy as the training loss. 
Because of its unsupervised nature, we trained the model on 
our whole dataset. 


Offline Classification. In a first experiment, we were inter- 
ested in assessing whether it is possible to associate students’ 
behaviour in the simulation with their conceptual under- 
standing achieved through the learning activity. This will 
be referred to as an offline classification task, since we are 
using students’ complete interaction sequences $r_s$. The
predictive performance in terms of AUC for the three
classification problems (4-class case, 3-class case, and 2-class case)
with a distinction between different model and feature com- 
binations is illustrated in Fig. 7. 


The results of this first experiment showed that for the
2-class case, all combinations of models and features reached
very high average performances as quantified by their AUC
scores (value range: 0.95 to 0.97). The best mean score
was achieved by the combination of NN with PW features
($AUC_{NN,PW} = 0.97$). However, it should be noted that
the performance differences between the combinations were
comparatively small. Using a one-way ANOVA, no statistically
significant differences were found between the different
groups ($F(5, 54) = 0.839$, $p = 0.528$). It seems that for
this rather rough classification into groups of students who
only understood the closed circuit and those who understood
both the open and closed circuits, the different combinations
of models and features perform equally well and with a high
predictive accuracy.


Figure 8: AUC for 4 classes, 3 classes, and 2 classes using different model and feature combinations. Predictions are made over
time, stopping at time step $t = 150$ as only very few students have longer interaction sequences.


By extending the classification task to the 3-class case, the
AUC scores dropped in comparison to the 2-class case (value
range: 0.86 to 0.90). The lowest score was observed for RF
with AC ($AUC_{RF,AC} = 0.86$), while best performances were
observed for RF with AS ($AUC_{RF,AS} = 0.90$), NN with AS
($AUC_{NN,AS} = 0.89$), and NN with PW ($AUC_{NN,PW} = 0.89$).
Similar to the 2-class case, no statistically significant
differences were found between the different groups using a
one-way ANOVA ($F(5, 54) = 0.740$, $p = 0.597$). This result
illustrates that further dividing those who understood the
functioning of the capacitor in both the open and closed
circuit had a similar impact on the predictive performance
of all model and feature combinations.


Finally, when evaluating the 4-class case, the most complex
classification task, the mean AUC scores further dropped for
all combinations (value range: 0.78 to 0.84). The lowest
performance was again observed for RF with AC
($AUC_{RF,AC} = 0.78$). Introducing the fourth class to the
classification problem seemed to have a smaller negative impact
on the AUC scores for NN with AS ($AUC_{NN,AS} = 0.83$) and NN with
PW ($AUC_{NN,PW} = 0.84$), which obtain the best performances
for the 4-class case. This observation was partially
confirmed by a one-way ANOVA that showed a trend towards
statistical significance for differences between the combinations
($F(5, 54) = 1.989$, $p = 0.095$): as we increase the number of
classes, the p-value decreases.
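
The degrees of freedom reported for the ANOVAs, $F(5, 54)$, are consistent with comparing six model and feature combinations on ten per-fold AUC scores each (6 - 1 = 5 and 60 - 6 = 54). This grouping is our reading and is not stated explicitly. A minimal sketch of the F statistic, using synthetic numbers rather than the reported results:

```python
def one_way_anova(groups):
    """Return (F statistic, df between groups, df within groups)."""
    k = len(groups)
    n_total = sum(len(g) for g in groups)
    grand_mean = sum(sum(g) for g in groups) / n_total
    means = [sum(g) / len(g) for g in groups]
    ss_between = sum(len(g) * (m - grand_mean) ** 2 for g, m in zip(groups, means))
    ss_within = sum((x - m) ** 2 for g, m in zip(groups, means) for x in g)
    df_between = k - 1
    df_within = n_total - k
    f_stat = (ss_between / df_between) / (ss_within / df_within)
    return f_stat, df_between, df_within


# Small hand-checkable example.
f_small, df_b_small, df_w_small = one_way_anova([[1, 2, 3], [2, 3, 4], [3, 4, 5]])

# Synthetic stand-in for six combinations with ten fold scores each.
toy = [[0.80 + 0.01 * g + 0.005 * i for i in range(10)] for g in range(6)]
f_toy, df_b_toy, df_w_toy = one_way_anova(toy)
```

The p-value would then follow from the F distribution with these degrees of freedom, which requires a statistics library and is left out here.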


The results of this first experiment show that it is possi- 
ble to perform an offline prediction of students’ conceptual 
understanding in the capacitor ranking task (see Section 2) 
based on the different combinations of models (NN or RF) 
and feature generation methods (AC, AS or PW) proposed 
in this work. From the entire data sequences of students’ 
interactions with the simulation, we observed that predic- 
tive performance generally decreased when the complexity 
of the classification task was increased. While all combinations
showed very good performances for the coarse classification
of the 2-class case, AUC scores started to diverge
more among combinations for the 3- and 4-class cases. This
was especially true for the 4-class case, where differences
became more visible, with certain combinations showing a trend
of better predictive performance (i.e. NN with PW and NN with AS)
as compared to others (e.g., RF with AC).


Predicting over Time. In a second experiment, we were
interested in assessing whether we could predict students’
conceptual knowledge for shorter interaction sequences, i.e.
when not using students’ full sequences, but only the first
$t$ interactions. For all three classification cases (2-class,
3-class, and 4-class), we trained all model (RF, NN) and
feature (AC, AS, PW) combinations for $t = 10, 20, \ldots, 150$ time
steps. As described in Section 3, to compute the features
$f_{AC,s,t}$, $f_{AS,s,t}$, and $f_{PW,s,t}$ for a student $s$ at time
step $t$, we only use the student’s interactions $i$ up to that
point in time, i.e. $i = 1, \ldots, t$. Similarly, at each time step
$t$, the models are exclusively used to predict students whose
interaction sequences contain a minimum of $t$ elements (i.e.
$N_s \geq t$). For students with shorter interaction sequences
(i.e. $N_s < t$), the last available prediction is used. For
example, for a student with 30 interactions ($N_s = 30$), we
would make the first three predictions using the models for
$t = 10$, $t = 20$, and $t = 30$ time steps. For the remaining
time steps, the predictions from $t = 30$ are carried
over. We chose this approach for predicting because we
assume that the student will leave the simulation after $N_s$
interactions and it therefore does not make sense to update
the prediction afterwards. Figure 8 illustrates the predictive
performance in terms of AUC for the different model
and feature combinations and all classification cases.
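
The carry-over rule can be made concrete with a short sketch that returns, for each evaluation time step, the trained model whose prediction is reported for a given student. Returning `None` for students with fewer than 10 interactions is an assumption, as the text does not cover that case; the function name is hypothetical.

```python
TIME_STEPS = list(range(10, 151, 10))  # t = 10, 20, ..., 150


def model_step_for(t, n_interactions, time_steps=TIME_STEPS):
    """Return the time step of the model whose prediction is used at time t.

    The latest trained step not exceeding both t and the student's
    sequence length is used; None means no prediction exists yet.
    """
    limit = min(t, n_interactions)
    usable = [step for step in time_steps if step <= limit]
    return usable[-1] if usable else None


# Student with 30 interactions: predictions stop updating after t = 30.
schedule = [model_step_for(t, 30) for t in TIME_STEPS]
```

For the example student with 30 interactions, the schedule uses the models for 10, 20, and 30 time steps and then repeats the last one.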


As expected, for the 2-class case, all models achieve a high
performance for long interaction sequences. The AUC of all
NN models is larger than 0.9, starting at time step $t = 70$.
Generally, the difference between the models and feature
combinations for $t > 70$ is small. We also observe some
model differences for earlier time steps, where the RF model
with the action span features performs better than the other
models. It achieves an AUC larger than 0.8 already at
time step $t = 30$ ($AUC_{RF,AS} = 0.82$). Moreover, the NN
model with action span is close to that performance at time
step $t = 30$ ($AUC_{NN,AS} = 0.79$). Naturally, predicting at
even earlier time steps is more difficult, but some of the
models achieve a decent AUC above 0.7 already after observing
20 interactions with the simulation ($AUC_{RF,AS} = 0.74$,
$AUC_{RF,AC} = 0.73$, $AUC_{NN,AS} = 0.71$).

Figure 9: Evolution of confusion matrices over time steps for the 3-class case: NN with PW (top) and RF with AS (bottom).

For the 3-class case, the performances of all algorithms are
lower from time step $t = 10$ onwards compared to the 2-class
problem. This is not surprising, as differentiating the correct
students from the areasep students is not as straightforward as
separating students who did not interact in the open circuit
from the rest. Although in the 2-class case all performances
were close to one another, we notice that for the 3-class
problem at time step $t = 40$, three model-feature combinations
take the lead ($AUC_{NN,PW} = 0.75$, $AUC_{RF,AS} = 0.74$,
$AUC_{NN,AS} = 0.74$), while three fall behind ($AUC_{RF,PW} = 0.7$,
$AUC_{RF,AC} = 0.69$ and $AUC_{NN,AC} = 0.68$) until $t = 70$,
where the NN with PW outperforms all other model and
feature combinations with an AUC of 0.87.


The variance across model-feature performances is larger in
the 4-class case. From time step $t = 20$ already, RF with
AS outperforms all other combinations, reaching an AUC
of 0.8 at time step $t = 100$. Similarly to the 3-class
problem, the same three models take the lead, while the others
fall behind from time step $t = 70$ ($AUC_{NN,PW} = 0.75$,
$AUC_{RF,AS} = 0.78$, $AUC_{NN,AS} = 0.76$ versus $AUC_{RF,PW} = 0.72$,
$AUC_{RF,AC} = 0.72$ and $AUC_{NN,AC} = 0.73$).


Given that the predictive performance in terms of AUC
seems to be similar for the best performing models, we
performed a more detailed analysis to assess how different the
predictions of the several model and feature combinations
were. In a real-world application (intervention setting), we
would probably use a 2-class classifier to identify a coarse
split (between classes both and closed) already at a
relatively low number of time steps and use a 3-class method
to provide a more detailed prediction later on. With the
current model performance for the 4-class case, usage of a
more detailed classifier does not seem practicable. We
therefore investigate the predictions of the two best models for
the 3-class case: NN with PW and RF with AS. Figure 9
shows the confusion matrices for these two models for an
increasing number of time steps. We do not show results
for $t > 100$, as the predictive performance of the models
does not improve much anymore for longer sequences (see
also Fig. 8). While both models have a very similar
predictive performance in terms of AUC up to time step 60, we
can already see that the models evolve differently in terms
of prediction. At time step $t = 40$, the RF model is
already very accurate in detecting students from class correct
($p(\text{correct} \mid c_{true} = \text{correct}) = 0.78$), while the NN model is
less confident ($p(\text{correct} \mid c_{true} = \text{correct}) = 0.67$). On the
other hand, the NN model is already more accurate in
identifying students from class closed
($p(\text{closed} \mid c_{true} = \text{closed}) = 0.66$), while the RF model
cannot identify students from this class well
($p(\text{closed} \mid c_{true} = \text{closed}) = 0.42$). At time step
$t = 60$, both models are almost equally accurate in
identifying students from class correct
(NN: $p(\text{correct} \mid c_{true} = \text{correct}) = 0.76$,
RF: $p(\text{correct} \mid c_{true} = \text{correct}) = 0.80$).
The NN model is still better at classifying students from the
class closed. Both models have trouble with correctly
identifying students from class areasep. While the NN model tends
to assign these students to class closed
($p(\text{areasep} \mid c_{true} = \text{closed}) = 0.52$), the RF model is
becoming better at correctly assigning them
($p(\text{closed} \mid c_{true} = \text{closed}) = 0.42$).
These observed trends continue to get stronger with an
increasing number of time steps. At $t = 100$, the NN model
is very accurate when it comes to classifying students from
classes correct and closed. Students from class areasep have
only a 35% chance of being correctly classified and a 55%
chance of getting assigned to class closed. In practice, this
would mean that 55% of these students would get more
intervention (hints) than necessary. The RF classifier is also
very accurate in detecting students from class correct, but
is, however, not able to distinguish between students from
class closed and class areasep. In practice, this would mean
misclassified students from class closed would get less help
than necessary and misclassified students from class areasep
would get more help than necessary.
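
The conditional probabilities quoted above are entries of a row-normalised confusion matrix: each row corresponds to a true class and sums to one. A minimal sketch with invented labels (not the study's predictions) shows how such entries are obtained; the function name is hypothetical.

```python
def row_normalised_confusion(true_labels, predicted_labels, classes):
    """Return rates[c][p], the share of class-c students predicted as p."""
    counts = {c: {p: 0 for p in classes} for c in classes}
    for y, y_hat in zip(true_labels, predicted_labels):
        counts[y][y_hat] += 1
    rates = {}
    for c in classes:
        total = sum(counts[c].values())
        rates[c] = {p: (counts[c][p] / total if total else 0.0) for p in classes}
    return rates


classes = ["correct", "areasep", "closed"]
true_labels = ["correct"] * 4 + ["closed"] * 2
predicted = ["correct", "correct", "correct", "areasep", "closed", "areasep"]
rates = row_normalised_confusion(true_labels, predicted, classes)
```

Reading along a row shows where the students of one true class end up, which is how over- and under-intervention can be estimated per class.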


This experiment shows that we can (coarsely) classify
students after observing a relatively low number of interactions.
For the 2-class case, the AUC of the best model (RF with
AS) is larger than 0.8 after $t = 30$ time steps. Naturally,
the classification task is more complex for the 3-class and
the 4-class cases. The best model on the 3-class case (NN
with PW) achieves an AUC close to 0.8 at time step 50.
The second analysis demonstrates that achieving a similar
predictive performance in terms of AUC does not imply the
same classification behaviour, i.e. the best models on the
3-class case (NN with PW and RF with AS) have different
strengths. It has, however, one important limitation:
students spend different amounts of time on the simulation and
therefore, the length of their interaction sequences varies.
There are for example students with 80 interactions and
other students with only 50 interactions. Performing the
classification task at time step 40 is early for a student with
a total of 80 interactions. For a student with only 50
interactions in total, the prediction at time step 40 comes almost
at the end of the interaction time with the simulation.


Figure 10: AUC on the 2-class, 3-class and 4-class cases for different model and feature combinations. Predictions were made
for 25%, 50%, 75% and 100% of the total number of interactions for each student.


Online (Early) Classification. The third and last experiment 
addresses the limitations of the previous experiment. In this 
experiment, we were interested in assessing the “early”
predictive performances of the different models using only a
part of a student’s sequence. Given the fact that the length 
of interaction sequences varies over the students, we did not 
align the sequences by absolute time steps, but by percent- 
ages of interactions. Specifically, we aimed at making the 
prediction for a student after having seen 25%, 50%, 75%, 
and 100% of the interaction sequence of this student. Note 
that this experiment does not require a re-training of our 
models. We just retrieve the predictions of the models for 
the corresponding time step t. For our example student with 
a total of 80 interactions, we retrieve the predictions of the 
models for time steps t = 20, t = 40, t = 60, and t = 80. 
Figure 10 shows the AUC of the models for all classification 
cases, with an increasing number of interactions (in %). 
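
The alignment by percentages can be sketched as a lookup of trained time steps. Rounding down to the nearest trained step and capping at $t = 150$ are assumptions made for this illustration; the text only gives the example of a student with 80 interactions, for which no rounding is needed.

```python
def steps_for_percentages(n_interactions, percentages=(25, 50, 75, 100),
                          step=10, max_step=150):
    """Map each percentage of a student's sequence to a trained time step.

    Assumptions: the number of observed interactions is rounded down to a
    multiple of `step`, and capped at the last trained time step.
    """
    result = {}
    for pct in percentages:
        seen = n_interactions * pct // 100  # interactions observed so far
        result[pct] = min((seen // step) * step, max_step)
    return result


steps_80 = steps_for_percentages(80)
steps_50 = steps_for_percentages(50)
```

For the example student with 80 interactions this reproduces the time steps 20, 40, 60, and 80 named in the text.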


As expected, all model and feature combinations perform 
well for the 2-class case. To achieve a high classification 
accuracy, we do not need to observe the full sequence of 
a student. For all NN models, obtaining 75% of students’ 
interactions is enough to achieve an AUC of around 0.9. 
With RF, the model using the action span features also
obtains an AUC of more than 0.9 at 75% of the interactions
($AUC_{RF,AS} = 0.92$). The performance of the two other
feature types is slightly lower ($AUC_{RF,AC} = 0.9$,
$AUC_{RF,PW} = 0.88$). Naturally, predictive performance of all the models
is lower when observing smaller parts of students’ interac- 
tion sequences. If obtaining only the first 25% of students’ 
interactions, there is more variation in the achieved AUC 
between models, with the best model (RF with AS) achiev- 
ing an AUC of 0.66 and the worst model achieving an AUC 
of 0.57 (RF with PW). It is promising that the best model 
at 50% of the interactions exhibits an AUC of almost 0.8 
($AUC_{RF,AS} = 0.78$), which makes it a valuable candidate
for a coarse early prediction and intervention, i.e. differ- 
entiating between students with a high conceptual under- 
standing (class both) and students with a low conceptual 
understanding (class closed), early on. 


Naturally, performance of the models for the 3-class case is
overall lower as we are now differentiating the different levels
of conceptual knowledge in a more fine-grained way. As we
have seen in Fig. 9, it is difficult to differentiate between
students from the left branch of the tree (i.e. correct vs.
areasep in Fig. 2). Performance across models varies more
for the 3-class case. When observing 75% of students’
interactions, the AUC of the worst model (NN with AS) amounts
to 0.78, while the best model (NN with PW) has an AUC
of 0.84. We also observe that the NN with PW features is
consistently the best model, regardless of the amount of
observed interactions, with an increasing gap to the other
models. At 50% of interactions, the gap in performance among
the three best models is still small ($AUC_{NN,PW} = 0.699$,
$AUC_{RF,AS} = 0.7$, $AUC_{RF,PW} = 0.69$). It gets larger at 75%
($AUC_{NN,PW} = 0.84$, $AUC_{RF,AS} = 0.82$, $AUC_{NN,AS} = 0.78$).
When observing the complete sequences of the students
(i.e. 100% of the interactions), all models reach an
AUC of 0.85 or higher (see also Fig. 7).


Again, the performance decreases when moving to four classes,
due to the increasing complexity (in terms of level of detail)
of the classification task. The AUC of the best model (NN
with PW) amounts to 0.83, while for the worst two models
AUC = 0.77 (RF with AC, NN with AC). While all the models’
AUC is lower than 0.7 when observing only 25% or 50%
of students’ interactions, interestingly there is a large gap
between the best model (RF with AS) and all the other models
(i.e. at 25%: $AUC_{RF,AS} = 0.59$, $AUC_{RF,AC} = 0.56$).


With this last experiment, we assessed the capabilities of 
our models to make predictions as early as possible during 
interaction with the simulation as a basis for intervention. 
By evaluating the models at different percentages of total 
interactions, we took into account the fact that the defini- 
tion of ‘early’ depends on the student. Our results show that 
after observing the first 50% of interactions, we are able to 
reliably distinguish between students with a high and low 
conceptual understanding gained by the end of the learning 
activity (both and closed). At 75% of interactions, the best 
models are also able to provide a more fine-grained predic- 
tion (correct, areasep, and closed). 


5. DISCUSSION 


Over the last decade, interactive simulations of scientific
phenomena have become increasingly popular. They allow
students to learn the principles underlying a domain through
their own explorations. However, only a few students
possess the degree of inquiry skills and self-regulation necessary
for effective learning in these environments. In this paper,
we therefore explored approaches for an early identification
of struggling students as a basis for adaptive guidance: we
aimed at predicting students’ conceptual knowledge while
interacting with a Physics simulation. Specifically, we were
interested in answering the following three research
questions: 1) Can students’ interaction with the data be
associated with the gained conceptual understanding? 2) Can
conceptual understanding be inferred through sequence mining
methods with embeddings? 3) Can the proposed methods be
used for early prediction of students’ conceptual understanding
based on partial sequences of interaction data?


To answer the first research question, we analysed data from 
192 first-year undergraduate Physics students who used an 
interactive capacitor simulation to solve a task in which they 
had to rank four capacitor configurations by their stored 
energy. Previous research has emphasised the importance 
of aligning instructional and assessment activities to imple- 
ment pedagogically meaningful learning activities with ed- 
ucational technology [26, 27]. Since one objective of auto- 
matically detecting student learning behavior is to provide 
some kind of assessment (either formative or summative), 
the learning task presented in this work was designed to
facilitate the application of sequence mining. The design of
this learning task allowed us to relate all of the students’
answers to a certain level of conceptual understanding. Using a
decision tree, we were able to map each ranking to one of the 
four labels representing groups of similar conceptual under- 
standing. Our results show that all evaluated models were 
able to correctly associate students’ sequential interaction 
data in the simulation with the generated labels, achieving 
high predictive performance when fed with full sequences 
(2-class case: AUC > 0.9, 3-class case: AUC > 0.85, 4-class 
case: AUC > 0.75). This high predictive power was also
observed in [22], whose authors reached an accuracy of 85% when
separating ‘high’ learners from ‘low’ learners based on full
interaction sequences. However, despite their findings of a 
potential third cluster, they did not investigate the ternary 
classification task. In this paper, we increase the granularity
of the labels to target more specific shortcomings in
the knowledge of the students and to provide them with
more detailed feedback. We therefore conclude that we can
answer research question 1) with yes. 


The second research question investigates the benefits of la- 
tent features generated by skip-grams for offline classifica- 
tion tasks in the context of education. Usually applied to 
NLP problems, skip-grams have the ability to learn the
context of a word in an unsupervised fashion. In our case,
we use them to capture the behaviour of the students in the
OELE surrounding each interaction, and retrieve the embedding
matrix of the neural network to create our latent
representations. This approach has already been proven efficient to
analyse student strategies in blended courses [28], but not 
for the identification of conceptual understanding. To eval- 
uate the predictive power of latent feature representations, 
we trained two classifiers (NN and RF) on three types of
features (Action Counts, Action Span and Pairwise
Embeddings) on the full sequences of students. At first, we notice
that all model and feature combinations achieve a high AUC
for all classification tasks (2-class case, 3-class case, and
4-class case). Though the ANOVA revealed no significant
differences between the predictive performance of models with
different types of features, we can observe that the NN with 
PW achieves a higher performance on average than all other 
combinations but the NN with AS. What is more, the per- 
formances in its first quartile dominate those from the third 
quartile of three model and feature combinations. Addi- 
tionally, its performance variance is smaller than that of the 
NN with AS. This shows that pairwise embeddings gener- 
ated by a skip-gram approach can be a valuable asset for 
finer-grained classification, even if no statistical difference 
was found with respect to the other model-feature combina- 
tions. We can therefore answer research question 2) with a 
partial yes. 


To address the third research question, we assessed predic- 
tive performance of the proposed approach when only partial 
sequences of the students’ interaction data were observed. 
We analysed the performances of our proposed approaches 
based on varying proportions of the available data and for 
classification tasks with different levels of complexity. The 
results of our experiments show that the proposed combina- 
tions of models and generated features allowed us to predict 
the correct class labels early on. The best models were able 
to reliably predict students’ conceptual understanding for 
the 2-class case (AUC ≈ 0.8) after having seen 50% of the
students’ interaction data. To reach a similar predictive per- 
formance for the more fine-grained 3-class and 4-class cases, 
the best models needed about 75% of the data. The findings 
from these experiments therefore represent a promising step 
towards early prediction of students’ conceptual understand- 
ing in OELEs. We can therefore answer research question 
3) with yes. 


One limitation of this work is that we could not
track whether students used external resources (other than
the simulation) in order to rank the four capacitor configura- 
tions. This may bias the inference from the simulation usage 
to the extrapolated understanding level. Furthermore, due 
to our small sample size, we were able to only train shallow 
NN classifiers and skip-grams. Finally, the external valid- 
ity of these experiments remains to be evaluated on other 
interactive simulations and different types of tasks. 


To conclude, the proposed approach represents a promising
step towards early prediction of students’ learning strategies
in interactive simulations that can, moreover, be associated
with their level of conceptual understanding. The proposed
learning activity seems to represent an interesting example 
for the design of learning tasks in OELEs that facilitates the 
association of detected student strategies with conceptual 
understanding through sequence mining. Future work could 
explore whether such designs could also be used to identify 
conceptual understanding at a more fine-grained level. 
ABSTRACT


As students learn how to program, both their programming
code and their understanding of it evolve over time. In this
work, we present a general data-driven approach, named
Temporal-ASTNN, for modeling student learning progression
in open-ended programming domains. Temporal-ASTNN
combines a novel neural network model based on abstract
syntax trees (AST), named ASTNN, with a Long-Short Term
Memory (LSTM) model. ASTNN handles the linguistic nature
of student programming code, while LSTM handles the
temporal nature of student learning progression. The
effectiveness of ASTNN is first compared against other
models, including a state-of-the-art algorithm, Code2Vec,
across two programming domains, iSnap and Java, on the task
of program classification (correct or incorrect). Then the
proposed Temporal-ASTNN is compared against the original
ASTNN and other temporal models on the challenging task of
early prediction of student success. Our results show that
Temporal-ASTNN achieves the best performance with only the
first 4 minutes of temporal data, and it continues to
outperform all other models with longer trajectories.


Keywords 
Student Modeling in Programming, LSTM, ASTNN 


1. INTRODUCTION 


Learning how to program is like learning how to write in
a second language. As students learn to author code, both
their programming code and their understanding of it evolve
over time. Prior research has focused either exclusively on
developing accurate linguistic models of their artifacts [30,
24, 1, 42] or on developing temporal models of students’
comprehension of programming [11, 21, 23]. In this work, we
propose a general data-driven approach named Temporal-ASTNN,
which combines a state-of-the-art neural network model based
on abstract syntax trees (AST), named ASTNN, which addresses
the linguistic structure of the students’ artifacts, with
Long-Short Term Memory (LSTM), which handles their learning
progression. In this way we effectively marry both aspects of
the process in a single system.


Much as language is how people communicate, programming
languages are how we communicate with machines, and various
natural language processing (NLP) techniques can be applied
to modeling programming languages [15]. Traditional
approaches for code representation often treat code fragments
as natural language texts and model them based on their
tokens [7, 9]. Despite their simplicity, token-based methods
omit the rich and explicit structural information [25] in
student code. More recently, deep learning models have
achieved state-of-the-art results on source code analysis,
including code functionality classification [24], method name
prediction [1], and code clone detection [42]. These
successful models usually combine Abstract Syntax Tree (AST)
representations with various neural networks to capture the
structural information of the programming language. Their
impressive performance shows that syntactic knowledge, which
reflects the linguistic structure of code, is indeed important
for learning meaningful code representations.


On the other hand, modeling student learning progression
in open-ended programming environments is also a type of
student modeling. Generally speaking, student modeling
has been widely applied to predict a student’s future
performance based on historical data. For well-defined
learning environments, student models such as Bayesian
Knowledge Tracing (BKT) [8] and Deep Knowledge Tracing (DKT)
[29] usually monitor students’ learning progress (correct or
incorrect) over time to infer their knowledge states. When it
comes to open-ended programming environments, student
modeling becomes much more challenging because 1) a
correctness evaluation of each step taken by students is not
available, and 2) it is extremely hard to represent student
states. As a result, prior research has focused either on
utilizing other features, such as hint usage and interface
interactions, to evaluate student learning outcomes [11], or
on creating meaningful states by transforming student
click-like log files into fixed feature sets for various
student modeling tasks [21]. While such prior work is able to
capture the temporal information in historical data, it
ignores the linguistic, structural properties of student code.
As an accurate student model is a building block for any
educational system that provides adaptivity and
personalization, it is especially important to model student
learning progression in open-ended programming tasks by
addressing both the linguistic and the temporal
characteristics of student code sequences.


In this work, we present a data-driven approach named
Temporal-ASTNN to model student learning progression in
open-ended programming domains. Temporal-ASTNN consists of
two main modules: 1) ASTNN [42] for code representation
learning, which handles the linguistic structure of student
code, and 2) LSTM [16] for temporal learning, which handles
the temporal nature of student learning progression. In order
to explore the effectiveness of our model, we focus on two
types of student modeling tasks. One is the task of program
classification (correct or incorrect), in which the
effectiveness of ASTNN is compared against other models,
including a state-of-the-art algorithm, Code2Vec [1], across
two programming domains: an open-ended block-based
programming environment named iSnap and a textual programming
environment for the Java programming language. The other is
the task of student success early prediction, in which the
effectiveness of Temporal-ASTNN is compared against the
original ASTNN and other models using different feature
embeddings. This second task is evaluated on iSnap only,
because only iSnap provides trajectories of student code.


Our main contributions are: 1) to the best of our knowledge,
Temporal-ASTNN is the first model to address both the
linguistic and the temporal properties of student learning
progression in programming tasks; 2) we explored the
robustness and the effectiveness of our model on the student
success early prediction task and compared it with
state-of-the-art temporal models; and 3) we evaluated the
effectiveness of ASTNN against Code2Vec and various baseline
models on student program classification tasks across two
domains, whereas most prior research focused on classic tasks
of professional source code analysis rather than novice
programming.


The remainder of this paper is structured as follows.
Section 2 presents the methods. Sections 3 and 4 describe the
two types of programming tasks together with experimental
settings and results. Section 5 presents the related work.
Finally, we discuss and conclude our work in Section 6.


2. METHODS 


Problem Definition: For the task of student program
classification, our dataset can be represented as
$(X, Y) = \{(x^1, y^1), (x^2, y^2), \ldots, (x^N, y^N)\}$,
where $N$ is the total number of code snippets in the dataset,
$x^i$ represents a code snippet of student $i$, and the binary
label $y^i$ indicates whether the code is correct or not.


For the task of student success early prediction, our dataset
can be represented as $X = \{x^1, x^2, \ldots, x^M\}$, where
$M$ is the number of students. For a given student $k$,
$x^k = \{x^k_1, \ldots, x^k_{T_k}\}$, where $x^k_t$ represents
student $k$'s code at time step $t$ in $x^k$ and $T_k$ is the
total number of codes in student $k$'s learning trajectory,
which varies across students. For each $x^k$, we are provided
with the outcome label $y^k$ of the sequence of codes:
$y^k = 0$ indicates that student $k$ succeeded; otherwise
$y^k = 1$. The goal of student success early prediction is to
predict $y^k$ using the student’s codes from the beginning up
to a certain number of minutes: $x^k_1, x^k_2, \ldots, x^k_t$.
For simplicity, we omit the index $k$ hereinafter when it does
not cause ambiguity.


Example Code

snapshot {
  down
  doRepeat([literal=10], script {
    forward([literal=100])
  })
}


Figure 1: An example of iSnap code and the AST representing
its syntactic structure. Red highlights a sample path, and
blue highlights a sample ST-tree.


2.1 Temporal-ASTNN 

Figure 2 shows the detailed structure of Temporal-ASTNN.
Fundamentally, it contains an ASTNN, which learns the
embedding for student code, and an LSTM layer, which handles
the temporal aspect. It is important to note that in
Temporal-ASTNN, the two modules interact with each other
to control how information flows.


Figure 2: Temporal-ASTNN model structure: the output of
ASTNN connects to the input of LSTM. (Diagram labels include
Statement Encoder, Max Pooling Layer, Linear Layer, and
Sigmoid.)


2.1.1 ASTNN 

ASTNN is one of the state-of-the-art methods in source code
analysis, and its main idea is to learn a vector for the code
through statement-level ASTs. Specifically, we split the
large AST of a code fragment at the granularity of statements
and extract a sequence of statement trees (ST-trees) via
pre-order traversal. As shown in Figure 1 (highlighted in
blue), we can get an ST-tree rooted at forward, whose child
is literal and grandchild is 100. In this way, we get a
sequence of ST-trees from the original AST and feed them
as the raw input of ASTNN. As shown in Figure 2, the ST-trees
first pass through the Statement Encoder, then go through a
Bidirectional GRU (Bi-GRU) [2], and finally pass through a
max-pooling layer to produce the vector for code
representation.


Figure 3: Statement Encoder of ASTNN, composed of an
Embedding layer, an Encoding layer and a max-pooling layer.
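
The splitting of an AST into ST-trees described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the `Node` class, the `STATEMENT_LABELS` set, and the choice to cut a statement's subtree at nested statements are hypothetical and are not taken from the authors' implementation.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Node:
    label: str
    children: List["Node"] = field(default_factory=list)


# Hypothetical set of node labels that count as statements (an assumption).
STATEMENT_LABELS = {"down", "doRepeat", "forward"}


def prune(node: Node) -> Node:
    """Copy a statement subtree, cutting it at any nested statement."""
    kept = [prune(c) for c in node.children if c.label not in STATEMENT_LABELS]
    return Node(node.label, kept)


def split_into_st_trees(root: Node) -> List[Node]:
    """Collect one ST-tree per statement, in pre-order."""
    st_trees: List[Node] = []

    def visit(node: Node) -> None:
        if node.label in STATEMENT_LABELS:
            st_trees.append(prune(node))
        for child in node.children:
            visit(child)

    visit(root)
    return st_trees


# The example program from Figure 1, written as a tree by hand.
example_ast = Node("snapshot", [
    Node("down"),
    Node("doRepeat", [
        Node("literal", [Node("10")]),
        Node("script", [Node("forward", [Node("literal", [Node("100")])])]),
    ]),
])
st_trees = split_into_st_trees(example_ast)
```

For the Figure 1 example, this yields three ST-trees rooted at down, doRepeat, and forward, with the forward tree having literal as its child and 100 as its grandchild.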


Statement Encoder: Figure 3 shows the detailed structure of
the statement encoder. Assuming that there are $J$ total nodes
in an ST-tree $s_i$, each node $n_j \in s_i$, $j \in [1, J]$,
first goes through the embedding layer to get its initial
embedding $v_j = W_{embed}^{\top} n_j$, where
$W_{embed} \in \mathbb{R}^{|V| \times d}$ is the pre-trained
embedding matrix, $V$ is the vocabulary size and $d$ is the
embedding dimension. The vector is then updated through a
Recursive Neural Network [35] based encoder layer:
$h_j = \sigma(W_{encode}^{\top} v_j + \sum_{c \in children(j)} h_c + b_j)$.
Here $W_{encode} \in \mathbb{R}^{d \times k}$ is the encoding
matrix and $k$ is the encoding dimension, $b_j$ is the bias
term and $\sigma$ is the activation function; in this work we
followed the original paper and set $\sigma$ to the identity
function. After recursively computing the vectors of all
nodes in the ST-tree, we sample the final representation
$e_i$ via a max-pooling layer:

$e_i = [\max(h_{j1}), \max(h_{j2}), \ldots, \max(h_{jk})], \quad j \in [1, J]$   (1)
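
To make the recursion and the max-pooling in Equation (1) concrete, the following is a minimal pure-Python sketch. It assumes an identity activation (as stated above), represents a tree as a `(label, children)` tuple, and uses a small hand-written embedding table; the function name `encode_statement` and all numeric values are illustrative assumptions, not the authors' code.

```python
from typing import Dict, List, Tuple

Vector = List[float]
Tree = Tuple[str, list]  # (label, list of child trees)


def encode_statement(tree: Tree, embed: Dict[str, Vector],
                     w_encode: List[Vector], bias: Vector) -> Vector:
    """Return e: the element-wise max over the hidden vectors of all nodes."""
    hidden_states: List[Vector] = []

    def encode(node: Tree) -> Vector:
        label, children = node
        v = embed[label]  # initial embedding v_j of length d
        # W_encode^T v_j + b, with W_encode stored as a d x k matrix
        h = [sum(w_encode[r][c] * v[r] for r in range(len(v))) + bias[c]
             for c in range(len(bias))]
        for child in children:  # add the hidden vectors of the children
            h = [a + b for a, b in zip(h, encode(child))]
        hidden_states.append(h)  # identity activation
        return h

    encode(tree)
    # Max-pooling: one maximum per encoding dimension, taken over all nodes.
    return [max(column) for column in zip(*hidden_states)]


# Toy values with d = 2 and k = 3 (all numbers are made up for illustration).
embed = {"forward": [1.0, 0.0], "literal": [0.0, 1.0], "100": [1.0, 1.0]}
w_encode = [[1.0, 0.0, -1.0],
            [0.0, 1.0, 1.0]]
bias = [0.0, 0.0, 0.0]
st_tree = ("forward", [("literal", [("100", [])])])
e = encode_statement(st_tree, embed, w_encode, bias)
```

With these toy values the three nodes get hidden vectors [1, 1, 0] (for 100), [1, 2, 1] (for literal) and [2, 2, 0] (for forward), so the pooled representation is [2, 2, 1].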


Code Representation: For a sequence of ST-trees
$(s_1, s_2, \ldots, s_L)$, where $L$ is the number of ST-trees
in the AST, our goal is to get a code vector $z$ as the final
representation. After generating a sequence of vectors
$(e_1, e_2, \ldots, e_L)$ from the Statement Encoder, we apply
a Bi-GRU to track the naturalness of the statement sequence:

$h_i = [\overrightarrow{GRU}(e_i), \overleftarrow{GRU}(e_i)], \quad i \in [1, L]$   (2)

The statement representations form
$h \in \mathbb{R}^{L \times 2m}$, where $m$ is the embedding
dimension of the Bi-GRU. Finally, similar to the Statement
Encoder, a max-pooling layer is used to sample the most
important features on each embedding dimension. Thus we get
$z \in \mathbb{R}^{2m}$, which is treated as the final vector
representation of the original code fragment.


In the original ASTNN, we can add another linear layer to
directly fit $z$ to the subsequent prediction tasks. In
Temporal-ASTNN, by contrast, $z$ is used as the input for the
LSTM memory cell.


2.1.2 LSTM 

As shown in Figure 2, at each time step $t$, the output of
ASTNN, $z_t$, is used as the input for the LSTM cell. ASTNN
first generates the code representation by learning the
linguistic nature of code $x_t$:

$z_t = \mathrm{ASTNN}(x_t)$   (3)

LSTM is then trained on the input vectors $z_t$ to handle the
temporal information. An LSTM memory cell has three major
components: a forget gate, an input gate, and an output gate.


Forget Gate: In the first step, a function of the previous
hidden state $h_{t-1}$ and the new code input $z_t$ passes
through the forget gate, indicating what is probably
irrelevant and can be taken out of the cell state. The forget
component calculates a weight $f_t$ between 0 and 1 for each
element in the cell state vector $C_{t-1}$. Here $W_f$ and
$b_f$ are the weights and bias for the forget component.

$f_t = \mathrm{sigmoid}(W_f \cdot [h_{t-1}, z_t] + b_f)$   (4)


Input Gate: There are two steps involved in the input
component’s calculation. In the first step, a tanh layer
calculates a candidate vector $\tilde{C}_t$ that could be
added to the current cell state. In the second step, the input
component calculates a weight vector $i_t$ (ranging from 0 to
1) to determine to what extent $\tilde{C}_t$ should update the
current memory state.

$\tilde{C}_t = \tanh(W_C \cdot [h_{t-1}, z_t] + b_C)$
$i_t = \mathrm{sigmoid}(W_i \cdot [h_{t-1}, z_t] + b_i)$   (5)


Output Gate: The output component is simply an activation
function that filters elements in the memory cell state $C_t$,
where $C_t = C_{t-1} \cdot f_t + \tilde{C}_t \cdot i_t$. It
calculates a weight vector to determine how much information
is allowed to be revealed:

$o_t = \mathrm{sigmoid}(W_o \cdot [h_{t-1}, z_t] + b_o)$   (6)

Finally we get the output at time $t$:
$h_t = o_t \times \tanh(C_t)$. In this work, we used the
last-step output from LSTM as the temporal representation of
the student code sequence.
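
The gate equations (4) to (6) can be traced step by step with the small pure-Python sketch below. It is meant only to show how the quantities fit together; the parameter dictionary layout, the helper names, and the zero initial states are assumptions of this sketch, and a real model would use a deep learning library instead.

```python
import math
from typing import Dict, List, Tuple


def sigmoid(x: float) -> float:
    return 1.0 / (1.0 + math.exp(-x))


def affine(W: List[List[float]], b: List[float], x: List[float]) -> List[float]:
    """Compute W x + b for a row-major matrix W."""
    return [sum(w * xi for w, xi in zip(row, x)) + bi for row, bi in zip(W, b)]


def lstm_step(z_t: List[float], h_prev: List[float], c_prev: List[float],
              p: Dict[str, list]) -> Tuple[List[float], List[float]]:
    x = h_prev + z_t  # concatenation [h_{t-1}, z_t]
    f = [sigmoid(a) for a in affine(p["W_f"], p["b_f"], x)]          # Eq. (4)
    c_tilde = [math.tanh(a) for a in affine(p["W_C"], p["b_C"], x)]  # Eq. (5)
    i = [sigmoid(a) for a in affine(p["W_i"], p["b_i"], x)]          # Eq. (5)
    o = [sigmoid(a) for a in affine(p["W_o"], p["b_o"], x)]          # Eq. (6)
    c = [cp * ft + ct * it for cp, ft, ct, it in zip(c_prev, f, c_tilde, i)]
    h = [ot * math.tanh(ci) for ot, ci in zip(o, c)]
    return h, c


def run_lstm(z_seq: List[List[float]], p: Dict[str, list],
             hidden_size: int) -> List[float]:
    """Return the last-step output h_T for a sequence of code vectors."""
    h, c = [0.0] * hidden_size, [0.0] * hidden_size  # assumed zero start
    for z_t in z_seq:
        h, c = lstm_step(z_t, h, c, p)
    return h


# Toy run with all-zero parameters (hidden size 2, input size 3).
hidden, inp = 2, 3
zero_params: Dict[str, list] = {
    name: [[0.0] * (hidden + inp) for _ in range(hidden)]
    for name in ("W_f", "W_i", "W_C", "W_o")
}
zero_params.update({name: [0.0] * hidden for name in ("b_f", "b_i", "b_C", "b_o")})
h_last = run_lstm([[1.0, 2.0, 3.0], [0.5, 0.5, 0.5]], zero_params, hidden)
```

With all-zero parameters every gate equals 0.5 and the candidate vector is zero, so the cell state and the output stay at zero, which is a quick sanity check of the update rule.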


2.1.3 Temporal-ASTNN: Truncated vs. Entire 
As shown in Figure 2, by combining ASTNN and LSTM,
the final Temporal-ASTNN can be described as:

$z_1, \ldots, z_T = \mathrm{ASTNN}(x_1, \ldots, x_T)$
$h_T = \mathrm{LSTM}(z_1, \ldots, z_T)$   (7)
$\hat{y} = \mathrm{sigmoid}(W_y h_T + b_y)$

where $\hat{y}$ is the output from Temporal-ASTNN, $W_y$ is
the weight matrix and $b_y$ is the bias term of the linear
layer. The entire Temporal-ASTNN framework is learned by
optimizing the ASTNN and LSTM parameters simultaneously. They
are optimized by minimizing the binary cross-entropy:

$L(\hat{y}, y; \theta) = -(y \log(\hat{y}) + (1 - y) \log(1 - \hat{y}))$   (8)
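
As a small numeric illustration of the binary cross-entropy in Equation (8), the sketch below evaluates the loss for a single prediction. The clipping constant `eps` is an assumption added here to avoid taking the logarithm of zero; it is not part of the equation.

```python
import math


def binary_cross_entropy(y_hat: float, y: int, eps: float = 1e-12) -> float:
    """Loss for one example with predicted probability y_hat and label y."""
    y_hat = min(max(y_hat, eps), 1.0 - eps)  # clipping is an added assumption
    return -(y * math.log(y_hat) + (1 - y) * math.log(1.0 - y_hat))


# A confident correct prediction costs less than a confident wrong one.
loss_good = binary_cross_entropy(0.9, 1)
loss_bad = binary_cross_entropy(0.1, 1)
```

An uninformative prediction of 0.5 gives a loss of log 2 regardless of the label, which is a convenient reference point when reading training curves.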


Prior research on applying ASTNN to source code analysis
used only a single code fragment to extract a meaningful
representation for the subsequent machine learning tasks.
However, when combining ASTNN with LSTM on student
programming sequences such as iSnap for early prediction,
we have the choice of using either truncated training
sequences or the entire sequences. The advantage of using
truncated sequences is that the training data would be more
similar to the testing data and thus the learned
representations are more likely to be representative for the
early success prediction task. On the other hand, the
advantage of using entire sequences is that the longer the
sequences, the more meaningful AST patterns can be considered
and discovered. Thus, we explored Temporal-ASTNN using both
the truncated and the entire sequences for representation
learning, referred to as Temporal-ASTNN_Trunc and
Temporal-ASTNN_Entire, respectively.


2.2 Code2Vec 

Code2Vec [1] leverages different features and model struc- 
tures, and focuses on the dependency of distant components 
in code structures to achieve code classification tasks. As 
with ASTNN, Code2Vec is designed to address the linguis- 
tic structure of programming languages. Fundamentally, 
there are two main differences between these two models:
1) ASTNN takes a set of statement-level ASTs as inputs,
while Code2Vec utilizes the syntactic paths of ASTs to learn
the representation (an example path is shown in Figure 1 in
red); and 2) after encoding the vector representations of
ST-trees, ASTNN uses a Bi-GRU to handle the sequence of
vectors, while Code2Vec utilizes an attention mechanism to
learn a weighted average of path vectors and thus produce
the final code representation. With the vector representing
the code, Code2Vec can also be used for various prediction
tasks.
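
The attention-weighted average of path vectors mentioned above can be sketched as follows. This is a simplified illustration that assumes the path vectors are already encoded and that attention scores are dot products with a single attention vector followed by a softmax; the function name `attention_pool` and the toy numbers are assumptions of this sketch.

```python
import math
from typing import List, Tuple


def attention_pool(path_vectors: List[List[float]],
                   attention_vector: List[float]) -> Tuple[List[float], List[float]]:
    """Return (code_vector, weights) for a set of encoded path vectors."""
    scores = [sum(p * a for p, a in zip(vec, attention_vector))
              for vec in path_vectors]
    top = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    weights = [e / total for e in exps]  # softmax over paths
    dim = len(attention_vector)
    code_vector = [sum(w * vec[d] for w, vec in zip(weights, path_vectors))
                   for d in range(dim)]
    return code_vector, weights


# Two toy path vectors; a zero attention vector gives equal weights.
code_vec, attn = attention_pool([[1.0, 0.0], [0.0, 1.0]], [0.0, 0.0])
```

With a zero attention vector both paths receive weight 0.5, so the code vector is the plain average; a trained attention vector would shift the weights toward the more informative paths.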


3. STUDENT PROGRAM CLASSIFICATION 


In the task of student program classification, we aimed to 
predict the correctness (correct or incorrect) of student sub- 
mitted code. The effectiveness of ASTNN is compared against 
Code2Vec and other token-based models across two pro- 
gramming domains: iSnap and Java. 


3.1 Datasets 
3.1.1 iSnap 


iSnap is an extension to Snap! [13], a block-based pro- 
gramming environment, used in an introductory computing 
course for non-majors in a public university in the United 
States [32]. iSnap extends Snap! by providing students with
data-driven hints derived from historical correct student
solutions [31]. In addition, iSnap logs all students’ actions
while programming (e.g., adding or deleting a block) as a
trace, allowing us to detect the sequences of all student
steps, as well as the time taken for each step. In this work,
we focused
on one homework exercise named Squiral, derived from the 
BJC curriculum [13]. In Squiral, students are asked to write 
a procedure that draws a square-like spiral. As shown in 
Figure 4, correct solutions require procedures, loops, and 
variables using at least 7 lines of code. We collected stu- 
dents’ data for Squiral from Spring 2016, Fall 2016, Spring 
2017, and Fall 2017. We excluded students who requested 
hints from iSnap to eliminate factors that might affect stu- 
dents’ problem-solving progress, leaving a total of 65, 38, 29, 
and 39 student code traces from each semester, respectively. 


The data collected from iSnap consists of a code trace for
each student’s attempt. This code trace represents a sequence
of timestamped snapshots of student code. In prior research,
an expert feature detector was proposed to automatically
detect 7 expert features of a student snapshot [43]. Those
expert features are binary and indicate whether the
corresponding feature is present or not. We ran the
expert-feature detector to tag each snapshot in all 171 code
traces, making a total of 31,064 tagged snapshots. Because it
contains temporal sequences, the iSnap data is evaluated not
only on this classification task but also on the temporal
early prediction task described in Section 4.


Figure 4: The iSnap interface, with the blocks palette on the
left, the output stage on the right, the scripting area in the
middle, and the hints button on top.


3.1.2. CodeWorkout 

CodeWorkout¹ is an online and open system for programming
in Java. It provides a web-based platform on which students
from various backgrounds can practice programming and
instructors can offer courses [10]. Unlike iSnap, CodeWorkout
does not log students’ traces during programming but only
their submissions. In this work, we focused on one
programming exercise named isEverywhere, which mainly
evaluates knowledge of loops and arrays. In isEverywhere,
students are asked to write a Java function that checks
whether a value is “everywhere” in a given array, that is,
whether at least one element of every pair of adjacent
elements is that value. As shown in Figure 5, the system shows
detailed feedback on the student’s submission, indicating
which test cases it passed or failed.


Exercise text shown in the screenshot (X45: isEverywhere):
We'll say that a value is "everywhere" in an array if for
every pair of adjacent elements in the array, at least one of
the pair is that value. Return true if the given value is
everywhere in the array.


Figure 5: The CodeWorkout interface, with the problem
description on the top, the coding area in the middle, and the
feedback on the right.


The data collected from CodeWorkout is in ProgSnap2 [33]
format and covers two semesters: Spring 2019 and Fall 2019.
Similar to iSnap, we processed the data to eliminate factors
that might affect students’ problem-solving progress, and
only kept the first compilable program from each student. In
total, we have 307 and 488 student submissions from the two
semesters, respectively (see Table 1). Please note that in
CodeWorkout, only submissions from students are recorded and
sequences of student edits are not available; thus it is only
evaluated on the task of student program classification.


¹ https://codeworkout.cs.vt.edu/


3.2 Task Description 

For the task of student program classification, the ground
truth labels are generated as follows: in iSnap, a student’s
submission is correct only if it satisfies all rubric
requirements, which are based on the expert-designed features
and verified by humans; in CodeWorkout, a submission is
correct if it passes all test cases. Table 1 shows the number
of correct and incorrect submissions across the semesters in
each dataset. Note that here we include one submission per
student to ensure that all data points are independent in
both datasets. More specifically, for each student, we
include the student’s “own” submission before receiving any
detailed feedback, which means the student’s final submission
for iSnap and the first submission for CodeWorkout.


Table 1: Data overview on Student Program Classification

            iSnap                         CodeWorkout
Semester  correct  incorrect    Semester  correct  incorrect
S16       24       41           S19       156      151
F16       16       22           F19       223      265
S17       12       17           Total     379      416
F17       11       28
Total     63       108


3.3. Experiments 
3.3.1 Models Configuration 


We conducted a series of experiments across both domains 
by comparing ASTNN against the state-of-the-art model 
Code2Vec and three token-based classic ML models. 


Three Token-based ML Models: Three classic ML models,
K-Nearest Neighbors (KNN), Logistic Regression (LG), and
Support-Vector Machine (SVM), are explored. Following prior
token-based approaches, we applied TF-IDF to extract textual
features [42, 34]. The input sentence for TF-IDF is the
sequence of AST tokens generated by a pre-order traversal of
the original AST. For each of the three models, we explored
different parameters to obtain the best results: for KNN we
used k = 10, for LG we used L1 regularization, and for SVM we
used a linear kernel. These parameters were tuned by 10-fold
cross-validation with grid search, and all three models are
implemented with the sklearn library.
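
The token pipeline described above (pre-order traversal of the AST followed by TF-IDF) can be sketched in plain Python. Note the assumptions: trees are `(label, children)` tuples, and the weighting is the textbook form tf × log(N / df), which differs from the smoothed variant that sklearn applies by default; the function names are illustrative only.

```python
import math
from collections import Counter
from typing import Dict, List, Tuple

Tree = Tuple[str, list]  # (label, list of child trees)


def preorder_tokens(node: Tree) -> List[str]:
    """Flatten an AST into its token sequence via pre-order traversal."""
    label, children = node
    tokens = [label]
    for child in children:
        tokens.extend(preorder_tokens(child))
    return tokens


def tfidf(documents: List[List[str]]) -> List[Dict[str, float]]:
    """Textbook TF-IDF: term frequency times log(N / document frequency)."""
    n_docs = len(documents)
    doc_freq = Counter(tok for doc in documents for tok in set(doc))
    vectors = []
    for doc in documents:
        counts = Counter(doc)
        vectors.append({tok: (cnt / len(doc)) * math.log(n_docs / doc_freq[tok])
                        for tok, cnt in counts.items()})
    return vectors


# The Figure 1 program as a tree, flattened into a token "sentence".
figure1_tree = ("snapshot", [
    ("down", []),
    ("doRepeat", [
        ("literal", [("10", [])]),
        ("script", [("forward", [("literal", [("100", [])])])]),
    ]),
])
sentence = preorder_tokens(figure1_tree)
toy_vectors = tfidf([["a", "b"], ["a", "c"]])
```

Tokens that occur in every document (such as "a" in the toy corpus) receive a weight of zero under this formula, which is why structural tokens shared by all submissions contribute little to the token-based classifiers.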


Two AST-based Deep Learning Models: Code2Vec takes a set of
AST-based paths as input, where the number of paths may vary
across student submissions. Thus we padded the number of
paths to 100 for all code submissions. During training, we
set the maximum number of training epochs to 200, with the
early-stopping patience set to 100, and tuned the learning
rate to 0.0002. The linear layer and embedding dimensions are
kept at the default of 100. To ensure the highest efficiency
of the model, we used full-batch training. For ASTNN, the
inputs are a set of ST-trees, and we padded the statement
sequences to the length of the longest sequence before
feeding them to the Bi-GRU. During training, we used a batch
size of 32, a learning rate of 0.001, and a maximum of 50
training epochs. The encoding dimension for the statement
encoder is set to 128, and the number of hidden neurons for
the Bi-GRU is set to 100. We implemented both ASTNN and
Code2Vec in PyTorch. As with the classic models, 10-fold
cross-validation was applied for hyperparameter tuning.


For the task of student program classification, we did not 
compare ASTNN and Code2Vec against any models that 
used expert-designed features for two reasons: one is that 
the expert-designed features are only available for iSnap but 
not CodeWorkout; and the other reason is that these expert- 
designed features are used to determine the ground truth 
label of the student’s final submission in iSnap. 


3.3.2. Evaluation Metrics 

Our models were evaluated using Accuracy, Precision, Recall,
F1 Score, and AUC (Area Under the ROC Curve). Accuracy
represents the proportion of students whose labels were
correctly identified. Precision is the proportion of students
predicted to be incorrect by a model who were actually in the
incorrect group. Recall is the proportion of students who
were actually incorrect and were correctly recognized by the
model. F1 Score is the harmonic mean of Precision and Recall,
which balances their trade-off. AUC measures the ability of
models to discriminate between groups with different labels.
Given the nature of the task, in the following we consider
Accuracy and AUC the most important metrics, because the
former is the most commonly accepted while AUC is believed to
be generally more robust.
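
The metric definitions above can be written out directly. The sketch below treats label 1 as the positive (incorrect) class, in line with the definitions of Precision and Recall given above, and computes AUC as the fraction of positive/negative pairs that the scores rank correctly (ties count one half); the function names and the toy data are assumptions of this sketch.

```python
from typing import Dict, List


def classification_metrics(y_true: List[int], y_pred: List[int],
                           positive: int = 1) -> Dict[str, float]:
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    tn = sum(1 for t, p in pairs if t != positive and p != positive)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return {"accuracy": (tp + tn) / len(pairs), "precision": precision,
            "recall": recall, "f1": f1}


def auc(y_true: List[int], scores: List[float], positive: int = 1) -> float:
    """Probability that a random positive is scored above a random negative."""
    pos = [s for t, s in zip(y_true, scores) if t == positive]
    neg = [s for t, s in zip(y_true, scores) if t != positive]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


# Toy example: four students, label 1 = incorrect.
toy_true = [1, 1, 0, 0]
toy_pred = [1, 0, 0, 1]
toy_scores = [0.9, 0.4, 0.3, 0.6]
toy_metrics = classification_metrics(toy_true, toy_pred)
toy_auc = auc(toy_true, toy_scores)
```

In the toy example all four threshold-based metrics equal 0.5 while the AUC is 0.75, which illustrates that AUC depends on the ranking of the scores rather than on a single decision threshold.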


Finally, it is important to emphasize that all models were
evaluated using semester-based temporal cross-validation for
both domains in this task, which uses only data from previous
semesters for training and is a much stricter approach than
standard cross-validation.
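
One way to realize such a semester-based temporal split is sketched below. The text does not spell out the exact protocol, so this sketch assumes an expanding window in which each semester is tested with all earlier semesters as training data; the function name is hypothetical.

```python
from typing import List, Tuple


def temporal_semester_splits(semesters: List[str]) -> List[Tuple[List[str], str]]:
    """Pair each semester (except the first) with all earlier ones for training.

    `semesters` must be listed in chronological order.
    """
    return [(semesters[:idx], semesters[idx]) for idx in range(1, len(semesters))]


# The four iSnap semesters in chronological order.
isnap_splits = temporal_semester_splits(["S16", "F16", "S17", "F17"])
```

Under this assumption the first semester is never used for testing, so the iSnap test data comes from F16, S17 and F17 only.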


3.4 Results 


Tables 2 and 3 compare the performance of the five models in
iSnap and CodeWorkout, respectively. In iSnap, among the
three token-based models, LG and SVM perform very similarly,
as both have an accuracy of 0.6604; moreover, the best AUC
and Precision are from LG and the best Recall and F1 are from
SVM. Both LG and SVM outperform KNN on all metrics. In
CodeWorkout, Table 3 shows that the best accuracy, AUC, and
Precision are from SVM and the best Recall and F1 are from
KNN. Between the two AST-based models, ASTNN outperforms
Code2Vec in both domains. This suggests that across the two
different student programming environments, ASTNN is more
effective than Code2Vec on the task of student program
classification.


The comparison between AST-based models and token-based
models shows that the former significantly outperform the
latter in both domains; the only exception is that SVM with
tokens has the highest precision in Java (Table 3). Note
that the difference between SVM and ASTNN on Precision is
rather small, while the former has a much worse accuracy,
F1 score, and AUC than ASTNN.


Table 2: Student Program Classification Results in iSnap

Feature  Model              Accuracy     Precision    Recall       F1_score     AUC
-        Majority Baseline  0.6321       -            -            -            0.5
Tokens   KNN                0.6132       0.7321       0.6119       0.6667       0.6137
Tokens   LG                 0.6604       0.8298       0.5821       0.6842       0.6885
Tokens   SVM                0.6604       0.7460       0.7015       0.7231       0.6456
ASTs     Code2Vec           [illegible]  [illegible]  [illegible]  [illegible]  [illegible]
ASTs     ASTNN              0.8113       0.8730       0.8209       0.8462       0.8079

Note: best models in each group are in bold, and the overall best labeled with **


Table 3: Student Program Classification Results in CodeWorkout

Feature  Model              Accuracy     Precision    Recall       F1_score     AUC
-        Majority Baseline  0.5430       -            -            -            0.5
Tokens   KNN                0.8709       0.8915       0.8679       0.8795       0.8712
Tokens   LG                 0.8299       0.8922       0.7811       0.8330       0.8345
Tokens   SVM                0.8770       0.9437**     0.5093       0.6616       0.8822
ASTs     Code2Vec           [illegible]  0.9299       [illegible]  [illegible]  [illegible]
ASTs     ASTNN              0.9529       0.9416       0.9736       0.9573       0.9509

Note: best models in each group are in bold, and the overall best labeled with **


To summarize, our results show that ASTNN achieves the best
performance in both domains. These results show that by
capturing the meaningful linguistic structure in student
code, ASTNN is indeed more robust on the task of student
program classification. Given its effectiveness, we further
explored Temporal-ASTNN, which combines ASTNN with the
powerful temporal model LSTM, on the task of student success
early prediction.


4. STUDENT SUCCESS EARLY PREDICTION 


For the student success early prediction task, Temporal-ASTNN
is compared against the original ASTNN and other temporal
models. As mentioned in Section 3.1.2, here we only
explored the early prediction task in iSnap.


4.1 Task Description 

In iSnap, we have a total of 171 students and 31,064 tem- 
poral snapshots. Following the definitions used in prior re- 
search [23], the successful students are those who completed 
the programming assignment within one hour and got full 
credit while the rest are counted as unsuccessful. We have 59 
successful and 112 unsuccessful ones. The detailed statistics 
for iSnap dataset are shown in Table 4. Note that for the 
purpose of learning, unsuccessful students are of interest for 
this classification task. 


To predict student success early, we are given up to the
first n minutes of a student’s sequence data and our goal is
to predict whether the student will successfully complete the
programming assignment at any given point in the remainder of
the sequence. To conduct this task, we left-aligned all the
students’ trajectories by their starting times, and our
observation window (the part of the data used to train and
test the different machine learning models) includes the
sequences from the very beginning to the first n minutes. If
a student’s trajectory is shorter than n minutes, our
observation window includes their entire sequence except the
last snapshot.
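
The observation-window rule described above can be sketched as a small function. The sketch assumes that a trajectory is a time-sorted list of `(timestamp_in_seconds, snapshot)` pairs and that a snapshot taken exactly at the n-minute mark is still inside the window; both the data layout and the boundary choice are assumptions, as is the function name.

```python
from typing import Any, List, Tuple

Trajectory = List[Tuple[float, Any]]  # (timestamp in seconds, snapshot)


def observation_window(trajectory: Trajectory, n_minutes: float) -> Trajectory:
    """Keep the snapshots from the first n minutes of a left-aligned trajectory."""
    if not trajectory:
        return []
    start = trajectory[0][0]
    limit = n_minutes * 60
    duration = trajectory[-1][0] - start
    if duration < limit:
        # Short trajectory: use everything except the last snapshot.
        return trajectory[:-1]
    return [(t, snap) for t, snap in trajectory if t - start <= limit]


# Toy trajectories with a 4-minute window.
long_traj = [(0, "a"), (60, "b"), (300, "c"), (600, "d")]
short_traj = [(0, "a"), (30, "b"), (90, "c")]
window_long = observation_window(long_traj, 4)
window_short = observation_window(short_traj, 4)
```

Dropping the last snapshot of a short trajectory keeps the student's final state out of the model input, which is what makes the task an early prediction rather than a classification of the finished program.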


It is worth noting that student success early prediction is a 
much more challenging task compared to program classifi- 
cation: 1) besides the linguistic nature in student code, it 
also involves temporal information, and 2) the observation 
window is very early and thus student final submissions are 
not available for training or testing. 


4.2 Experiments 


4.2.1 Models Configuration 

To further explore the power of ASTNN, we conducted extensive
experiments and compared it with the state-of-the-art
expert-designed features [43] and token-based features on the
student success early prediction task. For each of the
feature embeddings (expert, token, AST), we explored two
categories of models: the last-value-based Logistic
Regression (LG) models and the temporal LSTM models. Note
that LG is selected because, among the three classic ML
methods explored on the task of student program
classification in iSnap, LG achieved the highest accuracy and
AUC.


Last- Value Models: Motivated by prior work, we used a “Last 
Value” approach [4, 37, 23] to treat the last measurements 
within the given observation window as the input to train 
models. For early prediction settings, we truncated all the 
sequences in the training dataset in the same way as the 
testing dataset. For example, when our observation window 
is the first 4 minutes, we will only apply the last values in 

Table 4: Detailed data statistics for iSnap, including total steps, total time spent in minutes, and the success label distribution for each of the four semesters.

          |          Total Steps           |         Total Time (minutes)           |      Success Labels
Semester  | min  max   median  mean (std)  | min    max      median  mean (std)      | successful  unsuccessful
S16       | 10   1024  169     199 (175)   | 0.533  95.667   20.733  22.777 (17.149) | 23          42
F16       | 28   884   121     167 (168)   | 3.283  119.083  16.325  22.379 (24.177) | 15          23
S17       | 15   439   75      112 (94)    | 2.817  62.983   14.167  16.347 (11.872) | 12          17
F17       | 10   2276  100     219 (376)   | 1.65   189.667  19.1    28.224 (33.869) | 9           30


Table 5: iSnap Student Success Early Predictions at First-4-minute Only

Data        | Feature | Model                  | Accuracy  | Precision | Recall    | F1 Score  | AUC
            |         | Majority Baseline      | 0.6604    | -         | -         | -         | 0.5
Last-Value  | Expert  | Expert-LG              | 0.6226    | 0.8261**  | 0.5429    | 0.6552    | 0.6603
Last-Value  | Tokens  | Token-LG               | 0.5566    | 0.7170    | 0.5429    | 0.6179    | 0.5631
Last-Value  | ASTs    | ASTNN                  | 0.6698    | 0.7612    | 0.7286    | 0.7445    | 0.6421
Temporal    | Expert  | Expert-LSTM            | 0.7075    | 0.7191    | 0.9143    | 0.8050    | 0.6099
Temporal    | Tokens  | Token-LSTM             | 0.6792    | 0.6915    | 0.9286**  | 0.7927    | 0.5615
Temporal    | ASTs    | Temporal-ASTNN_Trunc   | 0.7642**  | 0.7711    | 0.9143    | 0.8366**  | 0.6933**
Temporal    | ASTs    | Temporal-ASTNN_Entire  | 0.7453    | 0.7722    | 0.8714    | 0.8188    | 0.6857


Note: best models in each group are in bold, and the overall best labeled with ** 


the sequence within the first-4-minute observation window 
and use them as inputs for each model. More specifically, 
we used the expert features of the last submission within the 
observation window to train and test expert-LG; similarly, 
the tokens from the last snapshot within the observation 
window to train and test token-LG; and the ASTs of the last 
submission within the observation window for both training 
and testing the original ASTNN. 
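
A small sketch of the last-value selection described above. The function name `last_value`, the `(minutes, feature)` pair representation, and the inclusive window boundary are illustrative assumptions; for simplicity the sketch also ignores the special handling of trajectories shorter than the window.

```python
from typing import Optional, Sequence, Tuple, TypeVar

F = TypeVar("F")  # feature type: expert feature vector, token list, or AST


def last_value(features: Sequence[Tuple[float, F]], n_minutes: float) -> Optional[F]:
    """Return the feature of the last snapshot taken within the first n minutes.

    `features` holds chronologically ordered (minutes_since_start, feature) pairs.
    Returns None when no snapshot falls inside the window.
    """
    in_window = [feat for (t, feat) in features if t <= n_minutes]
    return in_window[-1] if in_window else None
```

Applying the same function to both training and testing sequences mirrors the statement that training sequences are truncated in the same way as testing sequences.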


Temporal Models: We applied LSTM to handle the temporal
sequences of student code, using the temporal sequences in
the observation window for early prediction. Specifically,
for a given first-n-minute observation window, we used the
sequences of expert features to train and test Expert-LSTM,
and the sequences of token features to train and test
Token-LSTM. For Temporal-ASTNN, we explored two variants,
Temporal-ASTNN_Trunc and Temporal-ASTNN_Entire. Both
models first convert the student code sequences in the
observation window into sequences of AST vectors and then
feed them into an LSTM. They differ only in how their AST
vectors are trained: the former uses truncated sequences,
while the latter uses entire sequences (see Section 2.1.3).


To summarize, we analyze two main model settings: last- 
value and temporal, together with three different feature 
embeddings: expert, tokens, and ASTs. Thus in total we 
explored the effectiveness of six models. 


4.2.2 Evaluation Metrics

For student success early prediction, all the models are eval- 
uated using Accuracy, Precision, Recall, F1 Score, and AUC. 
Similarly to the first task, we consider Accuracy and AUC as 
the most important metrics, and the more stringent semester- 
based temporal cross-validation was carried out. 
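
The evaluation protocol can be illustrated with a short sketch. It assumes that semester-based temporal cross-validation means testing on each semester after the first while training on all earlier semesters; the exact protocol is defined in an earlier section, so this reading, along with the function names `temporal_splits` and `binary_metrics`, is an assumption. AUC is omitted because it requires predicted scores rather than hard labels.

```python
from typing import Dict, List, Sequence, Tuple


def temporal_splits(semesters: Sequence[str]) -> List[Tuple[List[str], str]]:
    """Pair every semester after the first with all earlier semesters as training data."""
    return [(list(semesters[:i]), semesters[i]) for i in range(1, len(semesters))]


def binary_metrics(y_true: Sequence[int], y_pred: Sequence[int]) -> Dict[str, float]:
    """Accuracy, precision, recall, and F1 for non-empty 0/1 labels (1 = positive class)."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return {
        "accuracy": (tp + tn) / len(pairs),
        "precision": precision,
        "recall": recall,
        "f1": f1,
    }
```

Under this reading, the four iSnap semesters would yield three train/test folds, with the earliest semester used for training only.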


4.3 Results 


We present our results of student success early prediction by
first comparing the effectiveness of all six models on
first-4-minute early prediction and then by exploring their
average performance across different observation windows up to
the first-10-minute data.


4.3.1 Results at First-4-minute Only 

Table 5 shows the performance measures of all six models at
the first 4 minutes. In the group of Last-Value models, ASTNN
has the best accuracy, recall, and F1 score, while the best
AUC and precision come from Expert-LG; both outperform
Token-LG. In fact, in terms of accuracy, Expert-LG and
Token-LG perform worse than the simple majority baseline.
This is probably either because the first 4 minutes are too
early or because the last snapshot within the first 4 minutes
does not provide enough information for these models to make
effective early predictions. The fact that, across the five
evaluation metrics, the best performance comes from either
the expert features or ASTNN suggests that ASTNN is
comparable to expert-designed features because of its ability
to handle the linguistic structure of student code.


In the Temporal group, the Temporal-ASTNN based models are
the best. More specifically, both Temporal-ASTNN_Trunc and
Temporal-ASTNN_Entire outperform Expert-LSTM and
Token-LSTM on accuracy, AUC, precision, and F1 score; only
the best recall comes from Token-LSTM. Between the two
Temporal-ASTNN models, Temporal-ASTNN_Trunc is generally
better than Temporal-ASTNN_Entire, as it achieves higher
accuracy, recall, F1 score, and AUC. This is probably
because, by using the truncated training data for
representation learning, Temporal-ASTNN_Trunc is more likely
to capture temporal information that is not only predictive
of student success but also likely to be observed at test
time with only the first-4-minute data.


When further comparing temporal models with last-value 
models, we can see that all temporal models achieve better 
accuracy than their corresponding last-value models. It is 
reasonable since temporal models are able to capture the 
temporal information related to student success from the 

Table 6: iSnap Student Success Early Predictions in First-10-minute Overall (mean, with standard deviation in parentheses)

Data        | Feature | Model                  | Accuracy         | Precision        | Recall           | F1 Score         | AUC
            |         | Majority Baseline      | 0.6604           | -                | -                | -                | 0.5
Last-Value  | Expert  | Expert-LG              | 0.6566 (0.05)    | 0.8209** (0.06)  | 0.6229 (0.11)    | 0.7017 (0.06)    | 0.6725 (0.03)
Last-Value  | Tokens  | Token-LG               | 0.5528 (0.02)    | 0.7072 (0.03)    | 0.5571 (0.07)    | 0.6203 (0.04)    | 0.5508 (0.03)
Last-Value  | ASTs    | ASTNN                  | 0.6642 (0.01)    | 0.7635 (0.03)    | 0.7200 (0.07)    | 0.7379 (0.02)    | 0.6378 (0.03)
Temporal    | Expert  | Expert-LSTM            | 0.7189 (0.03)    | 0.7305 (0.05)    | 0.9343 (0.04)    | 0.8145 (0.01)    | 0.6255 (0.07)
Temporal    | Tokens  | Token-LSTM             | 0.6887 (0.02)    | 0.6966 (0.03)    | 0.9429** (0.04)  | 0.8001 (0.01)    | 0.5687 (0.05)
Temporal    | ASTs    | Temporal-ASTNN_Trunc   | 0.7396 (?)       | 0.7597 (0.03)    | 0.8914 (0.03)    | 0.8190** (0.01)  | 0.6679 (0.04)
Temporal    | ASTs    | Temporal-ASTNN_Entire  | 0.7472** (0.02)  | 0.7932 (0.04)    | 0.8316 (0.04)    | 0.8110 (0.02)    | 0.6943** (0.03)
Note: best models in each group are in bold, and the overall best labeled with ** 


temporal sequences, but such information is not available to 
last-value models. 


Generally speaking, Temporal-ASTNN achieves the best
performance at the first-4-minute observation window, which
indicates that by combining ASTNN with LSTM, Temporal-ASTNN
is able to learn both the temporal and the linguistic
knowledge from student code sequences.


4.3.2 Results in First-10-minute Overall 

Figure 6 (a) and (b) report Accuracy and AUC, respectively,
for four models predicting student success: three
temporal models and the best last-value model, ASTNN.
For each graph, the x-axis is the observation window of early
prediction, which we vary from the first 2 minutes up to
10 minutes, and the y-axis is the Accuracy/AUC score. As
shown in Table 1, students generally take 10 to 60 minutes
to complete the task, and thus we took a measurement every
2 minutes for the first 10 minutes to generate the
early-stage predictions for each model. Table 6 shows the
comparison of all six models for student success early
prediction across the first-10-minute observation windows;
we report the mean value and the corresponding standard
deviation (in parentheses) for each evaluation metric.


Table 6 shows a pattern similar to the one observed in
Table 5. In the group of Last-Value models, ASTNN
outperforms Expert-LG and Token-LG. Specifically, ASTNN
continues to achieve the best accuracy, recall, and F1 score
in the first 10 minutes, and Expert-LG has the best AUC
and precision. In the group of temporal models, the
Temporal-ASTNN based models are still the best overall,
with higher scores on accuracy, AUC, precision, and F1.
Additionally, Temporal-ASTNN_Entire is shown to be slightly
better than Temporal-ASTNN_Trunc, as it achieves higher
accuracy, AUC, and precision.


Both Figure 6 (a) and (b) show that Temporal-ASTNN_Entire
is the best model for student success early prediction, as it
stays on top across all sizes of the observation window.
As the observation window lengthens, all temporal
models generally perform better, while the performance of
last-value models fluctuates. This is because the training
data includes more and more information, and thus the
performance of temporal models improves over longer
sequences. After 6 minutes, Expert-LSTM starts to perform
as well as Temporal-ASTNN, which is not surprising, since
the expert features are designed to detect student states for
final grading, and student states move closer and closer
to their final submissions as the sequences grow longer. The
fact that the best early predictions come from
Temporal-ASTNN suggests that addressing both the linguistic
and the temporal nature of student code sequences brings us
closer to the truth of student learning progression during
programming, especially in the early stage (first 6 minutes).


5. RELATED WORK 


5.1 Linguistic-based Models for Programming 
A wide range of work has applied NLP techniques to
programming. Traditionally, some prior work directly uses the
tokens of ASTs for source code tasks [38, 12], treating
programming languages as natural languages. Despite some
similarities, programming languages and natural languages
[25] differ in some important aspects. Programming is a
complex activity, and thus programs contain rich and
explicit structural information. Recently, deep learning models
have shown the potential to grasp more information from ASTs
in many tasks. For example, TBCNN [24] takes the whole
AST of code as input and performs convolution over tree
structures, and it outperforms token-based models in
program functionality classification and bubble-sort
detection. In the educational domain, Piech et al. (2015)
proposed NPM-RNN to simultaneously encode preconditions
and postconditions into points where a program can be used
as a linear mapping between these points [30]. Gupta et al.
(2019) presented a tree-CNN based method that can localize
the bugs in a student program with respect to a failing
test case, without running the program [14]. More recently,
ASTNN and Code2Vec have shown great success.


Building on the AST, ASTNN [42] was proposed to handle
the long-term dependency problems that arise when a
large AST is taken as input directly. An AST represents the
abstract syntactic structure of source code [5], and it




[Figure 6, panels (a) Accuracy performance and (b) Area under ROC performance: each panel plots the score (y-axis) against the observation window of 2, 4, 6, 8, and 10 minutes (x-axis) for Majority Baseline, ASTNN, Expert-LSTM, Token-LSTM, Temporal-ASTNN_Trunc, and Temporal-ASTNN_Entire.]

Figure 6: Student Success Early Prediction on iSnap. Last-value
models are in dashed lines with empty symbols, temporal
models are in solid lines with solid symbols, and dark grey
lines are from the majority baseline.


has been widely used in the domain of source code analysis.
Similar to long texts in NLP, large ASTs can make deep
learning models vulnerable to vanishing gradient problems.
To address this issue, ASTNN splits the large AST of one
code fragment into a set of small statement-level trees and
performs code vector embedding. It achieves state-of-the-art
performance in both code functionality classification and
clone detection.


Code2Vec [1], on the other hand, utilizes AST-based paths
and an attention mechanism to learn code vector
representations. Instead of a set of ST-trees, it takes a
collection of leaf-to-leaf paths as input and applies an
attention layer to average those vectors. As a result, the
attention weights can help to interpret the importance of
paths. Code2Vec has been shown to be very effective in
predicting the names of program entities. Shi et al. (2021)
also applied Code2Vec to a block-based programming dataset
and used the learned embedding to cluster incorrect student
submissions [34].


As far as we know, no prior work has directly compared
the effectiveness of ASTNN against Code2Vec. In this
work, we conducted extensive experiments across two
programming domains: one is a block-based novice programming
environment where the data size is relatively small; the other
is a web programming platform in Java, in which more
labeled data is available. Our results consistently suggest that
ASTNN is able to capture more insights from student
programs for correctness prediction.


5.2 Student Modeling for Programming 
Student modeling has been widely and extensively explored
by utilizing student temporal sequences. For example, BKT
[8] and BKT-based models have been shown to be effective in
predicting students’ overall competence [26], predicting
students’ next-step responses [41, 3, 27, 20], and predicting
post-test scores [18, 22]. In recent years, deep learning
models, especially Recurrent Neural Networks (RNNs) and
RNN-based models such as LSTM, have also been explored
in student modeling [29, 36, 17, 39, 40, 19]. Some work
showed that LSTM has superior performance over BKT-based
models [22, 29] or Performance Factors Analysis [28].
However, it has also been shown that RNN and LSTM do
not always perform better when the simple, conventional
models incorporate other parameters [17, 39].


In the programming domain, prior research has explored
various temporal models for modeling student learning
progression. For example, Wang et al. (2017) applied a recursive
neural network similar to [30] to embed student
submission sequences and then fed the embeddings into a
3-layer LSTM to predict the student’s future performance.
Note that their work is quite different from our proposed
Temporal-ASTNN: in Temporal-ASTNN, all the components are
optimized together during training, while they applied a global
embedding to generate the input sequences for the LSTM. On
the other hand, Emerson et al. (2019) utilized four
categories of features (prior performance, hint usage, activity
progress, and interface interaction) to evaluate the accuracy
of Logistic Regression models for multiple block-based
programming activities [11]. In our earlier work, we used
expert-designed features for a block-based programming
problem to train various temporal models and then made early
predictions on student learning outcomes [21, 23].


To the best of our knowledge, while most previous studies on
analyzing student programming data treated student code as
either linguistic or temporal, no prior work has combined the
two characteristics of programming data to model student
learning progression. Thus, our proposed Temporal-ASTNN is the
first attempt to address both aspects of student code.


6. CONCLUSIONS 


Tracing student learning progression at an early stage is a
crucial component of student modeling, since it allows tutoring
systems to intervene by providing needed support, such as
a hint, or by alerting an instructor. Both prediction tasks
involved in this work are challenging, especially the early
prediction task, because: 1) the open-ended nature of the
programming environment hinders the prediction of students’
final success, and 2) it is extremely hard to learn a meaningful
representation from student code. In this work, we
conducted a series of experiments to investigate the
effectiveness of Temporal-ASTNN for modeling student learning
progression. We first evaluated ASTNN against Code2Vec on
the task of classifying the correctness of student programs
across two domains. Our results show that ASTNN consistently
outperforms the other models, including Code2Vec and the
token-based baselines, in both domains. We also
found that AST-based models generally achieve better
performance than token-based models, which is consistent with
prior research [24, 42]. In the second task, student success
early prediction, we explored three different categories of
features (expert, tokens, and ASTs) and further compared
Temporal-ASTNN with other temporal models built on
different feature sets, as well as with non-temporal baselines.
Our findings can be summarized as follows: 1) temporal models
usually outperform non-temporal (last-value) models; 2)
token-based models can only capture very limited
information from student code; and 3) Temporal-ASTNN is the
best of all models in the early prediction task, achieving
good performance with only the first-4-minute data.


Limitations: There are two main limitations in this work.
First, we only explored the effectiveness of Temporal-ASTNN
on one important student modeling task in one programming
environment, and thus it is not clear whether the same
results will hold for different tasks or in other programming
domains. Second, time-aware LSTM [6] has been shown to
outperform LSTM on various early prediction tasks [23], while in
this work we only compared our Temporal-ASTNN against
a standard LSTM without considering time-awareness.
Nevertheless, one of the main goals of this work is to investigate
the robustness of Temporal-ASTNN from both sequential
and temporal embedding. Thus we considered two different types
of models (last-value vs. temporal) as well as two other
feature sets (expert and tokens). Our experimental
results have shown its superiority in both respects, but
the effects of time-awareness remain unclear.


Future Work: An important direction for future work is to
investigate time-awareness in Temporal-ASTNN to
determine how it contributes to the model on the same task. In
addition, we plan to apply Temporal-ASTNN to
other temporal tasks and different domains to explore whether
it continues to yield improvements in programming
environments. We will also apply this work to larger groups of
students and longer programming tasks, and integrate
more informative features, such as intervention and
demographic features, to develop more robust models.

ABSTRACT


Constructing effective and well-balanced learning groups is 
important for collaborative learning. Past research explored how 
group formation policies affect learners’ behaviors and 
performance. With the different classroom contexts, many group 
formation policies work in theory, yet their feasibility is rarely 
investigated in authentic class sessions. In the current work, we 
define feasibility as the ratio of students being able to find 
available partners that satisfy a given group formation policy. 
Informed by user-centered research in K-12 classrooms, we 
simulated pairing policies on historical data from an intelligent 
tutoring system (ITS), a process we refer to as SimPairing. As 
part of the process for designing a pairing orchestration tool, this 
study contributes insights into the feasibility of four dynamic 
pairing policies, and how the feasibility varies depending on 
parameters in the pairing policies or different classes. We found 
that on average, dynamically pairing students based on their 
in-the-moment wheel-spinning status can pair most struggling 
students, even with moderate constraints of restricted pairings. In 
addition, we found there is a trade-off between the required 
knowledge heterogeneity and policy feasibility. Furthermore, the 
feasibility of pairing policies can vary across different classes, 
suggesting a need for customization regarding pairing policies. 


Keywords 


Peer tutoring, Learning Group formation (LGF), Pairing Policies, 
CSCL 


1. INTRODUCTION 


Constructing effective, well-balanced learning groups is an 
important task in computer-supported collaborative learning 
(CSCL) [1-3]. The importance of learning group formation (LGF)
has been validated empirically [4,5]. For instance, Webb et al.’s
experiment showed that group composition had a major impact on
the quality of group discussion and students’ test scores, both
during group work and on subsequent individual tests [5]. The
majority of existing approaches to LGF do not support dynamic
group formation [1]. Dynamic group formation refers to the
process of groups “created on demand while various 
domain-specific restrictions have to be considered” [6], or can 
“adapt to and benefit from previous information about group 
members and their abilities” [7,8]. Compared to static, 
pre-planned LGF, the dynamic composition of groups allows for 
quick regrouping of learners based on up-to-date information 
regarding their progress and struggle. Dynamic group formation is 
an interesting issue, as researchers start envisioning more 
sophisticated and personalized classroom interactions [9] and 
more fluid social transitions (i.e., student social transitions that 
occur not all at the same time for everyone in the class) [10], that 
are more challenging to orchestrate. 


In the context of an Intelligent Tutoring System (ITS) that 
supports both individual and collaborative learning, it is useful to 
investigate whether dynamically switching students between the 
two modes, as the need arises, can be effective and feasible. 
Pairing policies that work well in practice ideally have 
characteristics of both effectiveness and feasibility. By effective 
we mean that the pairing policy leads to students’ reaching desired 
learning goals, and by feasible we mean that enough partners can 
be found under the given grouping policies (i.e., good policy 
coverage). Specifically, we defined feasibility as the percentage of 
students who can be teamed up under a given pairing policy. 


The feasibility of LGF is an important issue to investigate in 
designing orchestration tools for teachers, and can be a central 
concern at the initial stage of tool design. This is because during
the initial design stages we often do not yet have data to
rigorously evaluate the effectiveness of LGF, since testing LGF
requires human resources (learners, instructors), material
resources (devices, systems), and a long period of time.
Additionally, an effective pairing policy that only covers a small 
percentage of students in a classroom may have limited influence 
for the whole class. Thus, the feasibility of LGF can be important 
in providing context for the potential coverage of LGF in a class. 

Literature on LGF in collaborative learning is vast. Researchers
have paired students based on gender [11,12], learning style 
[13-16], students’ social network [17], and their intelligence or 
task proficiency [18-20]. Heterogeneous and homogeneous group 
formation are two main approaches in team formation, and many 
studies have demonstrated their effectiveness in CSCL 
[1,8,18,20-22]. Students’ knowledge level is argued to be the 
most suitable and important attribute to form educational groups 
[8]. Prior work has also used machine learning or other algorithms 
to incorporate multiple factors for optimizing team formation 
[23-26]. However, the literature on LGF provides little insight 
into the feasibility of these LGF policies, especially in the ITS 
context. 


Evaluating the feasibility of LGF policies offline, prior to 
implementing them in real classrooms, is challenging, given a 
lack of readily accessible approaches. To address this problem, we 
adopt a process we call “SimPairing”, to simulate pairing policies 
on authentic data and evaluate their feasibility. In this process, we 
used transaction data from several classes of students using an 
ITS, collected from classroom studies conducted in U.S. middle 
schools. We computed and analyzed how the feasibility of several 
LGF policies (described below) changed as each class progressed, 
and how the feasibility varied across different classes. Replaying 
historical data to simulate possible futures (e.g., Replay 
Enactment [27]), has been used as a method by researchers to 
design tools with similar data-driven, human-centered approaches 
[28]. Diana et al. [29], for instance, used machine learning (ridge 
regression) to predict students’ grades based on historical data in 
CS education (i.e., programming). Based on these predicted 
grades and simulated students’ “helped” status, they determined 
which students needed help and which may be able to provide 
help. They then used a network graph of code-state to search for 
potential peer tutors who shared a common ancestor node with the 
tutee. They found that grouping low-performing students together 
and using better model features can increase the number of 
students helped. Their findings suggest that using low-level log 
data to group and match low-performing students with a peer tutor 
may be an effective way to increase the amount of help given in a 
classroom. In contrast, we simulated different policies selected 
based on literature and teachers’ common practice revealed in 
user-centered research with K-12 teachers [10,30,31], in a 
mathematics education context. 


The current work is, to the best of our knowledge, the first to look 
at dynamic pairing policies that consider students’ in-the-moment 
wheel-spinning status. Identifying students who are
struggling unproductively yet failing to master the skill (i.e.,
wheel spinning) is a first step to getting them unstuck [32]. While
there has been significant work on modeling and predicting wheel 
spinning [33-35], little work has been dedicated to developing 
interventions to get them unstuck, with a few recent exceptions 
[36,37]. While a typical classroom has students who are 
struggling on problems and those who have excelled on the same 
problem, the latter students’ expertise is rarely utilized. Instead, 
often the only source of help is the instructor, who is likely unable 
to help all the students who need help within the time constraints 
of the class period [38]. Peer tutoring (i.e., pairing a struggling 
student with a peer tutor) could be an effective way to help get 
struggling students unstuck when the instructor has their hands 
full. 


Lastly, instead of prescribing a specific grouping criterion, our 
work envisions that instructors will customize pairing policies and 
parameters to their classroom contexts, which prior work argued
to be especially helpful in the LGF process [1,8,30,39]. Amara et
al. found that most of the proposed LGF solutions do not allow 
instructors to customize the grouping process [1]. They argued 
that it is less helpful to apply a grouping solution for all types of 
learners, and more useful to leave the choice to instructors. 
Instructors can then form groups according to different learning 
objectives, learners’ needs, activity types, and customize the LGF 
process according to location and time [1]. Similarly, Echeverria 
et al. envision adaptability in an orchestration system, which 
“enables teachers to select the best pairing policies based on their 
particular goals, needs, and classroom dynamics” [30], to be 
helpful for different classrooms. In the current investigation, three 
of the four policies we studied involve an adjustable pairing 
threshold or parameter, which we simulated with various values. 


In sum, the current work investigates the feasibility of four 
dynamic LGF policies derived from user research with math 
teachers. We investigate from three angles: overall session 
simulation, class-level variance, and session-level contrasting 
cases. This work contributes feasibility results for the
dynamic pairing policies, recommendations for orchestration tool
design, and directions for future work on tools supporting
dynamic LGF.


2. STUDY CONTEXT 
2.1 Intelligent Tutoring Systems


This study used student transaction data collected from classroom 
studies in U.S. middle schools (dataset link). This data logged 
students’ interaction with an ITS called Lynnette, which offers 
guided practice to students in basic equation solving. ITSs (also
called AI-tutors) are increasingly common in K-12 classrooms to
help teachers more effectively personalize instruction [40]. As
shown in Figure 1, Lynnette provides step-by-step guidance in
the form of adaptive hints, correctness feedback, and
error-specific messages. Lynnette supports personalized mastery
learning and has been shown to improve students’
equation-solving skills in several classroom studies [41-43].


Figure 1. Example student interface for the ITS, Lynnette. The screenshot shows the prompt “Please solve for x” together with an adaptive hint: “You have only a variable term on the right: -x. Also, you have only a constant on the left: -1. You can get the variable by itself by dividing both sides by the coefficient of the variable.”


The transaction data logs detailed events by the timestamp of 
students’ interaction with the ITS, including but not limited to 
actions they take (e.g., requesting hints or attempting a step), 
knowledge components (KC) that a transaction involves, and skill 
mastery, calculated with a Bayesian Knowledge Tracing (BKT)
student model, a two-state Hidden Markov Model. BKT is a
popular student model that has been successful for various 
applications in the educational technology literature (e.g. [44]). 
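
For readers unfamiliar with BKT, the standard update of the mastery estimate after one observed step can be sketched as follows. This is the textbook BKT formulation, not Lynnette's implementation; the function name `bkt_update` and any parameter values used with it are illustrative assumptions.

```python
def bkt_update(p_known: float, correct: bool,
               p_transit: float, p_slip: float, p_guess: float) -> float:
    """One step of standard Bayesian Knowledge Tracing.

    p_known   : prior probability that the skill is mastered
    p_transit : probability of learning the skill at this opportunity
    p_slip    : probability of an incorrect answer despite mastery
    p_guess   : probability of a correct answer without mastery
    Returns the updated mastery probability after observing one response.
    """
    if correct:
        evidence_known = p_known * (1.0 - p_slip)
        evidence_unknown = (1.0 - p_known) * p_guess
    else:
        evidence_known = p_known * p_slip
        evidence_unknown = (1.0 - p_known) * (1.0 - p_guess)
    # Posterior mastery given the observed response (Bayes' rule).
    posterior = evidence_known / (evidence_known + evidence_unknown)
    # Account for the chance of learning at this opportunity.
    return posterior + (1.0 - posterior) * p_transit
```

With hypothetical parameters `p_transit=0.1`, `p_slip=0.1`, `p_guess=0.2` and a prior of 0.5, a correct step raises the estimate while an incorrect step lowers it, which is the quantity a mastery threshold is later compared against.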


The current work lays a foundation to (in the future) use Lynnette 
in combination with a second ITS, APTA, which extends 
Lynnette’s functionality to support reciprocal peer tutoring. APTA 
allows two students to respectively take the role of tutor and tutee. 
In APTA, the tutee can seek help from their partner, while the 
tutor can see the tutee’s progress, and help them to make progress 
with the math problem at hand. APTA supports the peer tutor in
tutoring the tutee. Classroom studies with APTA have
demonstrated that its adaptive support can improve the quality of 
help peer tutors give and improve students’ domain learning 
[45,46]. A future effort for the current work is to implement 
feasible student pairing policies in an orchestration tool, to 
support teachers in dynamically pairing students to work 
collaboratively in APTA. Such a tool plays a key role in our 
vision for the smart classroom of the future, in which students 
alternate fluidly between individual and collaborative learning. 


2.2 Wheel-Spinning Detector 

Detectors have been developed to detect student behaviors of 
interest (e.g., gaming the system, struggling) from the transaction 
data. Such detectors have been used to design dashboards or tools 
that can alert teachers of certain student status (e.g., [9]). In our 
policies 1 and 2, we paired students based on their struggle status 
indicated by a wheel-spinning detector. 


Wheel-spinning, as defined by Beck and Gong, denotes students
who fail to master a specific skill after many attempts in an
intelligent tutoring system [32]. We utilized a detector that
adopts the same criterion as defined by Beck and Gong [32]. The
detector is embedded in LearnSphere, a large learning
analytics infrastructure [47]. The detector considers students who
have had over ten practice opportunities yet still fail to reach a
skill mastery above 0.95 on a specific knowledge component (KC)
to be wheel-spinning on this KC [9]. Such prolonged repeated
struggles are likely to be an inefficient use of time for students
[32] and may contribute to a lack of motivation for future learning
[36]. Wheel-spinning is one type of unproductive struggle, and we
use struggling and wheel-spinning interchangeably in this paper.
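
The detector criterion stated above can be written as a small predicate. This is a sketch of the stated rule only, not the LearnSphere detector itself; the names `is_wheel_spinning` and `wheel_spinning_kcs` and the dictionary layout are hypothetical.

```python
from typing import Dict, List, Tuple

MASTERY_THRESHOLD = 0.95   # mastery must exceed this value to count as mastered
MIN_OPPORTUNITIES = 10     # wheel-spinning requires more than this many opportunities


def is_wheel_spinning(opportunities: int, mastery: float) -> bool:
    """True if a student has had over ten opportunities on a KC without reaching mastery."""
    return opportunities > MIN_OPPORTUNITIES and mastery <= MASTERY_THRESHOLD


def wheel_spinning_kcs(kc_state: Dict[str, Tuple[int, float]]) -> List[str]:
    """Return the KCs on which a student is wheel-spinning.

    `kc_state` maps a KC name to (practice opportunities so far, current mastery estimate).
    """
    return sorted(kc for kc, (opps, mastery) in kc_state.items()
                  if is_wheel_spinning(opps, mastery))
```

Because the rule is evaluated per KC, a student can be wheel-spinning on one skill while having mastered others.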


3. METHODS


We evaluated how feasible four pairing policies (described below) 
are, based on simulation with historical transaction data from 
Lynnette. We applied each pairing policy to data from each class 
session. For every minute in a session, we calculated the 
percentage of students who met the policy’s criterion for being 
teamed up. Based on this calculation, we evaluated policy 
feasibility using two measures, FI₁ and FI₂, defined in Section 3.2. In the 
simulation process, we did not make assumptions about how long 
simulated collaboration episodes would last. We foresee that in 
any of these episodes, students will be given the task of 
collaboratively solving several math problems; it is hard to predict 
how long that will take them. We thus did not simulate taking 
tutors or tutees out of the pool of students available for teaming 
up, or returning them to this pool, at the beginning and end of 
collaborative episodes, respectively. Although this simplification 
might introduce some inaccuracy into the simulation results, it 
may be hard to do better. As well, the asymmetric roles that 
paired-up students have in the pairing policies may limit the 
inaccuracy. For example, simultaneously keeping a struggling 
student and a non-struggling student in the pool, instead of taking 
them both out, might have offsetting effects in terms of feasibility. 


Our simulation involved four pairing policies, namely: 


Policy 1 - Struggle with Non-Struggle: Pairing students who are 
wheel-spinning (unproductive struggle) with students who are not 
wheel-spinning. 


Policy 2 - Pairing with Restriction: Pairing students who are 
wheel-spinning with those who are not wheel-spinning, with a 
varying pairing restriction (PR) rate β. The PR rate simulates 
restrictions regarding who can collaborate with whom, which in 
real life would be provided by the teacher. 


Policy 3 - Knowledge Difference Pairing: Pairing students whose 
knowledge levels (as measured by the tutor’s BKT) differ by more 
than a certain threshold α. 


Policy 4 - Knowledge Similarity Pairing: Pairing students whose 
knowledge levels (as measured by the tutor’s BKT) differ by less 
than a certain ceiling γ. 


The distinction in these policies aligns with Amara et al.’s 
categorization for dynamic group formation [1]: intra-session and 
inter-session grouping. Intra-session grouping allows for changing 
group members during the learning process, which is useful, for 
example, for synchronous mobile collaborative learning [1]. In 
inter-session grouping, groups are formed only before starting or 
after ending the learning process. Specifically, policies 1 and 2 fall 
under intra-session grouping since we simulated pairing students 
based on their in-the-moment struggle. These two policies also 
concern fluid social transitions [10], since the students in a given 
class may transition from individual to collaborative learning at 
different times. Our pairing policies 3 and 4 concern inter-session 
grouping, and pair students based on their initial knowledge level. 
To apply these policies, teachers or the tutoring system would 
assess students’ knowledge level, prior to (or at the beginning of) 
a class session. 


The research questions we aim to answer are: 


RQ1: Based on a pairing simulation done with students’ historical 
transaction data, how feasible are the four pairing policies? 


RQ2: How does varying the parameters in the pairing policies 
affect the feasibility of pairing students? 


RQ3: Does the feasibility of the pairing policies vary across 
classes or sessions and, if so, how? 


3.1 The Four Pairing Policies 
3.1.1 Policy 1: Struggle with Non-Struggle 


Description. Policy 1 utilizes the struggle detector (section 2.2) to 
pair students who are wheel-spinning with those who are not. The 
struggle detector assumes students’ wheel-spinning status to be a 
binary value for a given timestamp. Inspired by the work of Diana 
et al. [29], we categorized students in the Struggle Pool if they 
were wheel-spinning on at least one KC, indicating they could 
need help from a partner. Students not wheel-spinning on any KC 
were categorized in the Tutor Pool and considered as available 
tutors. We simulated pairing students in the Struggle Pool with 
students in the Tutor Pool. To determine the feasibility of this 
policy, we calculated the percentage of struggling students who 
had a potential partner (for more detail, see below). 
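
As a rough sketch of this step, the snippet below splits students into the two pools at a single timestamp and computes the share of struggling students with a potential partner. The text does not spell out the matching rule, so the sketch assumes one-to-one pairing (each available tutor serves at most one struggling student) and treats a minute without struggling students as fully feasible; all function and variable names are hypothetical.

```python
from typing import Dict, List, Set, Tuple


def split_pools(spinning_kcs: Dict[str, Set[str]]) -> Tuple[List[str], List[str]]:
    """Partition students by whether they are wheel-spinning on any KC.

    spinning_kcs maps a student ID to the set of KCs the student is
    wheel-spinning on at the current timestamp (empty set if none).
    """
    struggle_pool = sorted(s for s, kcs in spinning_kcs.items() if kcs)
    tutor_pool = sorted(s for s, kcs in spinning_kcs.items() if not kcs)
    return struggle_pool, tutor_pool


def share_with_partner(n_struggling: int, n_tutors: int) -> float:
    """Fraction of struggling students who have a potential partner.

    Assumption: one-to-one pairing, so at most n_tutors strugglers are served.
    Assumption: with no struggling students the policy is trivially feasible.
    """
    if n_struggling == 0:
        return 1.0
    return min(n_struggling, n_tutors) / n_struggling


status = {"s1": {"kc_a"}, "s2": set(), "s3": {"kc_a", "kc_b"}, "s4": {"kc_c"}}
struggle_pool, tutor_pool = split_pools(status)
print(struggle_pool, tutor_pool)
print(share_with_partner(len(struggle_pool), len(tutor_pool)))
```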


Rationale. Literature suggests that when students are 
wheel-spinning, giving them more of the same type of math 
problems to solve may not be productive [36]. When 
wheel-spinning, students would likely benefit from instructor 
attention or extra instruction. However, prior user research in the 
classroom (e.g. [9]) found that teachers often cannot help all 
struggling students. In this case, wheel-spinning students may 
benefit from a peer tutor’s help, which leads to a policy that seeks 
to dynamically find them partners [36]. 


3.1.2. Policy 2: Pairing with Restriction 

Description. Policy 2 is an extension to Policy 1, where we pair a 
struggling student with a non-struggling student, while enforcing 
a constraint that not all students are eligible for teaming up. The 
proportion of ineligible students is captured as the Pairing 
Restriction (PR) rate. The PR rate is used to simulate situations 
where the teacher prefers that certain students do not work 
together. Specifically, we simulate pairing students in the Struggle 
Pool with students in the Tutor Pool, while enforcing the 
restriction that a proportion β (0 < β < 1, step = 0.1) of students in 
the Tutor Pool are ineligible as partners. For example, a PR rate β 
of 0.2 means 20% of the students in the Tutor Pool have been 
restricted from working with any students in the Struggle Pool. It is 
important to know how these restrictions affect the feasibility of 
the policies. 
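
To make the effect of the PR rate concrete, the following sketch removes a share of the Tutor Pool before pairing. It is an illustration under stated assumptions, not the simulation code used in the study: the restricted count is rounded down to a whole student, pairing is one-to-one, and the function names are hypothetical.

```python
import math


def eligible_tutors(n_tutors: int, pr_rate: float) -> int:
    """Number of tutors left after restricting a share pr_rate of the Tutor Pool.

    Assumption: the restricted count is rounded down to a whole student.
    """
    if not 0.0 <= pr_rate <= 1.0:
        raise ValueError("pr_rate must lie in [0, 1]")
    # The small epsilon guards against floating-point products such as 28.999...
    restricted = math.floor(pr_rate * n_tutors + 1e-9)
    return n_tutors - restricted


def share_paired_with_restriction(n_struggling: int, n_tutors: int, pr_rate: float) -> float:
    """Share of struggling students with a partner under a given PR rate.

    Assumption: one-to-one pairing; no struggling students means fully feasible.
    """
    if n_struggling == 0:
        return 1.0
    return min(n_struggling, eligible_tutors(n_tutors, pr_rate)) / n_struggling


# Four struggling students and ten non-struggling students at one timestamp.
for rate in (0.0, 0.2, 0.5, 0.8):
    print(rate, share_paired_with_restriction(4, 10, rate))
```

In this toy setting the share stays at 1.0 until fewer eligible tutors remain than struggling students, which is the kind of threshold behavior the restriction is meant to probe.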


Rationale. We designed this policy based on results found in a 
survey we conducted with 54 middle-school math teachers on 
their pairing preferences in collaborative learning [31] and 
semi-structured interviews conducted with middle school 
teachers. Teachers expressed a desire to set constraints so that 
certain pairs of students are restricted from working together. 
Previous studies and user research by Olsen et al. and Echeverria 
et al. also informed the idea of ruling out certain pairings in 
advance [10,30]. Such restrictions usually arise from information 
or concerns teachers have about their students’ traits, behaviors 
and interpersonal relationships [8]. 


3.1.3 Policy 3: Knowledge Difference Pairing 
Description. In Policy 3, we pair students who have different 
Initial Knowledge (IK) levels. In a practical scenario, teachers 
may assess students’ knowledge through quizzes or exams. 
Alternatively, if the classrooms use ITS, teachers may have 
students practice several math questions individually, prior to 
transitioning into collaborative learning activities. 


To simulate this policy without having pre-assessment data, we 
used data from the tutoring sessions (captured in the log data) to 
compute students’ IK levels. Specifically, we computed a 
student’s IK for each KC as the average mastery over their first 
three opportunities for this KC. We wanted to use up only a small 
portion of the data from the tutoring session, so that the measure 
represents initial knowledge. In our datasets, three 
opportunities generally fall in the first quartile (25%) of students’ 
total number of opportunities for any given KC. Another reason 
we chose the cutoff of three is that a previous EDM study with 
ASSISTments data showed that student learning often appeared to 
occur after students have had ten opportunities with the target 
knowledge [48]. Thus one may assume that learners have learned 
little during their first three opportunities practicing a KC in the 
transaction data. The overall IK of a student S_j (j ∈ N) is 
calculated as the average of their IK across KCs. To calculate 
students’ IK more accurately, we limit our simulation to sessions 
that practiced the first (i.e., the most basic) level of KCs, 
involving 25 sessions. 
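
The IK computation described above can be sketched as follows. The input layout (a mapping from KC name to the chronological list of mastery estimates) and the function name are assumptions for illustration; only the "first three opportunities, then average across KCs" rule comes from the text.

```python
from statistics import mean
from typing import Dict, Sequence

FIRST_N_OPPORTUNITIES = 3  # cutoff named in the text


def initial_knowledge(mastery_by_kc: Dict[str, Sequence[float]]) -> float:
    """Overall IK of one student.

    Per KC, average the mastery estimates of the first three opportunities
    (or fewer if the student had fewer), then average across KCs.
    KCs without any opportunity are skipped (assumption).
    """
    per_kc = [
        mean(estimates[:FIRST_N_OPPORTUNITIES])
        for estimates in mastery_by_kc.values()
        if estimates
    ]
    return mean(per_kc)  # raises StatisticsError if the student has no data


student = {"kc1": [0.2, 0.4, 0.6, 0.9], "kc2": [0.5, 0.7]}
print(round(initial_knowledge(student), 3))
```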


KD was the difference between two students’ IK, denoted as 
KD(S_j, S_k) (j, k ∈ N), which was calculated as the absolute value 
of the difference between the two students’ IK: 

$$\mathrm{KD}(S_j, S_k) = \left| \mathrm{IK}(S_j) - \mathrm{IK}(S_k) \right|$$


Inspired by Huang and Wu’s work that proposed a clustering LGF 
method that considers a threshold of learner heterogeneity [49], 
this work similarly considers a KD threshold. For this policy, the 
KD of two students (S_1, S_2) should be at least α (0 < α < 1, 
step = 0.1) for them to be eligible to pair up. 
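
A minimal sketch of the eligibility rule follows. How eligible pairs translate into the count of paired students is not specified at this point in the text, so the sketch assumes a student counts as pairable when at least one classmate lies at a KD of at least α; a strict one-to-one matching would give an equal or lower share. Names are hypothetical.

```python
from itertools import combinations
from typing import Dict


def knowledge_distance(ik_a: float, ik_b: float) -> float:
    """KD: absolute difference between two students' initial knowledge."""
    return abs(ik_a - ik_b)


def share_with_eligible_partner(ik: Dict[str, float], alpha: float) -> float:
    """Share of students who have at least one classmate with KD >= alpha.

    Assumption: being pairable means having at least one eligible partner.
    """
    if not ik:
        raise ValueError("ik must contain at least one student")
    pairable = set()
    for a, b in combinations(ik, 2):
        if knowledge_distance(ik[a], ik[b]) >= alpha:
            pairable.update((a, b))
    return len(pairable) / len(ik)


session_ik = {"s1": 0.2, "s2": 0.3, "s3": 0.9}
for alpha in (0.5, 0.65, 0.8):
    print(alpha, share_with_eligible_partner(session_ik, alpha))
```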


Rationale. The heterogeneous pairing policy was informed by 
findings from user research with math teachers. In the survey 
conducted with 54 math teachers, we found the most common 
way teachers paired students was pairing those who have a 
different level of knowledge (67%, N = 34) [31]. In our study, we 
use students’ mastery of knowledge components (i.e., targeted 
math skills) calculated based on the BKT model to represent 
students’ knowledge. In a systematic literature review on LGF in 
CSCL, Maqtary et al. found the knowledge level is the most 
commonly used attribute in LGF, which they claim to be the most 
suitable and important attribute to form educational groups 
because of its effects on the group process [8]. 


There is a range of research that shows heterogeneous grouping 
can promote positive interdependence, better group performance, 
and effective interactions [1,49-52]. Heterogeneous group 
composition not only enhances elaborative thinking, but also leads 
learners to deeper understanding, better reasoning abilities, and 
accuracy in long-term retention [49,50]. Research also suggests 
that collaborative learning with heterogeneous group composition 
by characteristics such as gender, ability, achievement, 
social-economic status (SES), or race, can be beneficial [51]. 


3.1.4 Policy 4: Knowledge Similarity Pairing 
Description. Policy 4 is analogous to Policy 3, with the same 
definition of KD and IK as in Section 3.1.3. To pair students with 
similar knowledge, using the same calculation as Policy 3, this 
policy simulated pairing students that have a small KD. For two 
students (S_1, S_2) to be eligible to form a pair under this policy, 
their KD should be less than or equal to γ (0 < γ < 1, 
step = 0.1). For example, when γ = 0.2, two students with 
knowledge of 0.6 and 0.75 (KD = 0.15, below γ) would be 
eligible to pair, but another pair with knowledge of 0.5 and 0.8, 
respectively (KD = 0.3, above γ), would not be eligible. 
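
The worked example can be checked with a one-line predicate; the function name is hypothetical, and the comparison follows the "less than or equal to" wording of this section.

```python
def eligible_under_similarity(ik_a: float, ik_b: float, gamma: float) -> bool:
    """True if two students' knowledge distance is at most the ceiling gamma."""
    return abs(ik_a - ik_b) <= gamma


print(eligible_under_similarity(0.6, 0.75, 0.2))  # True: KD is about 0.15
print(eligible_under_similarity(0.5, 0.8, 0.2))   # False: KD is about 0.3
```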


Rationale. Policy 4 was inspired by prior literature and informed 
by user research. Literature suggests that homogeneous groups can 
be beneficial for students’ learning. For example, Fuchs et al. 
found homogeneous dyads generated greater cognitive conflict and 
produced better quality work than heterogeneous groups [22]. 
Additionally, among the 54 teachers we surveyed, 43% (N = 23) 
reported that they pair students with a similar level of knowledge 
[31]. This was the third most popular grouping method among 
teachers, following the strategies of pairing students with 
different knowledge (Policy 3) and pairing students randomly [31]. 


3.2 Metrics 
In this section, we describe the metrics to evaluate the pairing 
policies. We discuss how prior work informed the metric 
definitions, and how different metrics could be suitable to 
evaluate different policies. We build on Diana et al.’s work [29], 
who defined an Efficiency Index (EI) as a measure of a pairing 
algorithm’s performance, specifically: 

$$\mathrm{EI} = \frac{\text{LowPerformingStudentsHelped/BeingHelped}}{\text{LowPerformingStudents}}$$

We adapted EI into two metrics of interest for our pairing policies: 
Feasibility Index 1 and 2. FI₁ is the percentage of students who 
can be paired among all struggling students in a session. 

$$\mathrm{FI}_1 = \frac{\text{StrugglingStudentsCouldBeHelped}}{\text{TotalStrugglingStudents}}$$

FI₂ is the ratio of paired students among all the students in a 
session. 

$$\mathrm{FI}_2 = \frac{\text{StudentsPaired}}{\text{TotalStudents}}$$

For Policies 1 and 2: Given the goal to pair all struggling students 
in the session, FI₁ was a suitable measure for policy feasibility, 
showing what percentage of students who are wheel-spinning can 
get help. For Policies 3 and 4: Given the goal to pair all students 
in the session who satisfied a certain KD, FI₂ was a suitable 
measure for policy feasibility, as it calculated the percentage of 
the paired students out of the total students. 


3.3. SimPairing Approach 


There are three main steps in SimPairing: 1) data cleaning and 
preprocessing, 2) policy simulation, and 3) policy evaluation. The 
data cleaning and preprocessing step consists of clustering student 
transaction data into meaningful class sessions based on meta-data 
(e.g., student transaction timestamp, classes), and examining the 
distribution of students per class session to detect outliers. The 
policy simulation step takes the preprocessed transactional data 
and applies a pairing policy to class sessions. In the policy 
evaluation step, we computed the policy feasibility based on the 
simulation results, using the corresponding feasibility index (FI₁ 
or FI₂). We also observed how the FI changed by varying the 
parameters (i.e., KD and PR rate). 


4. ANALYSIS AND RESULTS 


4.1 Data Cleaning and Preprocessing 

We first clustered student transaction data into meaningful class 
sessions, based on timestamp, student ID, and class. We 
visualized student engagement for all class sessions based on 
transaction data, which allowed us to ensure that the sessions we 
analyzed had a continuous student interaction with the system, 
and helped us check for outliers (e.g., unusually short sessions). 
We excluded four outlier sessions: two sessions that had only one 
student, and two sessions that lasted less than 15 minutes, whereas 
sessions commonly lasted 40 minutes or more. 
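
The exclusion rule can be expressed as a simple filter. The `Session` record and its field names are hypothetical; the two cutoffs (more than one student, at least 15 minutes) are taken from the description above.

```python
from dataclasses import dataclass
from typing import List

MIN_STUDENTS = 2        # sessions with a single student are excluded
MIN_DURATION_MIN = 15   # sessions shorter than 15 minutes are excluded


@dataclass
class Session:
    session_id: str      # hypothetical identifier
    n_students: int
    duration_min: float


def drop_outlier_sessions(sessions: List[Session]) -> List[Session]:
    """Keep only sessions that pass both exclusion criteria."""
    return [
        s for s in sessions
        if s.n_students >= MIN_STUDENTS and s.duration_min >= MIN_DURATION_MIN
    ]


candidates = [
    Session("a", n_students=13, duration_min=41.0),
    Session("b", n_students=1, duration_min=45.0),
    Session("c", n_students=12, duration_min=9.5),
]
print([s.session_id for s in drop_outlier_sessions(candidates)])
```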


Transaction data of a total of 68 sessions, from six middle school 
math classes, collected from 2013 to 2014 were used for policy 
simulation. It consists of 894 students and 197,234 rows of 
transactions. The average number of students in a session was 13 
(Min = 5, Max = 24, SD = 25.3); the average duration of class 
session was 41.9 minutes (Min = 10, Max = 81, SD = 9.42); the 
average number of sessions in a class was 11 (Min = 3, Max = 23, 
SD = 9.33). 


4.2 Overall SimPairing Analysis 

In this section, we present, for each policy, the SimPairing 
analysis and the results. The goal for this analysis was to evaluate 
the overall feasibility of the four pairing policies (RQ1) and see 
how the feasibility depends on policy parameters (RQ2). 


4.2.1 Policy 1: Struggle with Non-Struggle 

We simulated Policy 1 for every minute in a given class session, 
which returned the number of struggling students who did or did 
not have a potential partner. Based on this we calculated the FI₁ 
for every minute in a class session. We then averaged FI₁ across 
the length of each class session, to obtain an average FI₁ for a 
given session. We refer to it as the Average Number of Struggling 
Students (ANSS). We then took the average of the ANSS across 
all sessions, to obtain an overall simulation result for all 68 
sessions. Figure 2 (green area) shows the average FI₁ for all 
sessions was 0.94 (SD = 0.007). Thus, on average, across time, 
94% of struggling students could be paired with a partner who 
was not struggling. 
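
The two-stage averaging (per minute within a session, then across sessions) can be sketched as below. The function names and the toy values are illustrative assumptions; the point is that each session contributes equally to the overall figure regardless of its length.

```python
from statistics import mean
from typing import Sequence


def session_average(per_minute_values: Sequence[float]) -> float:
    """Average a per-minute feasibility value over one class session."""
    return mean(per_minute_values)


def overall_average(sessions: Sequence[Sequence[float]]) -> float:
    """Mean of the per-session averages (each session weighted equally)."""
    return mean(session_average(s) for s in sessions)


toy_sessions = [
    [1.0, 1.0, 0.5, 0.5],  # a four-minute session
    [1.0, 0.0],            # a two-minute session
]
print(overall_average(toy_sessions))
```

Pooling all six minutes instead would give 4/6, so the choice of averaging unit matters when session lengths differ.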


4.2.2 Policy 2: Pairing with Restriction 
The Policy 2 simulation process is similar to Policy 1, with the 
addition of enforcing a varying PR rate. The PR rate specifies a 
percentage of students in the Tutor Pool as restricted from 
partnering with students in the Struggle Pool. We computed FI₁ 
with varying PR rates. As shown in Figure 2 (white area), FI₁ 
dropped as the PR rate increased, as expected. However, even with 
a relatively high PR rate of, for example, 0.4, meaning that 40% of 
non-struggling students are restricted from working with 
struggling students, we still get a high average FI₁ of around 0.80 
(i.e., 80% of struggling students could be paired). The simulation 
result means that teachers can afford to set moderate restrictions 
for pairings, without compromising too much of the pairing 
policy’s feasibility. 

4.2.3 Policy 3: Knowledge Difference Pairing 


Policy 3 requires students to be above a given minimum distance 
in their IK to be eligible for pairing up. We simulated this policy 
by computing FI₂ with varying values for the KD distance 
threshold α. We simulated these sessions to calculate the FI₂. As 
shown in Figure 3 (blue line), FI₂ dropped rather quickly as the 
required knowledge distance threshold went up. For example, the 
simulation results show that if we want to ensure an average 
paired ratio of 80%, the KD threshold should be set to less than 
approximately 0.1 (i.e., a very strict bar). 

Figure 3. FI₂ for Policies 3 and 4 


4.2.4 Policy 4: Knowledge Similarity Pairing 

Policy 4 requires students to be below a maximum distance in 
their IK to be eligible for being paired up. We simulated this 
policy and computed FI₂ with different values for the KD distance 
ceiling γ. We found that this policy would work well even with a 
low, strict ceiling for the knowledge distance (Figure 3, red line). 
For example, when γ was 0.1 (i.e., two students’ knowledge 
distance can be at most 0.1 for them to be teamed up), the 
average FI₂ was still 0.81 (SD = 0.08) across the class sessions 
involved. When γ was set to above 0.3, 95% of students in class 
could find an eligible partner. 

4.3 Class-Level Variance Analysis 
We explored how the four pairing policies worked for different 
classes and whether the pairing policies should be adapted to 
class-level differences (RQ3). 


4.3.1 Class-level Differences 


Based on Echeverria et al.’s insight that pairing support for 
teachers should ideally be adaptable to different classroom 
contexts [30], we analyzed, first, if there were systematic 
differences between different classroom contexts, and second, if 
these differences relate to policy feasibility differences. The main 
context variables taken into account by our pairing policies are 
students’ struggle status (Policies 1 and 2) and initial knowledge 
(Policies 3 and 4). We thus analyzed if the classes had different 
struggle statuses and initial knowledge (IK). 
Figure 4. ANSS (a) and ASSR (b) for Six Classes 


Student struggle status: We first calculated the number of 
students in wheel-spinning status for every minute within each 
class session. We then computed the ANSS (section 4.2.1) and the 
Average Struggling Students Ratio (ASSR) = ANSS / total number 
of students in the session. Histograms for ANSS and ASSR for all 
sessions show they follow the normal distribution. We conducted 
one-way ANOVAs, respectively taking the ANSS and ASSR as 
outcome variables and Class as the explanatory variable. The 
results showed a significant difference for ANSS among classes 
(Figure 4, a) [F(5,62) = 4.34, p < 0.001]. Post hoc Tukey tests 
showed C3 and C1 have significant differences (diff = 2.55, p < 
0.001). All post hoc pairwise tests conducted in this study were 
corrected for multiple comparisons. For ASSR, the ANOVA result 
indicated that the classes differed with marginal significance 
[F(5,62) = 1.94, p < 0.1] (Figure 4, b). Post hoc Tukey tests 
showed a marginal difference in the ASSR between C3 and C2 
(diff = 0.10, p = 0.08). Thus, there were class-level differences 
with respect to students’ struggle status. 
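
For readers who want to trace the test statistic, the sketch below computes a one-way ANOVA F value and its degrees of freedom in plain Python. It is a generic textbook computation, not the analysis script used in the study, and the toy data are invented. With six classes and 68 sessions it yields the degrees of freedom (5, 62) reported above.

```python
from statistics import mean
from typing import Sequence, Tuple


def one_way_anova_f(groups: Sequence[Sequence[float]]) -> Tuple[float, int, int]:
    """Return (F, df_between, df_within) for a one-way ANOVA.

    groups holds one sequence of outcome values per class, e.g. the
    per-session ANSS values of each class.
    """
    k = len(groups)
    n = sum(len(g) for g in groups)
    if k < 2 or n <= k:
        raise ValueError("need at least two groups and more values than groups")
    grand_mean = mean(x for g in groups for x in g)
    ss_between = sum(len(g) * (mean(g) - grand_mean) ** 2 for g in groups)
    ss_within = sum((x - mean(g)) ** 2 for g in groups for x in g)
    if ss_within == 0:
        raise ValueError("within-group variance is zero; F is undefined")
    df_between, df_within = k - 1, n - k
    f_value = (ss_between / df_between) / (ss_within / df_within)
    return f_value, df_between, df_within


toy_groups = [[1, 2, 3], [2, 3, 4], [5, 6, 7]]  # invented values
print(one_way_anova_f(toy_groups))
```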
Figure 5. Initial Knowledge for Six Classes 


Initial Knowledge: We calculated each student’s IK for all KCs 
involved (defined in section 3.1.3). The histogram for all students’ 
IK shows it follows the normal distribution. We then conducted 
a one-way ANOVA using IK as the outcome variable, and Class as 
the categorical explanatory variable. The results indicated a 
significant effect of class on IK for the six classes [F(5, 320) = 
5.895, p < 0.05]; that is, the IK for the six classes were not all equal. 
From the post hoc Tukey tests comparing knowledge level 
between each pair of the classes, we saw significant differences 
between classes C2 and C1 (diff = 0.12, p < 0.05), C3 and C2 (diff 
= -0.095, p < 0.05), and C4 and C2 (diff = -0.189, p < 0.05). C2 
had the highest median of student IK (Figure 5), and a 
significantly higher level of IK than C1, C3, and C4. 


Having characterized struggle and IK at a class level, we compare 
the policies’ feasibility across classes. 


4.3.2 Policies 1 and 2 

Policy 1 had an average FI₁ above 0.85 (Figure 6, green area). We 
statistically compared whether Policy 1 behaved differently for 
each class, to see whether this policy should be adaptable for each 
class. Using session as the unit of analysis, we conducted a one-way 
ANOVA using the FI₁ for each session as the outcome variable, 
and Class as the categorical explanatory variable. The results 
indicated that there was not a significant effect of class on FI₁ 
[F(5,62) = 1.24, p = 0.30]. This result showed that Policy 1 was 
relatively consistent across the six classes, suggesting that Policy 
1 may not need to be adaptable to classes. 


For Policy 2, with increasing PR rate, the FI₁ decreases at a 
different speed for different classes, indicating some degree of 
class-level difference (Figure 6, white area). We conducted 
ANCOVAs with Class being the categorical explanatory variable, 
the PR rate as the quantitative explanatory variable, and FI₁ being 
the quantitative outcome variable. We first compared the model 
with and without a Class × PR rate interaction term. The model 
comparison result showed no evidence of an interaction effect 
among explanatory variables (F = 1.63, p = 0.15). We thus 
performed an ANCOVA using an additive model. Results indicated 
there were eight pairs of classes that had significant differences in 
FI₁ for this pairing policy (p < 0.05). The eight pairs were C1-C2, 
C1-C3, C1-C6, C2-C3, C2-C6, C3-C5, C4-C6, and C5-C6. 
Figure 6. FI₁ of Policies 1 and 2 for Six Classes 


Next, we looked at possible relations between the class-level 
feasibility variance of policies 1 and 2, and the class-level 
differences in struggle status (section 4.3.1). We found that for 
classes that differed with respect to the number of struggling 
students (C3-C1) and the average ratio of student struggle 
(C3-C2), the feasibility of Policy 2 tended to differ as well. This 
finding suggests that 1) Policy 2 may benefit from being adaptable 
to class-level characteristics, and 2) variables characterizing a 
class’s struggle status (e.g., ANSS and ASSR) may have value in 
indicating how Policy 2 should be adaptable. On the other hand, 
the feasibility of Policy 2 was different in Class 6 compared to all 
other classes except C3, yet Class 6 did not differ in number or 
ratio of struggling students from other classes. Thus, students’ 
struggle status alone may not provide enough information to fully 
decide whether and how Policy 2 should be adaptable. 


4.3.3 Policies 3 and 4 

For Policy 3, the classes shared a downward trend in FI₂, with 
different slopes for each class (Figure 7). For example, when the 
KD was 0.1, we saw the FI₂ values for class 6 (green dotted line) 
drop to as low as 50%, but the other five classes have FI₂ above 
75%. This shows that policy feasibility may be differently 
affected by the knowledge heterogeneity threshold in each class. 

Figure 7. FI₂ of Pairing Policy 3 for Six Different Classes 


To test whether Policy 3 behaved differently for each class, we 
conducted ANCOVAs with Class as a categorical explanatory 
variable, KD as a quantitative explanatory variable, and FI₂ in 
each session as the outcome variable. We first compared the 
model with and without a Class × KD interaction term. Results 
indicated no evidence supporting the interaction effect (F = 0.81, 
p = 0.54). We performed an ANCOVA using an additive model. 
Results indicated that three pairs of classes had significant 
differences in FI₂ (p < 0.05), and that two pairs of classes were 
marginally different (p < 0.1). They were C1-C3, C3-C5, C3-C6 
(p < 0.05) and C2-C5, C2-C6 (p < 0.1). 

Figure 8. FI₂ of Policy 4 for Six Different Classes 


For Policy 4 (Figure 8), the six classes were more convergent and 
clustered closer together than for Policy 3 (Figure 7). This indicated 
the class-level difference may not be as strong as that in Policy 3, 
which our ANCOVA tests confirmed. Similar to Policy 3, we 
tested whether Policy 4 behaved differently for each class. We 
conducted an ANCOVA, with Class as a categorical explanatory 
variable, KD as a quantitative explanatory variable, and FI₂ as the 
outcome variable. We first compared the model with and without 
a Class × KD interaction term. We found no evidence supporting 
an interaction effect among explanatory variables (F = 0.13, 
p = 0.99). We then performed an ANCOVA using an additive 
model. Results indicated that there were no significant differences 
in FI₂ (p > 0.05) for Policy 4. We confirmed a smaller class-level 
difference as compared to Policy 3, in KD’s effect on policy 
feasibility. From this result, we conclude Policy 4 performed quite 
consistently across classes, and no significant evidence showed 
that Policy 4 should be adaptable to classes. 


Analogous to policies 1 and 2, we then looked at relations 
between feasibility variance for policies 3 and 4 and class-level IK 
characteristics in Section 4.3.1. We observed significant 
differences in IK between C2-C1, C3-C2, and C4-C2. However, 
the differences in IK for two classes cannot accurately predict 
whether they had different feasibility in Policy 3 and Policy 4, and 
other classroom characteristics may be needed to accurately 
represent the class-variance of feasibility. 


4.4 Analysis of Contrasting Cases 

We conducted a case study to understand how policies may 
perform dynamically (e.g., across every minute during class time) 
and differently in different class sessions (RQ3). For every policy, 
we selected a typical case and an extreme case in terms of the 
policy feasibility simulation results. For the typical case for all 
four policies, we selected a session (Session 1, C1) that had an 
average length of time (i.e., 41 minutes) and an average number of 
students (i.e., 13 students). In this session, the policies performed 
typically (as judged by visually comparing the simulation results of 
each policy for all sessions). As for the extreme case, we examined the 
simulation results for each policy on each session, and identified 
different sessions where each policy performed surprisingly or 
differently from the common trend. The extreme case can be a 
worst case scenario (Policies 1, 2 and 3) or a case that works 
surprisingly well (Policy 4). Below, we present the analysis and 
results for these contrasting cases for each policy. 


4.4.1 Policy 1 

In Policy 1, we chose the extreme case (Session 19, C3) as it was 
the session on which this policy performed worst, and thus it 
had the most different FI₁ trend, as judged from examining 
visualizations of FI₁ for all sessions involved. We compare the 
typical case and 
extreme case by first contextualizing the struggle status of the two 
cases, and comparing the visualization of feasibility (for each 
minute) in the two sessions. Figure 9 depicts the ratio of 
struggling students (among all students in the class session) for 
the contrasting cases. For Policy 1 simulation (Figure 10), we 
obtained, for every minute in the class session, three values 
regarding policy feasibility: the number of students who were not 
wheel-spinning on any KCs (green bar), the number of students 
who were struggling, and had a potential partner (yellow bar), and 
the number of students who were struggling and did not have a 
potential partner (red bar). 
Figure 9. Struggle Ratio of a Typical (a) and Extreme (b) Case 

Typical Case. In the typical case, students started to struggle after 
10 minutes, as shown in Figure 9 (a) and Figure 10 (a). In all 
instances of wheel-spinning, a potential partner was available (i.e., 
FI₁ = 1 for any minute in this session). The typical case aligned 
with the overall simulation results from Policy 1, which showed 
that, for an average class, most struggling students could find 
potential partners the minute they struggled. 


Extreme Case. Among all sessions in our dataset, the per-minute 
struggle ratio rarely goes over 50%. By contrast, the extreme case 
session had more struggling students than non-struggling students 
in 27 out of 46 minutes, indicated by a struggle ratio of above 0.5, 
as shown in Figure 9 (b). This resulted in lower feasibility for 
Policy 1. The extreme case differs from the typical case in two 
aspects. First, unlike the typical case, almost as soon as the class 
began, students started wheel-spinning. Second, there were 
wheel-spinning students without potential partners in almost every 
minute of the session (indicated by red bars in Figure 10 (b)). 
Figure 10. Policy 1 for a Typical (a) and an Extreme Case (b) 

4.4.2 Policy 2 


As in Policy 1, we chose this extreme case (Session 19, C3) 
because this policy had the worst performance on this session. This 
session also had the most different FI₁ trend. We simulated Policy 
2 and calculated FI₁ for every minute in the two contrasting class 
sessions. Figure 11 shows, for the typical and extreme cases, how 
FI₁ changed when different PR rates were simulated. We plotted 
four different PR rates in the figure. 


Typical Case. We saw two patterns in the Policy 2 simulation for 
the typical case. Firstly, the policy was typically robust in 
maintaining high feasibility with a non-zero (albeit low) PR rate. 
In Figure 11(a), lines with PR rate 0.1 and PR rate 0 completely 
overlapped. With these PR rates, there were no instances of 
struggle without a potential partner (i.e., feasibility was 1 across 
the whole session). Secondly, when the PR rate was high (0.5 or 
0.8), FI₁ exhibited a sharp decrease when there was an increase in 
student struggle. For instance, in Figure 9 (a) at minute 19, the 
struggle ratio increased from 0.07 to 0.23, as the number of 
wheel-spinning students went from 1 to 3. In Figure 11 (a) at the 
same time (t = 19 min), we saw a sharp decrease in FI₁ when the 
PR rate was 0.8. 


Extreme Case. As shown in Figure 11 (b), the extreme case
exhibited very different patterns compared to the typical case,
mainly in three aspects. First, given that it had a higher struggle
ratio, even when there was no pairing restriction (i.e., PR rate = 0),
FI was not always 1 or even close to 1, as it was in
the typical case. Second, even a slight PR rate of 0.1 further
worsened the policy feasibility and lowered FI, unlike the
typical case, which showed resistance to a low PR rate. Third, if a
class had a higher struggle ratio, the PR rate had a stronger effect
on worsening FI than for a session that had a lower struggle ratio.
This effect was especially prominent when the PR rate was high
(e.g., 0.5 or 0.8). This contrast means that instructors may be
able to set a higher PR rate without affecting FI too much
for a common session that has a moderate struggle ratio.
However, instructors may need to consider lowering the PR
rate for a high-struggle session.
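
The interaction between struggle ratio and PR rate can be illustrated with a minimal sketch. This is not the SimPairing implementation. It assumes that feasibility at one time step is the share of struggling students who can be matched one-to-one with a non-struggling partner, and that each candidate pairing is forbidden independently with probability equal to the PR rate. The function names and the use of a standard augmenting-path bipartite matching are illustrative choices.

```python
import random


def max_matching(allowed):
    """Size of a maximum bipartite matching (augmenting-path algorithm).

    allowed[i] lists the helper indices that struggling student i may be
    paired with.
    """
    helper_of = {}  # helper index -> struggling student currently assigned

    def try_assign(i, seen):
        for h in allowed[i]:
            if h in seen:
                continue
            seen.add(h)
            # Take a free helper, or re-route the student who holds it.
            if h not in helper_of or try_assign(helper_of[h], seen):
                helper_of[h] = i
                return True
        return False

    return sum(try_assign(i, set()) for i in range(len(allowed)))


def feasibility(n_struggling, n_helpers, pr_rate, rng):
    """Share of struggling students that can be paired at one time step."""
    if n_struggling == 0:
        return 1.0  # assumption: a step without struggle is fully feasible
    # Assumption: each (struggling, helper) pairing is forbidden
    # independently with probability pr_rate.
    allowed = [
        [h for h in range(n_helpers) if rng.random() >= pr_rate]
        for _ in range(n_struggling)
    ]
    return max_matching(allowed) / n_struggling


def mean_feasibility(n_struggling, n_helpers, pr_rate, runs=500, seed=0):
    """Average feasibility over repeated random restriction draws."""
    rng = random.Random(seed)
    total = sum(
        feasibility(n_struggling, n_helpers, pr_rate, rng) for _ in range(runs)
    )
    return total / runs
```

Comparing, for example, `mean_feasibility(3, 10, 0.8)` with `mean_feasibility(6, 7, 0.8)` under these assumptions shows how the same PR rate weighs more heavily on a session with a higher struggle ratio.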
Figure 11. Policy 2 for a Typical (a) and Extreme (b) Case

4.4.3 Policy 3 


In Figure 12 we present the results of the Policy 3 simulation on
two contrasting cases, plotting FI for every step of the knowledge
distance threshold for that session. The extreme case was chosen
for having the most different FI trend, based on examining
visualizations of FI for all sessions involved.


Typical Case. As shown in Figure 12 (a), for the typical case,
FI dropped gradually as the required KD threshold increased,
which aligned with the overall simulation result. To pair students
based on different knowledge (Policy 3), instructors need to
balance the required heterogeneity (i.e., a higher knowledge
distance threshold) and the desired paired ratio of the whole class.
In this typical case, if a teacher selects a threshold of 0.5 or
higher, none (0%) of the students in the class session would be paired.

Figure 12. Policy 3 for a Typical (a) and Extreme (b) Case

Extreme Case. As shown in Figure 12 (b), for the extreme case
(Session 1, C5), while the downward trend was similar, we
observed a more rapid decrease than in the typical case.
Specifically, FI dropped to only 20% when the KD threshold
was as low as 0.2, compared to an FI of 60% at the same KD
threshold in the typical case. This comparison indicates that some
class sessions are more heavily influenced by the required
knowledge distance threshold, and that the effect may
differ from session to session.


4.4.4 Policy 4 


From the previous analyses, we noted that Policy 4 performed
reliably and similarly across classes, making it harder to select an
extreme case or a worst-case scenario. Instead, we selected a session
where Policy 4 performed surprisingly well (Session 2, C3). In
Figure 13 we visualize the Policy 4 simulation on two contrasting
cases, plotting FI for every step of the knowledge distance ceiling
y for that session.


Typical Case. The tradeoff between knowledge homogeneity and
policy feasibility was less prominent than under Policy 3. This
means that instructors can afford to choose a stricter (i.e., lower)
ceiling so that paired students have a very small knowledge distance,
and still achieve high feasibility (FI). For example, in Figure 13 (a),
we saw that even if the instructor chose a very strict ceiling of y
= 0.1, nearly 95% of students were able to find a potential partner.

Figure 13. Policy 4 for a Typical (a) and Extreme (b) Case


Extreme Case. In Figure 13 (b), even when the KD ceiling was
set to 0 (which means students must have the same level of
mastery to be paired up), 40% of the students could still be paired,
unlike typical cases, where usually no two students have the exact
same IK. Another noteworthy distinction is that, while the typical
case did not reach FI = 1 with any KD ceiling, the extreme case
successfully paired all students (FI = 1) with a relatively low
ceiling of 0.2.
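
The contrast between the threshold of Policy 3 and the ceiling of Policy 4 can be explored with a small sketch. It is an illustration under stated assumptions, not the simulation used in this study: each student is represented by a single knowledge estimate (such as IK), knowledge distance is the absolute difference between two estimates, and feasibility is the share of students who end up in a pair. Both pairing routines are simple heuristics on the sorted estimates, and their names are hypothetical.

```python
def paired_share_similar(knowledge, ceiling):
    """Policy-4-style pairing: partners may differ by at most `ceiling`."""
    s = sorted(knowledge)
    paired, i = 0, 0
    while i + 1 < len(s):
        if s[i + 1] - s[i] <= ceiling:
            paired += 2
            i += 2  # both neighbours are used up
        else:
            i += 1  # s[i] stays unpaired
    return paired / len(s) if s else 0.0


def paired_share_different(knowledge, threshold):
    """Policy-3-style pairing: partners must differ by at least `threshold`."""
    s = sorted(knowledge)
    n = len(s)
    for k in range(n // 2, 0, -1):
        # Pair the k lowest students with the k highest, in sorted order.
        if all(s[n - k + i] - s[i] >= threshold for i in range(k)):
            return 2 * k / n
    return 0.0


if __name__ == "__main__":
    # Hypothetical initial-knowledge values for one small class.
    ik = [0.10, 0.15, 0.20, 0.40, 0.45, 0.80]
    for kd in (0.0, 0.1, 0.2, 0.5):
        print(kd, paired_share_different(ik, kd), paired_share_similar(ik, kd))
```

Sweeping the distance parameter in this way mirrors the per-step plots in Figures 12 and 13: a tightly clustered class tends to lose pairs quickly as the Policy 3 threshold rises, while it keeps most pairs even under a strict Policy 4 ceiling.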


5. DISCUSSION 


In line with previous LGF research [8], this work introduces four
dynamic LGF policies contextualized in ITS and grounded in user
research with K-12 teachers [10,30]. In this section, we discuss
our main findings for the research questions, grounded design
recommendations for pairing orchestration tools, and future
research directions for dynamic LGF.


5.1 Main Findings for Research Questions 
Regarding the feasibility of the pairing policies (RQ1) and how 
the feasibility may depend on parameters of the pairing policies 
(RQ2), we found that averaged across time and sessions, it is 
generally feasible (93.6%) to team up struggling students with 
non-struggling students, the minute they struggle (Policy 1). This 
result remains true even when a high percentage of students is 
deemed ineligible for being teamed up with struggling students 
(Policy 2). Specifically, the average feasibility remains above 80% 
of struggling students across all sessions unless the pairing 
restriction rate is above 40%. However, as we see in our case 
study, there can be sessions and moments with high struggle ratios 
(hence, low feasibility) when using Policy 1. Relatedly, sessions 
with very high struggle seem more susceptible to the influence of 
the PR rate in Policy 2 than a typical session. 


When pairing students based on whether their knowledge levels
are different (Policy 3) or similar (Policy 4), the policy feasibility
is highly dependent on the required KD. For Policy 3, there is a
tradeoff between the desired heterogeneity (i.e., the knowledge
distance threshold) and the policy’s feasibility. This means
instructors cannot set a high threshold for the KD if they want to
pair most students. In Policy 4, the corresponding tradeoff
(between homogeneity in knowledge and policy feasibility) is less
prominent. Instructors may choose a stricter ceiling for students’
similarity in knowledge levels and still achieve high policy
feasibility. In the case study, we found that the policy feasibility in
different sessions can be influenced differently by the required
KD threshold or ceiling, depending on how closely clustered
students’ IK is. For a given, fixed KD threshold (ceiling),
a class of students with closely clustered IK may have higher
feasibility for Policy 4 and lower feasibility for Policy 3.
Presumably, the feasibility of these policies also depends on class
size. For example, from our analysis, we hypothesize that for
larger classes, the feasibility of pairing policies may change less
drastically when the policy parameters change or as the class
progresses.


Regarding policy feasibility by class (RQ3), our results show no 
significant difference among classes for Policy 1 (Struggle with 
non-struggle) or Policy 4 (Knowledge similarity pairing). 
However, we observed significant differences among classes for 
Policy 2 (Pairing with restriction) and Policy 3 (Knowledge 
difference pairing). Although different classes have significantly 
different initial knowledge and struggle status, these differences in 
IK and struggle status are not always correlated with the 
feasibility of policies for that class. For example, classes that have 
different IK may not always have different feasibility for Policy 3 
or 4. 


5.2 Recommendations for Tool Design 

The current study aims to inform the design of an orchestration 
tool that can help pair students dynamically. We aim to lessen 
teachers’ orchestration load when managing fluid social 
transitions. Such a tool plays a key role in our vision for the smart 
classroom of the future, in which students alternate fluidly 
between individual and collaborative learning. Here, we highlight 
three design implications grounded in findings from the current 
work. These design implications may inform tools that aim to help
teachers manage fluid social transitions and ensure the feasibility
of dynamic LGF policies. They may also offer inspiration, more
broadly, for orchestration tools that aim to team up students in
CSCL.


Firstly, technology could be used to automatically adjust the
parameters used in LGF policies. Our study suggests that the four
pairing policies studied provide a promising foundation for an
orchestration tool, but that dealing with a wide range of
circumstances requires more flexibility than any individual policy
provides. While some policies explored in this study (e.g., Policies
1 and 2) have a good chance of working well during many class
sessions, any given instantiation of a policy (with fixed parameter
settings) does not fully deal with class variability and extreme
cases. One way to compensate might be to have the tool automatically
loosen policy parameters as needed. For example, the tool may
gradually loosen the KD threshold or ceiling for Policies 3 and 4
when it senses the pairing feasibility to be low.
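
As a rough sketch of such automatic loosening, the function below walks through candidate parameter values ordered from strictest to loosest and stops at the first one whose estimated feasibility reaches a target. The function name, the `target` parameter, and the idea of passing in a feasibility estimator are assumptions made for illustration; the study does not prescribe a specific procedure.

```python
def loosen_until_feasible(feasibility_fn, candidates, target):
    """Pick the strictest parameter value that meets a feasibility target.

    feasibility_fn: maps a parameter value (e.g., a KD ceiling) to an
        estimated feasibility in [0, 1], for instance from a simulation.
    candidates: parameter values ordered from strictest to loosest.
    target: minimum acceptable feasibility.

    Returns (value, feasibility). If no candidate meets the target, the
    loosest candidate and its feasibility are returned.
    """
    chosen = None
    for value in candidates:
        feas = feasibility_fn(value)
        chosen = (value, feas)
        if feas >= target:
            break
    if chosen is None:
        raise ValueError("candidates must not be empty")
    return chosen
```

For a Policy 4 ceiling the candidates would increase (for example `[0.0, 0.1, 0.2]`), while for a Policy 3 threshold they would decrease, since a lower threshold is the looser setting.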


Secondly, technology could use multiple LGF criteria in
cascading fashion to achieve high feasibility. Specifically, the
tool may start out using the ideal pairing policies, and then
iteratively try looser criteria if the previous one fails to pair
up all students. For example, the tool may first attempt to team up
students based on struggle on specific KCs, a criterion that is
more specific (and restrictive) than Policy 1, but one that could
potentially be more effective for helping struggling students. If
that fails, then it might pair up students based on their general
struggle (Policy 1). If that fails again, then the tool could try to
pair students based on knowledge distance. The tool could also
customize its pairing criteria, for example by using the student
characteristics that make the most sense in the given classroom
context.
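
The cascading idea can be sketched as follows. This is an illustrative outline under stated assumptions: each criterion is a function that proposes pairs among the students who are still unpaired, and criteria are tried from most to least restrictive. The helper names (`cascade_pairing`, `pair_struggling_with_helper`, `pair_in_order`) are hypothetical and do not come from an existing tool.

```python
def cascade_pairing(students, criteria):
    """Apply pairing criteria in order, from most to least restrictive.

    students: list of student ids.
    criteria: list of (name, pair_fn); pair_fn receives the students that
        are still unpaired and returns a list of (a, b) pairs among them.

    Returns (pairs, leftover); each pair is tagged with the name of the
    criterion that produced it.
    """
    unpaired = list(students)
    pairs = []
    for name, pair_fn in criteria:
        if len(unpaired) < 2:
            break
        for a, b in pair_fn(list(unpaired)):
            # Ignore proposals that reuse a student who is already paired.
            if a != b and a in unpaired and b in unpaired:
                unpaired.remove(a)
                unpaired.remove(b)
                pairs.append((a, b, name))
    return pairs, unpaired


def pair_struggling_with_helper(struggling):
    """Policy-1-style criterion: each struggling student gets a helper."""
    def pair_fn(pool):
        needy = [s for s in pool if s in struggling]
        helpers = [s for s in pool if s not in struggling]
        return list(zip(needy, helpers))
    return pair_fn


def pair_in_order(pool):
    """Loosest fallback: pair whoever is left, two at a time."""
    return list(zip(pool[0::2], pool[1::2]))
```

A KC-specific or knowledge-distance criterion would slot into the `criteria` list in the same way, ahead of or behind the general struggle criterion.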


Lastly, technology could be used to recommend LGF policies or 
policy parameters with high feasibility to teachers. Instead of 
relying solely on the teachers to make pairing decisions, the tool 
may adopt SimPairing to automatically calculate and maximize 
policies’ feasibility based on classroom contexts and recommend 
them to teachers. For example, if the tool determined, using 
historical transaction data, that students in a given class have 
consistently low struggling ratios and fewer wheel-spinning 
students than non-wheel-spinning ones, it may advise that 
teachers adopt pairing Policy 1 as it has high feasibility. In 
addition, our findings open up the potential for the tool to help 
teachers make informed decisions about parameter configuration, 
by notifying them of expected feasibility. For example, if a 
teacher severely restricts the acceptable pairings, the tool could 
alert teachers of the low feasibility of the pairing policy, and ask if 
the teacher might want to loosen the restrictions. Running such 
simulations and providing notification can inform the teachers 
about the outcomes of the policy feasibility in their classroom, 
prior to implementing them. This may prevent teachers from 
choosing policy configurations that are misaligned with their 
goals. 


5.3 Future Research Directions of Dynamic LGF 
Building on our investigation, we outline four potential directions 
for how future research could further explore dynamic LGF 
policies. 


Firstly, in addition to inter-session pairing based on knowledge 
distance (Policies 3 and 4), future work could explore 
intra-session grouping based on knowledge level, which allows 
forming pairings during the learning process. Researchers should 
explore how it would differ from our inter-session grouping and 
which approach better supports teachers’ needs. 


Secondly, the current policies identify students as
wheel-spinning if they struggle on any of the KCs. Future work
could explore whether teachers prefer to pair students based on
their KC-specific struggle status. For example, to help a student
struggling on the KC “combine constant terms” in equation solving,
teachers may prefer to find a partner who has already mastered the
same KC, or at minimum is not struggling on the same KC; they
may (or may not) find it acceptable if the partner is
struggling on another KC, e.g., “divide by variable coefficient”.
Relatedly, teaming up students who are both struggling, but
struggling on different knowledge components, may have
benefits. Such a pair of students may have complementary
knowledge and strengths, and may help each other get unstuck and
stop wheel-spinning. Such a pairing criterion opens up good
opportunities for tutor-tutee role-switching and mutual peer
tutoring.


Thirdly, analogous to pairing based on KC-specific struggle 
status, instead of using the mean of students’ mastery on different 
KCs to represent their knowledge, future work could explore to 
what extent pairing students based on KC-specific knowledge 
distance can be more effective, feasible, or preferable for teachers. 
KC-specific knowledge pairing might be useful for Policy 3, if 
teachers want two students who have very different skill levels on 
one specific KC so that the one with higher mastery on that KC 
can tutor the one with lower mastery. 


Lastly, in addition to knowledge level and struggle status, which
this work investigated for dynamic grouping, future work can
investigate other student characteristics (e.g., history of
collaborative episodes, preferences for working individually or
collaboratively) or other sources for knowledge level (e.g., exam
or quiz scores) for dynamic pairing. It may also be especially
promising to further study pairing based on dynamic student
behaviors that an ITS can detect in real time from interaction
data, to allow fluid social transitions and dynamic pairing.


5.4 Limitations 

There is uncertainty in the SimPairing process in that we do not 
have a good way of estimating how long any given collaborative 
episode will last. Thus, SimPairing does not simulate students’ 
being unavailable for pairing while they are working 
collaboratively, until they finish the collaborative episode. There 
is some reason to think that the resulting inaccuracy in the 
feasibility results is not severe, as argued, but we do not have a 
good way of investigating that issue in depth. Additionally, 
feasibility of pairing policies, while important, is just one piece of 
the puzzle. It is important, as well, to understand if students learn
better with these pairing policies (effectiveness). Future research 
should validate these pairing policies in classroom studies, testing 
both their effectiveness and feasibility. 


6. CONCLUSION 


We study the feasibility of pairing policies in the context of ITS,
to inform the design of a tool for orchestrating fluid transitions
between individual and collaborative learning. Our findings show
that dynamically pairing students based on their
in-the-moment wheel-spinning status results in good pairing
feasibility for struggling students on average, even with moderate
restrictions on the allowed pairings. We also found that the
trade-off between the required knowledge distance and the policy
feasibility is more prominent in heterogeneous grouping than in
homogeneous grouping. However, any given instantiation of a
policy (with fixed parameter settings) does not fully deal with
class variability and extreme cases, as policies have different
feasibility for different classes and sessions. This suggests that
optimization for policy feasibility (e.g., through gradually
loosening parameters) or classroom customization needs to be
taken into consideration. Methodologically, this research extends
previous work (e.g., Replay Enactments) that used authentic data
and algorithms as design materials to augment designers’
intuitions for designing future tools [27].


This work has several novel elements. First, using the SimPairing 
approach, our work explores the feasibility of LGF policies 
derived from user research with math teachers. In addition, to the 
best of our knowledge, this is the first study that considers 
students’ in-the-moment wheel-spinning status in dynamic pairing 
policies. Finally, our work addresses a gap in the literature for 
dynamic intra-session LGF [1] and envisions how instructors 
and/or an orchestration tool will customize pairing policies and 
parameters to specific classroom contexts, which prior work 
argued to be especially helpful in the LGF process [1,8]. 


In sum, theoretically, this work addresses a gap in the literature
by investigating the feasibility of user-centered dynamic pairing
policies. Practically, we contribute grounded design directions for
pairing orchestration tools, and SimPairing as an approach to
evaluating dynamic LGF policies, which may generalize to other
online educational software that has transaction data.


ABSTRACT


State-of-the-art knowledge tracing approaches mostly
model student knowledge using their performance in assessed
learning resource types, such as quizzes, assignments,
and exercises, and ignore the non-assessed learning resources.
However, many student activities are non-assessed, such as
watching video lectures, participating in a discussion forum,
and reading a section of a textbook, all of which potentially
contribute to the students’ knowledge growth. In
this paper, we propose the first deep learning based
knowledge tracing model (DMKT) that explicitly models
students’ knowledge transitions over both assessed and
non-assessed learning activities. With DMKT we can discover
the underlying latent concepts of each non-assessed and
assessed learning material and better predict student
performance in future assessed learning resources. We compare
our proposed method with various state-of-the-art knowledge
tracing methods on four real-world datasets and show its
effectiveness in predicting student performance, representing
student knowledge, and discovering the underlying domain
model.


Keywords 

Knowledge Tracing, Multiple Learning Resource Types,
Non-Assessed Learning Resources, Memory Augmented Neural
Networks, Domain Knowledge Modeling, Student Knowledge
Modeling


1. INTRODUCTION 


As the education landscape shifts toward distance learning,
online learning systems advance in complexity and capacity.
They can handle more students, evaluate students
through different kinds of assessments, and offer various
types of learning resources to them. In such systems, a
student can study a reading section, take a quiz, watch a
video lecture, and practice programming in an embedded
development environment. As a result, students learn from
heterogeneous types of activities in modern online learning
systems, among which some can be assessed and some cannot.


Despite this heterogeneity in learning resource types, current
student knowledge tracing models mostly focus on assessed
learning resources, ignoring the non-assessed ones. In the
assessed learning resource types, such as quizzes and assignments,
students’ performance can be evaluated given their
answers and solutions. These kinds of learning resources
provide a window to student knowledge through observing
their performance. Conversely, in the non-assessed learning
resources, such as readings and video lectures, such an
observation does not exist. Hence, evaluating student knowledge
and performance while interacting with these learning
resources is a difficult task [10].


Indeed, because current knowledge tracing approaches do
not model non-assessed learning resources, identifying their
underlying concepts, finding the similarities between these
learning resources, and, in general, domain knowledge modeling
for such non-assessed learning materials is still a challenging
problem. That is, many modern knowledge tracing
models do not rely on a predefined domain knowledge model,
such as a Q-matrix, and can identify the “latent concepts”
that are being evaluated in problems, quizzes, or assignments
[9]. This is particularly useful when
annotating learning materials with their concepts is expensive
or infeasible. However, discovering such latent concepts
in non-assessed learning resources is an under-explored
research area. Some recent works have aimed at identifying
such latent concepts and similarities between assessed
and non-assessed learning materials. However, their
findings were based on static student performance,
ignoring the sequential learning data of students.


In this paper, we argue that modeling non-assessed learning
materials is essential and indispensable in tracing student
knowledge. Students learn from all types of activities, and
ignoring a large portion of student activities is a missed
opportunity in student knowledge tracing, especially since
previous research has shown that working with various learning
activity types has considerable benefits for student learning
[12]. Hence, modeling both assessed and non-assessed
learning activities should result in a more accurate
estimation of student knowledge state and prediction of their
performance on future assessed learning resources.


Accordingly, we propose the Deep Multi-type Knowledge Tracing
(DMKT) model, which not only traces student knowledge
states over various learning activity types but also provides
a feasible solution to discovering underlying patterns
or concepts for both assessed and non-assessed learning
resources. To this end, DMKT estimates student knowledge
gain between every two consecutive assessed learning activities
according to student performance on them. At the same
time, it distributes this estimated knowledge gain among the
in-between non-assessed learning activities and the latest
assessed activity. We use an attention mechanism for this
distribution. As a result, DMKT can model the underlying
latent concepts for each of the assessed and non-assessed
learning resources, evaluate student knowledge after interacting
with these learning resources, and predict student
performance on the assessed ones.


We evaluate our proposed model on four real-world datasets,
showing the significant effect of modeling various learning
resource types on the task of student performance prediction.
Also, we showcase the interpretability of DMKT by
visualizing student knowledge while working with various
learning resource types. Finally, we demonstrate the power
of DMKT in discovering the learning resources’ similarities
and underlying latent concepts.


2. RELATED WORK 


Student knowledge tracing aims to capture the student’s
knowledge state and knowledge state transition patterns,
which could be further used for tasks like student performance
prediction, intelligent curriculum design, and interpretation
and discovery of structure in student tasks.


Traditional knowledge tracing methods modeled knowledge
transition on assessed learning resources using predefined
domain knowledge models (concepts of learning resources).
For example, Drasgow et al. proposed IRT, which uses
structured logistic regression to model students’ dichotomous
responses and estimates student ability and learning
resource difficulty [3]. BKT uses binary variables to model
whether a student has acquired a concept or not, and a Hidden
Markov Model is used to update the probability that the
student answers a question correctly [6] [22]. However, since
annotating a domain knowledge model can be expensive and
time-consuming, in many real-world scenarios such predefined
domain knowledge models are not provided. To
solve this problem, newer approaches model student knowledge
and domain knowledge at the same
time. For example, Lan et al. utilized matrix factorization
to model student knowledge and the concept-question
association, assuming a sparse association between concepts
and questions [14]. As another example, Doan et al.
model student learning with a tensor factorization in which
a rank-based constraint encourages student knowledge to
follow an increasing trend [7].
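
For readers less familiar with these classical models, the sketch below shows their textbook forms: the one-parameter logistic (Rasch) IRT response probability and one step of the standard BKT update. The parameter names (`p_learn`, `p_guess`, `p_slip`) follow common usage and are not taken from this paper, and the code assumes all probabilities lie strictly between 0 and 1.

```python
import math


def irt_1pl(ability, difficulty):
    """Probability of a correct response under one-parameter logistic IRT."""
    return 1.0 / (1.0 + math.exp(-(ability - difficulty)))


def bkt_predict(p_known, p_guess, p_slip):
    """Probability that the next answer is correct under BKT."""
    return p_known * (1.0 - p_slip) + (1.0 - p_known) * p_guess


def bkt_update(p_known, correct, p_learn, p_guess, p_slip):
    """One BKT step: condition on the observed answer, then apply learning."""
    if correct:
        evidence = p_known * (1.0 - p_slip)
        posterior = evidence / (evidence + (1.0 - p_known) * p_guess)
    else:
        evidence = p_known * p_slip
        posterior = evidence / (evidence + (1.0 - p_known) * (1.0 - p_guess))
    # The student may also learn the concept after this practice opportunity.
    return posterior + (1.0 - posterior) * p_learn
```

Both models require each question to be mapped to a concept or item parameter in advance, which is the annotation cost that the factorization-based and deep approaches discussed here try to avoid.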


At the same time, in the past few years, with the advance
of deep neural networks, deep knowledge tracing methods
have emerged. For example, DKT utilizes an LSTM to
model students’ knowledge transitions over time. Recently,
transformer-based neural networks have been successfully
applied to model the knowledge transitions in different
students’ historical interactions with learning resources [9].
SAKT uses the self-attention mechanism to model
the interdependencies among interactions in the sequence.
In [23], Zhang et al. proposed a Dynamic Key-Value Memory
Networks based method (DKVMN), which integrates
memory-augmented neural networks with the attention
mechanism to exploit the relationships between underlying
concepts for better modeling of students’ skill acquisition.
Yeung et al. extended DKVMN by integrating the
one-parameter logistic item response theory to provide better
interpretability [21]. However, none of the deep knowledge
tracing models have focused on modeling the non-assessed
learning activities and tracing student knowledge on such
activities.


Knowledge Tracing using Multiple Learning Resource Types.
Previous approaches ignored the effect of learning activities
on non-assessed learning resources: none of the methods
mentioned above consider both assessed and non-assessed
learning resources at the same time. However, in reality,
students not only learn from practicing assessed learning
resources (such as questions) but also learn by studying
the non-assessed ones, such as watching video lectures, reading
textbooks, and discussing with others. One reason for
not modeling the non-assessed activities is that reliable
student performance observations are missing in these activities.
This makes modeling the knowledge transition from
these non-assessed learning activities difficult. To the best
of our knowledge, the only existing work that models
non-assessed learning activities along with the assessed ones is
the Multi-View Knowledge Model (MVKM) [25]. MVKM models
multiple learning resources jointly using tensor factorization
to capture latent student features and latent learning
resource concepts, assuming that latent concepts are shared
by different learning resource types. However, this method
can only capture the linear dependencies between variables,
as the latent student features and latent learning resource
concepts are multiplied via linear matrix and tensor products.
In addition, due to the large memory cost of
tensor factorization, MVKM cannot handle datasets
with very large numbers of students and learning resources.
Unlike MVKM, our proposed method considers
the non-linear relationships between variables and handles
large datasets, while modeling student knowledge gain from
multiple learning resource types (both assessed and
non-assessed).


3. DEEP MULTI-TYPE KNOWLEDGE TRACING (DMKT)


3.1 Problem Formulation 

A standard knowledge tracing (KT) problem is to predict
student performance or response on an upcoming question,
given the learner’s performance records on previously solved
questions. These records typically consist of a sequence of
questions and responses at each discrete time step, denoted
as a tuple $(q_t^s, r_t^s)$ for student $s$ at time step $t$. Since we
only discuss how to predict future performance for a single
student, we omit the superscript $s$ in the following sections.
Therefore, given a student’s past history records up to time
$t-1$ as $\{(q_1, r_1), \dots, (q_{t-1}, r_{t-1})\}$, the goal of KT is to
predict their response $r_t$ to question $q_t$ at the current time
step $t$.


In this paper, we aim to incorporate students’ non-assessed
learning activities and model student knowledge transition
over both assessed and non-assessed learning resources, such
as solving quizzes, watching video lectures, viewing annotated
examples or hints, and participating in discussion forums.
Therefore, given a student’s past historical responses
to assessed learning materials as well as their past history of
non-assessed learning activities, we would like to estimate
student knowledge and predict their performance in the next
assessed learning resource. To do this, assuming $L$ distinct
non-assessed learning resources and $Q$ distinct assessed ones,
we represent students’ historical records up to time $t-1$
as $\{(q_1, r_1), \mathcal{L}_1, (q_2, r_2), \mathcal{L}_2, \dots, (q_{t-1}, r_{t-1}), \mathcal{L}_{t-1}\}$, in which
$\mathcal{L}_t = \{l_t^1, l_t^2, \dots, l_t^n\}$ at each time step $t$ denotes the
sequence of $n$ non-assessed learning activities (e.g., watching
video lectures) between the assessed activities (e.g., answering
questions) $q_t$ and $q_{t+1}$. Our goal is to predict student
performance on assessed learning material $q_t$ at each time
step $t$, model student knowledge at and between time steps
in interaction with $q_t$ and all $l_t$’s, and discover the underlying
latent concepts of assessed $q$’s and non-assessed $l$’s.
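
One way to picture this interleaved record is as a list of steps, each holding an assessed interaction and the non-assessed activities that follow it. The sketch below is only an illustration of the formulation; the `Step` class, the event encoding, and the choice to drop activities that occur before the first assessed item are assumptions, not part of DMKT.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Step:
    question: int   # id of the assessed resource q_t
    response: int   # r_t: 1 = correct, 0 = incorrect
    # Non-assessed resource ids between q_t and q_{t+1} (the set L_t).
    activities: List[int] = field(default_factory=list)


def build_steps(events):
    """Group a chronological event log into (q_t, r_t, L_t) steps.

    events: list of ("q", question_id, response) or ("l", resource_id).
    """
    steps = []
    for event in events:
        kind = event[0]
        if kind == "q":
            _, question, response = event
            steps.append(Step(question, response))
        elif kind == "l":
            # Assumption: activities before the first question are dropped.
            if steps:
                steps[-1].activities.append(event[1])
        else:
            raise ValueError(f"unknown event type: {kind!r}")
    return steps
```

With this layout, the prediction target at step t is `steps[t].response`, and the model input is everything in `steps[:t]` together with `steps[t].question`.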


3.2 The Base Model 

We base our DMKT model upon a recent successful deep
knowledge tracing model: DKVMN [23]. DKVMN is a special
type of memory-augmented neural network (MANN)
for knowledge tracing, which has one static key matrix to
store the knowledge concepts and one dynamic value matrix
to store students’ updated mastery levels of those
corresponding concepts. Assuming that there are $N$ latent
concepts $\{c^1, \dots, c^N\}$ for each learning resource, and each
latent concept can be represented by $d_k$-dimensional embeddings,
similar to DKVMN, DMKT has the key matrix $M^k$
of size $N \times d_k$ to store the $N$ knowledge concepts. Similarly,
the value matrix $M_t^v$ of size $N \times d_v$ stores the student’s
mastery levels of each concept at time step $t$.


However, DKVMN only supports updating knowledge states
$M_t^v$ on assessed learning materials, and lacks the ability to
leverage the abundance of data other than student responses
on assessed learning materials. To overcome this limitation,
our proposed DMKT updates $M_t^v$ with an additional
internal component that employs the attention mechanism
to process the non-assessed learning activities between any
two assessed ones, and uses the updated $M_t^v$ to predict the
student’s performance on the upcoming assessed learning resource.
This component has two functions: one is to update
the student knowledge state on non-assessed learning activities,
and the other is to summarize all activity contexts
before an assessed activity to help predict student
performance accurately.


One may think that a straightforward solution to integrate
the non-assessed learning resources would be to consider
them as student interaction features. However, since the
non-assessed learning activities are not explicitly represented
in such models, their contribution to student knowledge
cannot be assessed. Also, such an approach cannot model
a student’s knowledge transition between different non-assessed
learning activities. In the following, we introduce our novel
updating and summarizing functionalities that help DMKT
to model all learning activity types. An overview of DMKT’s
architecture can be found in Figure 1.


3.3 Learning Resource Attention Weights 

For simplicity of illustration, let us assume that there
is only one non-assessed learning activity, e.g., watching a
video lecture, between solving two problems $q_{t-1}$ and $q_t$,
that is, $\mathcal{L}_{t-1} = \{l_{t-1}\}$. DMKT assumes that student knowledge
gets updated as the student interacts with $l_{t-1}$ and $q_t$,
weighted by their corresponding attention weights. So, in
each step, DMKT uses attention weights from $q_t$ and $l_{t-1}$ to
update the student knowledge in the concepts’ embeddings,
$M_t^v$.


To compute the attention weights, DMKT first embeds all
questions into an embedding matrix $A^q \in \mathbb{R}^{Q \times d_k}$ and all
video lectures into another embedding matrix $A^l \in \mathbb{R}^{L \times d_k}$.
At each time step, DMKT extracts the embedding vector
of $q_t$ ($k_t^q \in \mathbb{R}^{d_k}$) from $A^q$, as well as the embedding vector
$k_{t-1}^l \in \mathbb{R}^{d_k}$ of $l_{t-1}$ from $A^l$. Then, it uses these embedding
vectors to query the key memory matrix $M^k$ to obtain the
attention weights $w_t^q(i)$ and $w_{t-1}^l(i)$ respectively as follows:


$$w_t^q(i) = \mathrm{Softmax}\left({k_t^q}^{\top} M^k(i)\right) \tag{1}$$


$$w_{t-1}^l(i) = \mathrm{Softmax}\left({k_{t-1}^l}^{\top} M^k(i)\right) \tag{2}$$


The attention weights in $w_t^q$ and $w_{t-1}^l$ can be viewed as
the correlation of question $q_t$ and lecture $l_{t-1}$, respectively,
with each of the $N$ latent concepts. Notice that $w_t^q(i)$
and $w_{t-1}^l(i)$ are the $i$-th elements of the attention weight
vectors $w_t^q$ and $w_{t-1}^l$ respectively, and for interpretability
purposes the attention weights sum to one
($\sum_i w_t^q(i) = \sum_i w_{t-1}^l(i) = 1$).
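The attention computation in Eq. (1) and Eq. (2) can be illustrated with a minimal sketch. It assumes the key memory is stored as N rows of length d_k and uses plain Python lists; the function names and the toy numbers are illustrative only and are not taken from the released DMKT code.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of scores."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def attention_weights(k, key_memory):
    """Correlation of one embedding k with each of the N key-memory slots.

    k          : embedding of a question or a lecture (length d_k)
    key_memory : N rows, each of length d_k (one row per latent concept)
    """
    scores = [sum(ki * mi for ki, mi in zip(k, row)) for row in key_memory]
    return softmax(scores)

# Toy example with N = 3 concepts and d_k = 2 (hypothetical values).
key_memory = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
k_q = [2.0, 0.0]
w_q = attention_weights(k_q, key_memory)
```

The same function would be called with the lecture embedding to obtain the lecture attention weights; in both cases the returned weights sum to one.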


3.4 Student Performance Prediction 

At each time step $t$, DMKT aims to predict the student's
performance on $q_t$. Since the predicted performance is a
result of student knowledge that is gained by interacting
with both problems and lectures, it is intuitive to aggregate
these knowledge gains and predict the student performance
accordingly. Remember that the memory value matrix
$M_t^v \in \mathbb{R}^{N \times d_v}$ is used to represent the student's knowledge
state on each concept embedding. So, to summarize the
student's mastery level of question $q_t$ and lecture $l_{t-1}$ in the $N$
concepts, we compute the weighted sum of all memory slots
in the value matrix using attention weight vectors $w_t^q$ and
$w_{t-1}^l$, respectively.

$$r_t^q = \sum_{i=1}^{N} w_t^q(i) M_t^v(i) \quad (3)$$

$$r_{t-1}^l = \sum_{i=1}^{N} w_{t-1}^l(i) M_t^v(i) \quad (4)$$


The source code is provided at: https://github.com/persai-lab/EDM2021-DMKT

Then, we concatenate the latent knowledge states or mastery
levels $r_t^q$ and $r_{t-1}^l$ on question $q_t$ and lecture $l_{t-1}$ with
question embedding $k_t^q$ as well as lecture embedding $k_{t-1}^l$
vertically and pass them into a fully connected layer with a
Tanh activation to obtain a summary vector $f_t$:

$$f_t = \mathrm{Tanh}\left(W_1^\top [r_t^q, r_{t-1}^l, k_t^q, k_{t-1}^l] + b_1\right) \quad (5)$$


where $[\cdot]$ denotes concatenation. This summary vector $f_t$
contains a summary of all information, such as student ability
and the relationship between question $q_t$ and lecture $l_{t-1}$,
to predict the student response at time $t$ accurately. Finally, the
student's performance on query question $q_t$ is calculated by
passing the feature vector $f_t$ through another fully connected
layer with a Sigmoid activation as follows:

$$p_t = \mathrm{Sigmoid}\left(W_2^\top f_t + b_2\right) \quad (6)$$
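The read and prediction steps of Eq. (3) to Eq. (6) can be sketched as follows. This is a minimal illustration in plain Python, assuming the weights are stored already transposed (one row per output unit); all names and the all-zero toy parameters are hypothetical and only serve to make the data flow visible.

```python
import math

def read(w, value_memory):
    """Weighted sum of the N value-memory slots (Eq. 3 and Eq. 4)."""
    d_v = len(value_memory[0])
    return [sum(w[i] * value_memory[i][j] for i in range(len(w)))
            for j in range(d_v)]

def dense(W, x, b):
    """Compute W x + b, where W is a list of output rows."""
    return [sum(wij * xj for wij, xj in zip(row, x)) + bi
            for row, bi in zip(W, b)]

def predict(w_q, w_l, k_q, k_l, M, W1, b1, W2, b2):
    """Summary vector (Eq. 5) followed by the sigmoid output (Eq. 6)."""
    r_q = read(w_q, M)
    r_l = read(w_l, M)
    features = r_q + r_l + k_q + k_l          # list concatenation
    f = [math.tanh(z) for z in dense(W1, features, b1)]
    logit = dense([W2], f, [b2])[0]
    return 1.0 / (1.0 + math.exp(-logit))

# Toy setting: N = 2 slots, d_v = 2, d_k = 1, summary size 3.
M = [[1.0, 0.0], [0.0, 1.0]]
r_demo = read([0.25, 0.75], M)
W1 = [[0.0] * 6 for _ in range(3)]            # untrained, all-zero weights
p = predict([1.0, 0.0], [0.0, 1.0], [0.5], [-0.5], M,
            W1, [0.0] * 3, [0.0] * 3, 0.0)
```

With all-zero parameters the prediction is exactly 0.5, which is the expected output of an uninformed sigmoid layer; trained parameters would move it toward 0 or 1.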


3.5 Student Knowledge Update 

DMKT tracks the student knowledge states by updating the
memory value matrix $M_t^v$ after each learning activity on $q_t$
and $l_t$ so as to predict student performance on $q_{t+1}$ using
the updated $M_{t+1}^v$.


For assessed learning activities, we first retrieve an embedding
vector of $(q_t, r_t)$, denoted by $v_t^q \in \mathbb{R}^{d_v}$, from a response
embedding matrix $B$ of size $2Q \times d_v$. This embedding $v_t^q$
contains the information about how much student knowledge
should be updated after working on question $q_t$ with
outcome $r_t$. We also use the erase-followed-by-add mechanism
to update the memory value matrix, that is, we erase the
memory first using an erase vector $e_t^q \in [0,1]^{d_v}$ before adding
new information with the add vector $a_t^q \in \mathbb{R}^{d_v}$. This update
of each value memory slot can be summarized as an erase
step and an add step as follows:

Erase Step:

$$e_t^q = \mathrm{Sigmoid}\left(E^\top v_t^q + b_e\right)$$
$$\tilde{M}_t^v(i) = M_{t-1}^v(i) \odot \left[\mathbf{1} - w_t^q(i)\, e_t^q\right] \quad (7)$$

Add Step:

$$a_t^q = \mathrm{Tanh}\left(D^\top v_t^q + b_d\right)$$
$$M_t^v(i) = \tilde{M}_t^v(i) + w_t^q(i)\, a_t^q \quad (8)$$

where $\mathbf{1}$ is a vector of all ones, and $\odot$ represents
element-wise multiplication.


For each non-assessed activity, we follow similar
erase-followed-by-add steps as in Eq. (7) and Eq. (8), except that we
use $k_t^l$ directly instead of a new response embedding.

Erase Step on Non-assessed Resources:

$$e_t^l = \mathrm{Sigmoid}\left(H^\top k_t^l + b_h\right)$$
$$\tilde{M}_t^v(i) = M_{t-1}^v(i) \odot \left[\mathbf{1} - w_t^l(i)\, e_t^l\right] \quad (9)$$

Add Step on Non-assessed Resources:

$$a_t^l = \mathrm{Tanh}\left(G^\top k_t^l + b_g\right)$$
$$M_t^v(i) = \tilde{M}_t^v(i) + w_t^l(i)\, a_t^l \quad (10)$$
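The erase-followed-by-add update used for both activity types can be sketched in a few lines. The sketch below assumes the erase vector and add vector have already been computed (from the response embedding for a question, or from the lecture embedding for a non-assessed resource); the function name and toy values are illustrative, not the authors' implementation.

```python
def erase_add(M_prev, w, e, a):
    """Update every value-memory slot with an erase step, then an add step.

    M_prev : N rows of length d_v (value memory before the activity)
    w      : attention weights of the activity over the N slots
    e      : erase vector with entries in [0, 1] (length d_v)
    a      : add vector (length d_v)
    """
    M_new = []
    for i, row in enumerate(M_prev):
        # Erase: scale each entry by (1 - w(i) * e_j).
        erased = [m * (1.0 - w[i] * ej) for m, ej in zip(row, e)]
        # Add: write new information, again weighted by w(i).
        M_new.append([m + w[i] * aj for m, aj in zip(erased, a)])
    return M_new

# Toy example: the activity attends only to slot 0.
M_prev = [[1.0, 1.0], [2.0, 2.0]]
M_new = erase_add(M_prev, w=[1.0, 0.0], e=[0.5, 0.0], a=[0.25, 1.0])
```

A slot with zero attention weight is left unchanged, which is why an activity only modifies the concepts it is correlated with.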


3.6 Network Architecture and Extension 

The neural network architecture of DMKT is shown in
Figure 1. For illustration simplicity, this figure assumes that
there is only one non-assessed learning resource $l_t$ between
$q_t$ and $q_{t+1}$. This architecture mainly contains two
components: a read component for making a prediction on input
question $q_t$ and a write component for updating the value
matrix after interacting with $l_t$ and $q_t$.


When there are multiple non-assessed learning activities
between $q_t$ and $q_{t+1}$, that is $\mathcal{L}_t = \{l_t^1, \dots, l_t^n\}$, we can simply
extend the model by looping over each activity to generate
$k_t^{l_i}$ as well as $r_t^{l_i}$ using Equation (4) for $i \in \{1, \dots, n\}$. When
making predictions, we use $\sum_{i=1}^{n} k_t^{l_i}$ to represent $k_t^l$ and
$\sum_{i=1}^{n} r_t^{l_i}$ to represent $r_t^l$ in the architecture. When updating
the knowledge, the value matrix is updated sequentially
over all activities as described in the previous subsection.
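The aggregation over several non-assessed activities can be illustrated with a small helper that sums vectors element-wise. This is only a sketch under the assumption that each activity's embedding and read vector are available as equal-length Python lists; the helper name and the numbers are hypothetical.

```python
def sum_vectors(vectors):
    """Element-wise sum of a non-empty list of equal-length vectors."""
    return [sum(column) for column in zip(*vectors)]

# Two lectures watched between two questions (toy embeddings).
lecture_embeddings = [[1.0, 2.0], [3.0, 4.0]]
lecture_reads = [[0.5, 0.0], [0.25, 0.25]]

k_l = sum_vectors(lecture_embeddings)   # stands in for the lecture embedding
r_l = sum_vectors(lecture_reads)        # stands in for the lecture read vector
```

The summed vectors take the place of the single-lecture embedding and read vector in the prediction step, while the value memory itself is still updated one activity at a time.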


3.7 Training 

All learnable parameters, e.g., $A^q$, $A^l$, $B$, in the entire DMKT
model are trained in an end-to-end manner by minimizing the
binary cross-entropy loss of all students' assessed responses,
i.e.,

$$\mathcal{L}_{BCE} = -\sum_{t} \left(o_t \log p_t + (1 - o_t) \log(1 - p_t)\right) \quad (11)$$

where $o_t$ denotes the observation of correctness on the assessed
response at time $t$ and $p_t$ denotes the prediction of correctness
of DMKT at time $t$.
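The loss in Eq. (11) can be computed directly from the observed outcomes and predicted probabilities. The sketch below is a plain Python illustration; the clamping constant `eps` is an added assumption to avoid taking the logarithm of zero and is not part of the equation.

```python
import math

def bce_loss(observed, predicted, eps=1e-12):
    """Summed binary cross-entropy over all assessed responses."""
    total = 0.0
    for o, p in zip(observed, predicted):
        p = min(max(p, eps), 1.0 - eps)   # assumption: clamp for stability
        total += o * math.log(p) + (1 - o) * math.log(1.0 - p)
    return -total

# Two responses predicted with probability 0.5 each.
loss = bce_loss([1, 0], [0.5, 0.5])
```

A confident correct prediction contributes almost nothing to the loss, while a confident wrong prediction is penalized heavily.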


3.8 Knowledge State Calculation 

DMKT is capable of tracing and depicting knowledge con- 
cept mastery level for each student. A student’s knowledge 
state before each assessed or non-assessed learning activity 
can be obtained in the read process using the following steps. 


Assume that there are $N$ dummy query questions $q^i$, each
of them only using one concept, for the purpose of knowledge
state calculation. Each dummy question can obtain a designed
embedding $k^i$ such that the correlation weight vector
$w^i$ is one-hot, that is $w^i = [0, \dots, w_i, \dots, 0]$ where $w_i$
of concept $c^i$ is equal to 1. Then, we can use each of these
one-hot correlation weight vectors to access the value matrix
state on each slot $M_t^v(i)$ to obtain $r^i$ for each concept $c^i$. In
other words, $r^i = M_t^v(i)$ for $q^i$.


Then, we can predict the student knowledge purely based on
$r^i$ by masking the weight of the input content embedding in
Eq. (5), which ends up as:

$$f^i = \mathrm{Tanh}\left([W_1^r, 0, 0, 0]^\top [r^i, r_{t-1}^l, k_t^q, k_{t-1}^l] + b_1\right) \quad (12)$$

where $W_1$ is split into four parts, one for each concatenated
input: the part $W_1^r$ that multiplies $r^i$ is kept, and the
other three parts are set to 0. Finally, a scalar value $p^i$ is output
as in Eq. (6) to be the predicted mastery level of concept
$c^i$. We repeat this process $N$ times with $N$ one-hot
correlation weight vectors to obtain the student's knowledge
state vector with size $1 \times N$ after each learning activity.
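The knowledge state calculation can be sketched as a loop over the value-memory slots. The sketch assumes that only the block of the first-layer weights acting on the read vector is kept (called `W1_r` here, a hypothetical name), stored as one row per summary unit; the toy parameters are illustrative and untrained.

```python
import math

def knowledge_state(M, W1_r, b1, W2, b2):
    """Mastery level per concept, using one-hot reads r^i = M(i)."""
    state = []
    for slot in M:                                   # one slot per concept
        f = [math.tanh(sum(w * x for w, x in zip(row, slot)) + b)
             for row, b in zip(W1_r, b1)]            # masked summary vector
        logit = sum(w * x for w, x in zip(W2, f)) + b2
        state.append(1.0 / (1.0 + math.exp(-logit)))  # scalar mastery level
    return state

# Toy memory with N = 2 concepts and d_v = 2.
M = [[0.0, 0.0], [1.0, 1.0]]
state = knowledge_state(M, W1_r=[[1.0, 1.0]], b1=[0.0], W2=[1.0], b2=0.0)
```

The result has one entry per concept, so repeating the call after every learning activity yields the sequence of state vectors that a heat-map style visualization would display.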


4. EXPERIMENTS 
Figure 1: Neural Network Architecture of DMKT.


To evaluate our proposed model, we conduct three kinds 
of experiments. First, we compare it with state-of-the-art 
baselines in the student performance prediction task. Sec- 
ond, we analyze the discovered student knowledge transition 
patterns in terms of assessed and non-assessed learning ac- 
tivities. Last but not least, we validate the non-assessed 
learning resources’ latent concepts discovered by the pro- 
posed method. 


4.1 Datasets 

We use three real-world datasets to evaluate the proposed 
model: 

MORF is an open online course dataset from Coursera [3].
In this course, students can watch lecture videos and work
on problems. Each problem is a full complex course assignment.
These video lectures and assignments are published
in sequential order in this dataset, but students can have
multiple attempts on each assignment and watch any video
at any time. Students' scores are normalized into [0, 1].

EdNet is collected by Santa, a multi-platform AI tutoring
service for students preparing for the TOEIC English test. We
use the problem explanation documents as the non-assessed
learning resources. There are 297,915 user records in the
full dataset, and we randomly extract 1,000 users' records
for experiments.

MORF: https://educational-technology-collective.github.
EdNet: https://github.com/riiid/ednet
Santa: https://aitutorsanta.com/intro

Junyi is a dataset that comes from a Chinese e-learning
website. Students work on problems from 8 math areas.
Each problem has several hints, which students can request
when solving problems. We consider the problems as the
assessed learning resources and the associated problem hints as
the non-assessed learning resources. There are 25,925,922
records in total from 247,606 users in the full dataset. We
extract two subsets of this full dataset for experiments. One
is called Junyi2063, which contains 2063 users' records on
3760 questions and 1432 hints. A smaller dataset named
Junyi1564, which consists of 1564 users' records on 142
questions and 116 hints, is extracted for the purpose of
visualizing concept discovery results. The descriptive
statistics of these four datasets are shown in Table 1.


4.2 Baseline Methods 


In the performance prediction experiments, we compare with
13 baseline methods on the task of student performance
prediction on assessed learning resources, including six
state-of-the-art deep learning based knowledge tracing models, one
existing tensor factorization based knowledge tracing model
supporting multiple learning resource types, and six
extended deep learning based models utilizing non-assessed
learning resources as additional input features. These
methods are:

Junyi: https://pslcdatashop.web.cmu.edu/DatasetInfo?
Table 1: Descriptive Statistics of 3 Datasets.

Dataset    Users  Questions  Question  Mean Question  STD Question  Correct Question  Incorrect Question  Non-gradable  Non-gradable
                             Records   Responses      Responses     Responses         Responses           Materials     Records
MORF         686         10     12031  0.7763         0.2507        N/A               N/A                           52         41980
EdNet       1000      11249    200931  N/A            N/A           118767            82184                       8324        150821
Junyi1564   1564        142    120984  N/A            N/A           86654             34328                        116         16389
Junyi2063   2063       3760    290754  N/A            N/A           193664            97090                       1432         69050


• DKT [18]: a pioneering deep learning based knowledge
tracing method that uses an LSTM to model students'
knowledge transition over time.

• DKVMN [23]: a variant of memory augmented neural
networks that models the latent knowledge concepts
and the dynamic student knowledge state over time.

• DeepIRT [21]: an extension of DKVMN that integrates
the one parameter logistic item response theory
(1PL-IRT) to provide better interpretability, which
could reduce the overfitting issue.

• SAKT [16]: an attention-based method that leverages
the self-attention mechanism to model the
interdependencies among interactions in the sequence.

• SAINT [5]: a transformer-based deep knowledge
tracing method in which two multi-head attention mechanisms
are used to model exercises and responses separately.

• AKT [9]: a variant of the transformer-based deep knowledge
tracing method that uses a monotonic attention
mechanism to model how each of a student's past
performances on questions contributes differently to
knowledge transition.


In addition to those baselines that support assessed learning 
materials, we also compare our method with some baselines 
that either can leverage additional students’ non-assessed 
learning activities by design, or we modify them to consider 
such non-assessed activities as features of the assessed ones 
and predict students’ future performance. These methods 
are: 


• MLP-M: a simple multi-layer perceptron that takes
the query question ID, the user ID, the user's 3 past
historical records on the current query question, and the
3 most recent non-assessed learning activities as input,
and outputs a probability of the user's mastery level
on the query question.

• DKT-M [24]: an enhanced DKT model that can
incorporate additional question features by concatenating
the feature embeddings with the exercise response
embedding as the input of vanilla DKT.

• SAINT-M [5]: a variant of SAINT that sums
over all embeddings of non-gradable activities along
with the position encoding as the input of SAINT.

• MVKM [25]: a state-of-the-art method for modeling
student knowledge transition over multiple learning
resource types based on multiview tensor factorization.


Inspired by DKT-M [24], we apply the same strategy to
DKVMN to incorporate additional non-assessed learning
activities as features, which yields the method DKVMN-M. Also,
inspired by the way SAINT-M [5] incorporates rich features
into a transformer-based model, we apply the same strategy
as described in that paper to SAKT and AKT to incorporate
additional non-assessed learning activities as response
features, which yields the baseline methods SAKT-M and
AKT-M, respectively.


4.3 Implementation Details 

For binary response datasets, including the EdNet and Junyi
datasets, we convert the response tuple $(q_t, r_t)$ into a single
value $x_t = q_t + r_t \times Q \in \{1, \dots, 2Q\}$ as the lookup key of the
embedding layer. For the numerical response MORF dataset, we
feed the tuple $(q_t, r_t)$ into a linear layer to get the embedding.
For the question $q_t$ and non-assessed learning resource
$l_t$, we feed their IDs into the embedding layers.
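The conversion of a binary response tuple into a single lookup key can be written as a one-line function. The sketch assumes question IDs start at 1 and responses are 0 or 1; the function name and the validation checks are illustrative additions.

```python
def encode_interaction(q, r, num_questions):
    """Map (question ID, binary response) to a key in {1, ..., 2Q}."""
    if not 1 <= q <= num_questions:
        raise ValueError("question ID must be in 1..Q")
    if r not in (0, 1):
        raise ValueError("response must be 0 or 1")
    return q + r * num_questions

# With Q = 10, an incorrect answer keeps the question ID,
# and a correct answer is shifted into the second half of the range.
key_wrong = encode_interaction(3, 0, 10)
key_right = encode_interaction(3, 1, 10)
```

This keeps the two outcomes of each question in separate rows of the response embedding matrix, which is why that matrix has 2Q rows.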


For evaluation purposes, we perform 5-fold user-stratified
cross-validation for all models and all datasets. Hence, for
each fold, 60% of the users are used as the training set, 20% as
the validation set, and the remaining 20% as the test set. For each
fold and every method, we use the validation set to tune the
hyper-parameters and record the optimal training loss as
the condition of early stopping.
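One way to realize a user-level 60/20/20 split over five folds is sketched below. The paper does not spell out the stratification details, so this sketch makes its own assumptions: users are shuffled with a fixed seed, divided into five equal parts, and the parts are rotated so that each one serves once as the test set and once as the validation set. The function name and seed are hypothetical.

```python
import random

def user_folds(user_ids, n_folds=5, seed=0):
    """Return a list of (train, validation, test) user lists, one per fold."""
    users = sorted(user_ids)
    random.Random(seed).shuffle(users)
    parts = [users[i::n_folds] for i in range(n_folds)]
    folds = []
    for k in range(n_folds):
        val_idx = (k + 1) % n_folds
        test = parts[k]
        val = parts[val_idx]
        train = [u for j, part in enumerate(parts)
                 if j not in (k, val_idx) for u in part]
        folds.append((train, val, test))
    return folds

folds = user_folds(range(100))
```

Splitting by user rather than by interaction keeps all records of a student in a single partition, so no student's sequence leaks from training into testing.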


We use a Gaussian distribution with 0 mean and 0.2
standard deviation to initialize the values of $M^k$ and $M_0^v$.
We train the model using Adam optimization with a
learning rate of 0.01 and halve the learning rate whenever
the training loss increases, with a minimum learning rate of 1e-5,
for all methods and for at most 200 epochs. We also set the norm
clipping threshold to 50.0 to avoid exploding gradients for all
methods. In addition, we follow the general processing steps
for knowledge tracing, which truncate long sequences and pad
short sequences with 0s. The sequence length is considered
a hyper-parameter of all models that needs to be tuned.
We also tune the maximum sequence length of non-assessed
learning activities between two assessed learning
activities, $|\mathcal{L}_t|$. If the number of non-assessed learning activities
is over the maximum size, we take the most recent ones.
Similarly, if the number is less than the required sequence
length, we pad with 0s. Table 2 shows the best
hyper-parameters of our DMKT on the 4 datasets.
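The truncation and padding rule for non-assessed activity sequences can be sketched as follows. The text states that the most recent activities are kept and that short sequences are padded with 0s; padding on the right is an assumption of this sketch, as is the function name.

```python
def fit_length(seq, max_len, pad=0):
    """Keep the most recent max_len items, or pad with zeros to max_len."""
    if len(seq) >= max_len:
        return list(seq[-max_len:])
    # Assumption: padding is appended after the real items.
    return list(seq) + [pad] * (max_len - len(seq))

long_seq = fit_length([1, 2, 3, 4, 5], 3)   # too long: keep the last three
short_seq = fit_length([7], 3)              # too short: pad with zeros
```

Reserving ID 0 for padding implies that real questions and resources are numbered from 1, which matches the lookup key range starting at 1.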


We implement the models using PyTorch on a computer
with a single NVIDIA Tesla-K80 GPU. For DKT and DKT-M,
our implementation is different from the original paper
[13]: we use norm clipping and early stopping, as suggested
in prior work, which could ease the exploding gradient
as well as the overfitting issues of LSTM. Xavier
initialization is also used to initialize the parameters in DKT
and DKT-M. All the baseline methods are implemented in
PyTorch and tested to achieve performance similar to that
reported in the original papers, except SAINT and SAINT-M.
For SAINT, we borrow the implementation from GitHub
and extend it to SAINT-M, since the authors did not
release the code.

Table 2: Hyperparameters of DMKT

Dataset    Embedding dim.  N    seq. len.  |L_t|
MORF       128             8    50         8
EdNet      128             «68  50         2
Junyi1564  256             8    50         2
Junyi2063  256             32   50         2


4.4 Student Performance Prediction 
The results of predicting students' performance on the
assessed learning resources, including their 95% confidence
intervals, are shown in Table 3. RMSE is measured
to evaluate the prediction performance on the MORF dataset
because of its numerical user responses, and AUC is measured
on the EdNet and Junyi datasets. A low RMSE indicates
high prediction performance. An AUC of 0.5 means that
the model performs no better than random guessing,
and a high AUC indicates high prediction
performance. As shown in the table, our proposed method, DMKT,
achieves the best performance over all baseline methods on
all four datasets. This shows that explicitly modeling
non-assessed learning materials, along with the assessed ones, is
essential in capturing the variations in student performance
data.
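The two evaluation metrics can be computed with a few lines of plain Python. The sketch below is illustrative only: AUC is computed by direct pairwise comparison of positive and negative examples (ties count as one half), which is simple but quadratic in the number of responses, and the function names are not taken from any library.

```python
import math

def rmse(y_true, y_pred):
    """Root mean square error for numerical responses."""
    n = len(y_true)
    return math.sqrt(sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / n)

def auc(labels, scores):
    """Probability that a random positive is scored above a random negative."""
    pos = [s for l, s in zip(labels, scores) if l == 1]
    neg = [s for l, s in zip(labels, scores) if l == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

perfect = auc([1, 0, 1, 0], [0.9, 0.1, 0.8, 0.2])
chance = auc([1, 0], [0.5, 0.5])
partial = auc([1, 1, 0, 0], [0.9, 0.3, 0.5, 0.1])
```

In the toy calls, a perfect ranking gives an AUC of 1.0, identical scores give 0.5, and one misordered pair out of four gives 0.75.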


We can also see that by simply incorporating the non-assessed
activities between two assessed activities as additional
input features (the "M" models), the prediction performance
is improved in some methods, such as AKT-M on the MORF,
EdNet, and Junyi2063 datasets. However, unlike attention-based
methods, which can learn interaction correlations in a long
sequence, this kind of simple integration strategy does not
improve and may harm the prediction performance in other
methods, such as DKT-M and DKVMN-M, which tend
to summarize past historical records as context embeddings.
We believe the reason is that this trivial integration of non-assessed
activities not only loses a large amount of the sequential
information needed to model student knowledge transition over time,
but could also introduce more noise into the data. We
conclude that simply adding the non-assessed learning
activities as features, without modeling them explicitly, is not
enough and may even harm the prediction performance in
some models.


SAINT and SAINT-M have a transformer-based architecture,
which can stack multiple encoders and decoders. However,
on our EdNet dataset, which contains only 1,000 users with
200,931 records on 11,249 questions, and without additional
constraints or regularization as proposed in AKT (another
transformer-based model), SAINT and SAINT-M can easily
overfit the data. MVKM is the only existing baseline
method that can explicitly model multiple learning resource
types. We can see that it outperforms most of the deep
knowledge tracing methods that use non-assessed learning
materials as features (all except DKT-M) on MORF, which is a
mid-size dataset. However, it cannot run efficiently on the larger
datasets, as the memory usage and the linear time complexity over
the number of interaction records in MVKM limit its applicability
to large datasets, such as EdNet and Junyi. Therefore, due to
the long running time on the EdNet and Junyi datasets, we only
report its performance on the MORF dataset.

SAINT implementation: https://github.com/Shivanandmn/


It is worth noting that when our model is fed with assessed
learning resources only, it is equivalent to DKVMN.
However, as presented in the table, our proposed model
DMKT achieves better performance than DKVMN as well
as DKVMN-M, because DMKT explicitly models the
student knowledge transition on non-assessed learning
activities, which provides more accurately encoded information
for making the predictions.


4.5 Student Knowledge State Visualization 

To see how interpretable the discovered student knowledge
states are, we visualize the students' knowledge states.
Knowledge state visualization shows a student's knowledge
mastery level on each concept before each attempt on a
non-assessed or an assessed learning activity. This provides
a useful tool to monitor student knowledge coverage over
different concepts and helps instructors identify the concepts
a student is lacking so as to provide tailored instruction
for each student. To visualize student knowledge states, we
follow the steps in Section 3.8 to calculate knowledge state
values over all concepts across the student sequence for each
student. We show a visualization of one example student's
knowledge states in the MORF dataset in Figure 2. In the
figure, the top x-ticks are labeled with student learning
activities. Assessed learning materials (assignments) start
with A, and non-assessed ones (lecture videos)
are annotated by the week they are scheduled and the
position of the video lecture within the week. For example, W4V0
means the student has watched week 4 video lecture 0, and
A1B denotes Assignment-1B in week 1. The bottom x-ticks
are labeled by either student performance (grade) in
the assessed learning materials or an icon indicating the
non-assessed learning resource type. Each row represents
one latent concept. In the figure, this student starts with a
randomly initialized value memory matrix $M_0^v$ at time step
0 before working on A1B. After finishing A1B, the student's
knowledge is updated and increases a little on concepts 3 and
6 before working on A3. The student's knowledge grows gradually
by working on assignment A3 and watching video lectures
in week 4. However, the student's knowledge drops a little
before working on assignment A4, which may explain
why the student only receives a score of 0.3 at the first
attempt. The student's knowledge on all concepts grows by working
on the assignments until the student starts watching
video lecture W6V1. We can see a slight drop in the student's
knowledge of some of the concepts (e.g., 7) and an increase in
other concepts (e.g., 1) while they are watching these videos.
One potential reason for the decrease on concept 7 could be
the lack of practice with assignments. Watching video lectures
indeed improves student knowledge on concepts 1 and
2. Another reason for the drop in concept 7 could be
related to the student's problem solving ability, which results
in a score of 0.3 in their first attempt on Assignment A7.
Once the student's first attempt on A7 is done, this student
quickly masters concept 7 again and their knowledge
on all concepts continues to grow along different activities.
In this example, it seems the assessed learning material
improves student knowledge more than watching video lectures,
which is in line with the previous literature [10] [13]. Another
observation is that this student skips watching video lectures
in weeks 1, 2, and 3 before working on assignment A3.
Similarly, they did not watch videos in weeks 5 and 6 before
trying A5 and A6. This may indicate that this student is not
interested in watching video lectures and may not be fully
present while watching them, which results in tiny
knowledge growth from watching them.

Table 3: Student Performance Prediction Results on 3 Real-World Datasets. Root Mean Square Error (RMSE) and Area Under
Curve (AUC) are used to evaluate performance on datasets with numerical feedback and binary feedback, respectively. The
average performance over 5 folds as well as 95% confidence interval are reported.

Methods    MORF (RMSE)       EdNet (AUC)       Junyi1564 (AUC)   Junyi2063 (AUC)
DKT        0.1870 ± 0.0191   0.6393 ± 0.0137   0.8877 ± 0.0050   0.8635 ± 0.0059
DKVMN      0.2042 ± 0.0136   0.6296 ± 0.0104   0.8843 ± 0.0065   0.8558 ± 0.0068
DeepIRT    0.1946 ± 0.0080   0.6290 ± 0.0105   0.8749 ± 0.0053   0.8498 ± 0.0069
SAKT       0.2113 ± 0.0275   0.6334 ± 0.0125   0.8623 ± 0.0047   0.8053 ± 0.0075
SAINT      0.2019 ± 0.0077   0.5205 ± 0.0064   0.8454 ± 0.0096   0.7951 ± 0.0119
AKT        0.2420 ± 0.0155   0.6393 ± 0.0104   0.8311 ± 0.0102   0.8093 ± 0.0091
MVKM       0.1936 ± 0.0096   -                 -                 -
MLP-M      0.2433 ± 0.0350   0.6102 ± 0.0088   0.7055 ± 0.0191   0.7290 ± 0.0150
DKT-M      0.1927 ± 0.0194   0.6372 ± 0.0120   0.8885 ± 0.0048   0.8652 ± 0.0069
DKVMN-M    0.2251 ± 0.0128   0.6343 ± 0.0074   0.8948 ± 0.0054   0.8513 ± 0.0059
SAKT-M     0.2084 ± 0.0272   0.6323 ± 0.0109   0.8305 ± 0.0071   0.7911 ± 0.0107
SAINT-M    0.1977 ± 0.0055   0.5491 ± 0.0068   0.8454 ± 0.0096   0.7741 ± 0.0139
AKT-M      0.2239 ± 0.0151   0.6404 ± 0.0067   0.8296 ± 0.0093   0.8099 ± 0.0098
DMKT       0.1369 ± 0.0195   0.6675 ± 0.0082   0.9440 ± 0.0061   0.8714 ± 0.0069

Figure 3: Concepts Matrix of Video Lectures in MORF Dataset.


4.6 Concept Discovery 

In addition to tracing student knowledge over various types
of learning activities, DMKT can provide a feasible solution
to discovering underlying patterns or concepts for both
assessed and non-assessed learning resources. In other words,
the correlation weights $w^q$ and $w^l$ can be interpreted as the
importance of latent concepts in each assessed and
non-assessed learning activity, respectively. Since
the key matrix $M^k$ is used to model the knowledge
concepts of the full course, the correlation weight between the
learning resources and the concepts implies the strength of
their inner relationship. Not only can we use the correlation
weights as latent concepts, we can also use them to find
similar learning resources by clustering them over these
correlation weights.

Figure 4: Cluster Graph of Non-gradable Learning Materials (Hints) in Junyi1564 Dataset Using
corresponding to each hint is shown in the right table. (Best viewed in color)

Figure 5: Cluster Graph of Video Lectures using t-SNE and Titles of Video Lecture of MORF Dataset. Lectures under the same
concept are labeled in the same color in the left picture and also are put in the same block in the right table. (Best viewed in
color)


For example, in Figure 3 we visualize the importance of
each concept in each of the MORF dataset video lectures.
The X-axis ticks show the video lecture weeks and numbers,
and the Y-axis shows the latent concepts. As we can see, the
concept matrix is relatively sparse, showing that most video
lectures strongly belong to 2-3 concepts, while they do have
soft memberships in other concepts too. Many video lectures
in the same week have similar concept structures. For
example, videos 3, 4, and 5 of week 5 all have a strong
representation of concept 0, and videos 0, 1, and 2 of week 1 all
have high correlation weights with concept 2. Given
that the course schedule is designed by the instructor, such
similarities between the concepts in videos of the same week
are expected. Another interesting observation is the strong
appearance of some concepts in videos of different weeks. For
example, concept 1 can be seen in both video 4 of week 6
and video 4 of week 8. This shows that these two video
lectures share some similarities that are not represented in the class
schedule. Looking at the video titles from this course
(right-hand side of Figure 5), we can see that the video titles are
State Space Diagrams and Hidden Markov Models, respectively,
which are two very closely related topics. To better
understand such similarities, we look at the grouping of videos
according to their discovered concepts in the following.


To this end, we follow a clustering procedure from prior work
to group the learning materials according to the discovered
latent concepts. At the same time, we compare these groupings
by looking at the problem name associated with each
hint and the lecture titles for the Junyi1564 and MORF datasets in
Figures 4 and 5, respectively. To do the clustering, we first
assign each learning resource the concept ID that has
the largest correlation weight as its cluster label. Since
there are 8 concepts in total, this results in 8 clusters. Then,
we use t-SNE to visualize the clusters, which are shown on
the left sides of Figures 4 and 5 for the Junyi1564 and MORF
datasets, respectively.
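The cluster label assignment described above amounts to taking the index of the largest correlation weight for each learning resource. A minimal sketch, with a hypothetical function name and toy weights over three concepts:

```python
def cluster_labels(weights):
    """Assign each resource the concept ID with the largest weight."""
    return [max(range(len(w)), key=w.__getitem__) for w in weights]

# Two resources, each with correlation weights over 3 concepts.
labels = cluster_labels([[0.1, 0.7, 0.2],
                         [0.6, 0.3, 0.1]])
```

Resources that share a label fall into the same cluster; the t-SNE projection is then used only to draw the weight vectors in two dimensions.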


As we can see, the resulting t-SNE clusters are more distinct
in the Junyi1564 dataset compared to MORF. In other
words, most of the clusters in the Junyi1564 dataset can be easily
separated and distinguished. This implies that the discovered
concept matrix of the Junyi1564 dataset is sparser
than the one from the MORF dataset, leading to more
clearly separated clusters than in MORF, as shown in Figure 4.
Indeed, we have seen from Figure 3 that each video lecture
in MORF is associated with two to three latent concepts
rather than having only one distinct concept. This finding
matches these datasets' properties: in the MORF dataset,
each assessed learning material is a full complex course problem
set that is assigned to students every week, and each
non-assessed learning resource is a video lecture that covers
multiple knowledge concepts. In contrast, the assessed
learning materials in the Junyi1564 dataset are simple
math problems, with close-to-atomic concept coverage, and
the non-assessed resources are hints associated with these
problems.


As another result of this clustering, and similar to our findings
in Figure 3, we can see that the more similar or related
non-assessed learning materials are clustered together.
For example, in Figure 5, video lectures from week 5 are
clustered together, showcasing the similarity between latent
concepts in video lectures that are scheduled to be presented
together in week 5 of the course. Additionally, video lectures
that are conceptually similar to each other are
grouped together. For example, the video lectures from week
6 (V4 - State Space Diagrams) and week 8 (V4 - Hidden
Markov Models), from week 1 (V2 - Regressors) and week 7
(V5 - Factor Analysis), and from week 1 (V6 - Case Study
- San Pedro) and week 8 (V2 - Case Study - Discovery with
models) are grouped together.


These findings are also in accordance with previous findings
in the literature on the MORF dataset and show
that DMKT can efficiently discover the underlying concepts
presented in the non-assessed learning materials, even though
student performance on them is not observable.


5. CONCLUSION AND FUTURE WORK 


In this paper, we proposed DMKT, the first deep learning
based knowledge tracing model that can model and trace
student knowledge in both assessed and non-assessed learning
resources, find the underlying connections and similarities
between learning resources, and predict student performance
on the assessed ones. We evaluated DMKT extensively on
four real-world datasets and demonstrated that, because of
its explicit modeling of non-assessed learning materials, its
ability to represent non-linear relationships, and its
capacity to handle larger amounts of data, it outperforms all
the baselines in accurately predicting student performance.
We further showcased DMKT's ability to meaningfully trace
student knowledge over assessed and non-assessed learning
resources, and the potential effect that each of them
can have on student knowledge. In our particular example,
we showed that solving problems is a more effective way to
learn for our selected student, compared to watching video
lectures. Finally, we showed that DMKT can find
interpretable latent concepts of non-assessed learning materials
that can be used to group them into meaningful clusters.
In future work, we would like to explore this model on
various learning activities to learn hidden patterns in
different learning resources so as to provide tailored learning
resource recommendations.
ABSTRACT

This paper drills deeper into the documented effects of the 
Cognitive Tutor Algebra I and ASSISTments intelligent tu- 
toring systems by estimating their effects on specific prob- 
lems. We start by describing a multilevel Rasch-type model 
that facilitates testing for differences in the effects between 
problems and precise problem-specific effect estimation without
the need for multiple comparisons corrections. We find
that the effects of both intelligent tutors vary between problems:
the effects are positive for some, negative for others, and
undeterminable for the rest. Next we explore hypotheses
explaining why effects might be larger for some problems 
than for others. In the case of ASSISTments, there is no 
evidence that problems that are more closely related to stu- 
dents’ work in the tutor displayed larger treatment effects. 


Keywords 
Causal impact estimates, multilevel modeling, intelligent
tutoring systems


1. INTRODUCTION: AVERAGE AND ITEM-SPECIFIC EFFECTS


The past decade has seen increasing evidence of the effec- 
tiveness of intelligent tutoring systems (ITS) in supporting 
student learning [7][13]. However, surprisingly little is known
about the details of these effects, such as which students
experience the biggest benefits and under what conditions. This
paper focuses on the question of which areas of learning were
most affected in two different year-long randomized trials: one
of the Cognitive Tutor Algebra I curriculum (CTA1) [17] and one
of the ASSISTments ITS [22].


Large-scale efficacy or effectiveness trials in education
research, including evaluations of ITS [17][18][22], often
estimate the effect of an educational intervention on student
scores on a standardized test. These tests consist of many 
items, each of which tests student abilities in, potentially, a 
separate set of skills. Prior to estimating program effects, 
analysts collapse data across items into student scores, of- 
ten using item response theory models [25] that measure 
both item- and student-level parameters. Then, these stu- 
dent scores are compared between students assigned to the 
intervention group and those assigned to control. 


This approach has its advantages, in terms of simplicity 
and (at least after aggregating item data into test scores) 
model-free causal identification. If each item is a measure- 
ment of one underlying latent construct (such as “algebra 
ability”) aggregating items into test scores yields efficiency 
gains. However, in the (quite plausible) case that posttest 
items actually measure different skills, and the impact of 
the ITS varies from skill to skill, item-specific impacts can 
be quite informative. 


In the case of CTA1 and ASSISTments, we find that, indeed, 
the ITS affect student performance differently on different 
posttest items, though at this stage it is unclear why the 
effects differed.


The following section gives an overview of the two large- 
scale ITS evaluations we will discuss, including a discus- 
sion of the available data and of the two posttests. Next, 
Section 3 will discuss the Bayesian multilevel model we use 
to estimate item-specific effects, including a discussion of 
multiple comparisons; Section 4 will discuss the results— 
estimates of how the two ITS impacted different posttest 
items differently; Section 5 will present a preliminary ex- 
ploration of some hypotheses as to why ASSISTments may 
have impacted different skills differently; and Section 6 will 
conclude. 


2. THE CTA1 AND ASSISTMENTS TRIALS 
This paper uses data from two large-scale field trials of ITSs:
CTA1 and ASSISTments. The CTA1 intervention consisted
of a complete curriculum, combining the Cognitive Tutor
ITS with a student-centered classroom curriculum.
CTA1 was created and run by Carnegie Learning; an updated
version of the ITS is now known as MATHia. The Cognitive
Tutor is described in more detail in [2] and elsewhere,
and the effectiveness trial is described in [17]. ASSISTments 
is a free online-homework platform, hosted by Worcester 
Polytechnic Institute, that combines electronic versions of 
textbook problems, including on-demand hints and imme- 
diate feedback, with bespoke mastery-based problem sets 
known as “skill builders.” ASSISTments is described in [10] 
and the efficacy trial is described in [22]. 


This section describes the essential aspects of the field trials 
and the data that we will use in the rest of the paper. 


2.1 The CTA1 Effectiveness Trial 

From 2007 to 2010, the RAND Corporation conducted a 
randomized controlled trial to compare the effectiveness of 
the CTA1 curriculum to business as usual (BaU). The study 
tested CTA1 under authentic, natural conditions; that is,
oversight and support of CTA1’s use were the same as they would
have been had no study been conducted. Nearly
20,000 students in 70 high schools (n = 13,316 students) and
76 middle schools (n = 5,938) located in 52 diverse school 
districts in seven states participated in the study. Partici- 
pating students in Algebra I classrooms took an algebra I 
pretest and a posttest, both from the CTB/McGraw-Hill 
Acuity series. 


Schools were blocked into pairs prior to randomization, based 
on a set of baseline, school-level covariates, and within each
pair, one school was assigned to the CTA1 arm and the other 
to BaU. In the treatment schools, students taking algebra 
I were supposed to use the CTA1 curriculum, including the 
Cognitive Tutor software; of course, the extent of compliance 
varied widely [12][11]. 


Results from the first and second year of the study were re- 
ported separately for middle and high schools. In the first 
year, the estimated treatment effect was close to zero in mid- 
dle schools and slightly negative in high schools. However, 
the 95% confidence intervals for both these results included 
negative, null, and positive effects. In the second year, the 
estimated treatment effect was positive (roughly one fifth of
a standard deviation) for both middle and high schools, but
it was statistically significant only in the high school
stratum.


In this study, we make use of students’ overall scores on 
the pretest, anonymized student, teacher, school, and ran- 
domization block IDs, and an indicator variable for whether 
each student’s school was assigned to CTA1 or BaU,
along with item-level posttest data: whether each student 
answered each posttest item correctly. For the purposes of 
this study, skipped items were considered incorrect. 
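
To make the data layout concrete, the following minimal sketch converts one student's raw answers into the long, one-row-per-student-item format that an item-level model needs, scoring skipped items as incorrect. The record layout, the use of `None` for a skipped item, the answer key, and the function name are all illustrative assumptions; they are not the study's actual file format.

```python
from typing import Optional


def to_long_format(student_id: str,
                   responses: dict[int, Optional[str]],
                   answer_key: dict[int, str]) -> list[tuple[str, int, int]]:
    """Return (student, item, correct) rows; a skipped item (None) scores 0."""
    rows = []
    for item in sorted(answer_key):
        given = responses.get(item)  # None if the item was skipped or is missing
        correct = int(given is not None and given == answer_key[item])
        rows.append((student_id, item, correct))
    return rows


# Hypothetical three-item answer key and one student's answers (item 2 skipped).
key = {1: "B", 2: "D", 3: "A"}
rows = to_long_format("s001", {1: "B", 2: None, 3: "C"}, key)
print(rows)  # [('s001', 1, 1), ('s001', 2, 0), ('s001', 3, 0)]
```

Each row can then carry the student's pretest score, treatment indicator, and school identifiers as additional columns.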


2.1.1 Posttest: The Algebra Proficiency Exam 

The RAND CTA1 study measured algebra I learning
over the course of the year using the McGraw-Hill Algebra
Proficiency Exam (APE). This was a multiple choice
standardized test with 32 items testing a mix of algebra and
pre-algebra skills. Table 1 categorizes the test’s items by
the algebra skills they require, and gives an example of a 
problem that would fall into each category. The categoriza- 
tion was taken from the exam’s technical report [6]. 


2.2 The Maine ASSISTments Trial 

From 2012 to 2014, SRI International conducted a randomized
field trial in the state of Maine to estimate the efficacy
of ASSISTments in improving 7th grade mathematics
achievement. Forty-five middle schools from across the
state of Maine were randomly assigned between two condi- 
tions: 23 middle schools were assigned to a treatment condi- 
tion; mathematics teachers in these schools were instructed 
to use ASSISTments to assign homework, receiving support 
and professional development while doing so. The remain- 
ing 22 schools in the BaU condition were barred from using 
ASSISTments during the course of the study but were of- 
fered the same resources and professional development as 
the treatment group after the study was over. The study 
was conducted in Maine due to the state’s program of pro- 
viding every student with a laptop, which allowed students 
to complete homework online. 


The 45 participating schools were grouped into 21 pairs and 
one triplet based on school size and prior state standard- 
ized exam scores; one school in each pair, and two schools 
in the triplet, were assigned to the ASSISTments condition, 
with the remaining schools assigned to BaU. Subsequent to 
random assignment, one of the treatment schools dropped 
out of the study, but its matched pair did not. Although 
the study team continued to gather data from the now- 
unmatched control school, that data was not included in the 
study. However, we are currently unable to identify which of 
the control schools was excluded from the final data analysis, 
so the analysis here includes 44 schools, while [22] includes 
only 43. 
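
The blocked design described above can be sketched in a few lines. This is only an illustration of randomizing within pairs and a triplet: the school names, the seed, and the function are hypothetical, and the procedure actually used in the trial is not documented here.

```python
import random


def assign_blocks(blocks: list[list[str]], seed: int = 0) -> dict[str, str]:
    """Randomize within blocks: one treated school per pair, two per triplet."""
    rng = random.Random(seed)
    assignment = {}
    for block in blocks:
        n_treated = 1 if len(block) == 2 else 2  # the triplet gets two treated schools
        treated = set(rng.sample(block, n_treated))
        for school in block:
            assignment[school] = "ASSISTments" if school in treated else "BaU"
    return assignment


# 21 hypothetical pairs and one triplet, 45 schools in total.
pairs = [[f"school_{2 * b}", f"school_{2 * b + 1}"] for b in range(21)]
triplet = [["school_42", "school_43", "school_44"]]
design = assign_blocks(pairs + triplet)
n_treated = sum(arm == "ASSISTments" for arm in design.values())
print(n_treated, len(design) - n_treated)  # 23 22
```

Whatever the seed, the counts match the 23 treatment and 22 BaU schools reported for the trial, because the number treated is fixed within each block.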


The study measured student achievement on the standardized
TerraNova math test at the end of the second year of
implementation, and estimated a treatment effect of 0.18 ± 0.12
standard deviations.


In this study, we make use of anonymized student, teacher, 
school, and randomization block IDs, and an indicator
variable for whether each student’s school was assigned to
ASSISTments or BaU, along with item-level posttest data:
whether each student answered each posttest item correctly. 
For the purposes of this study, skipped items were consid- 
ered incorrect. The initial evaluation included a number of 
student-level baseline covariates drawn from Maine’s state 
longitudinal data system, including prior state standardized
test scores. We do not currently have access to that data; 
the only covariate available was an indicator of whether each 
student was classified as special education. 


2.3 The TerraNova Test 


The primary outcome of the ASSISTments Maine trial was 
students’ scores on the TerraNova Common Core assessment 
mathematics test, published by Data Recognition Corpora- 
tion CTB. The TerraNova assessment includes 37 items, 32 
of which were multiple choice and 5 of which were open re- 
sponse. Unfortunately, we detected an anomaly in the item- 
level data for the open-response questions, so this report will 
focus only on the 32 multiple choice questions. 


The items are supposed to align with the Common Core
State Standards, but the research team was not given a
document aligning CCSS with the test items. Instead, a
member of the ASSISTments staff with expertise in middle
school education aligned them according to her best
judgment. Table 2 gives this alignment. More information on
specific standards can be found at the CCSS website [16].


Objective                                         Items                             Example
Functions and Graphs                              6, 8, 19, 20, 22, 23, 27, 31, 32  Which of these points is on the graph of [function]
Geometry                                          12, 18, 24, 29                    Find the length of the base of the right triangle shown below
Graphing Linear Equations                         5, 9, 15, 17, 26                  Which of the lines below is the graph of [linear equation]?
Quadratic Equations and Functions                 2, 25, 28, 30                     Which of these shows a correct factorization of [quadratic equation]?
Solving Linear Equations and Linear Inequalities  1, 4, 11, 13, 16                  Solve the following system of equations
Variables, Expressions, Formulas                  3, 7, 10, 14, 21                  Which of these expressions is equivalent to the one below?

Table 1: Objectives required for the 32 items of the Algebra Proficiency Exam, the posttest for the CTA1 Evaluation


3. METHODOLOGY: MULTILEVEL EFFECTS MODELING


In principle, estimating program effects on each posttest item
is straightforward: the same model used to estimate effects 
on student overall scores could be used to estimate effects 
on each item individually (perhaps—but not necessarily— 
adapted for a binary response). However, estimating 32
separate models for each stratum of the CTA1 study, and 32
separate models for the ASSISTments study, ignores the
multilevel structure of the dataset and leads to imprecise
estimates. Moreover, doing so invites problems of multiple
comparisons: between the four strata of the CTA1 study
and the ASSISTments study, there are 160 separate effects
to estimate. If each estimate is subjected to a null
hypothesis test at level $\alpha = 0.05$, even if neither ITS affected
test performance at all, we would still expect to find roughly
eight significant effects.
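
The arithmetic behind this count, and the closely related familywise error rate, can be checked with a short sketch. It assumes the 160 tests are independent, which is a simplification: tests on items answered by the same students would in practice be correlated. The function names are illustrative.

```python
def expected_false_positives(n_tests: int, alpha: float = 0.05) -> float:
    """Expected number of spurious rejections if every null hypothesis is true."""
    return n_tests * alpha


def familywise_error_rate(n_tests: int, alpha: float = 0.05) -> float:
    """Probability of at least one false rejection, assuming independent tests."""
    return 1.0 - (1.0 - alpha) ** n_tests


n_tests = 5 * 32  # four CTA1 strata plus ASSISTments, 32 items each
print(round(expected_false_positives(n_tests), 2))
print(round(familywise_error_rate(n_tests), 4))
```

Under these assumptions the expected count is eight, and at least one spurious rejection is all but certain.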


Instead, we estimated item-specific effects with a multilevel 
logistic regression model [8], based roughly on the
classic “Rasch” model of item response theory [25][20]. That
is, we estimated all item-specific effects for a particular ex- 
periment simultaneously, with one model, in which the item- 
specific effect estimates are random effects. The separate 
effects were modeled as if drawn from a normal distribu- 
tion with a mean and standard deviation estimated from 
the data. This normal distribution can be thought of as 
a Bayesian prior distribution; the fact that its parameters 
are estimated from the data puts us in the realm of empir- 
ical Bayes [5]. This prior distribution acts as a regularizer, 
shrinking the several item-specific effect estimates towards 
their mean [15]. Although doing so incurs a small amount 
of bias, it reduces standard errors considerably while main- 
taining the nominal coverage of confidence intervals [23]. 
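
The shrinkage idea can be illustrated with a deliberately simplified sketch. It uses a normal-normal model with known standard errors and a method-of-moments estimate of the prior variance, rather than the multilevel logistic model fit in this paper, and all numbers are hypothetical.

```python
from statistics import mean, variance


def empirical_bayes_shrink(estimates, std_errors):
    """Shrink noisy estimates toward their mean under a normal-normal model.

    Returns (prior mean, prior variance, shrunken estimates).
    Standard errors are assumed to be strictly positive.
    """
    mu = mean(estimates)
    avg_sampling_var = mean([s ** 2 for s in std_errors])
    # Method of moments: spread of the estimates beyond sampling noise.
    tau2 = max(0.0, variance(estimates) - avg_sampling_var)
    shrunk = []
    for y, s in zip(estimates, std_errors):
        weight = tau2 / (tau2 + s ** 2)  # share of weight kept on the raw estimate
        shrunk.append(mu + weight * (y - mu))
    return mu, tau2, shrunk


raw = [-0.6, -0.2, 0.0, 0.3, 0.7]  # hypothetical item-specific estimates (logits)
ses = [0.2, 0.2, 0.2, 0.2, 0.2]    # hypothetical standard errors
mu, tau2, shrunk = empirical_bayes_shrink(raw, ses)
for y, post in zip(raw, shrunk):
    print(f"raw {y:+.2f} -> shrunk {post:+.3f}")
```

Every shrunken estimate lies between its raw value and the common mean, and the noisier an estimate is relative to the estimated prior variance, the further it is pulled.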


Gelman, Hill, and Yajima [9] argue that estimating a set
of different treatment effects within a multilevel model also 
obviates the need for multiplicity corrections. Generally 
speaking, the reason for spurious significant results is that 
as a group of estimates gets larger, so does the probability 
that one of them will exceed the test’s critical value. In
other words, as the set of estimates grows, so does their
maximum (and their minimum, in magnitude). Multilevel
modeling helps by shrinking the most extreme estimates to- 
wards their common mean. Since extreme values are less 
likely in a multilevel model, so are spuriously significant ef- 
fect estimates. 


A small simulation study in the Appendix (mostly) supports 
Gelman et al.’s argument. As the number of estimated ef- 
fects grows, the familywise error rate (i.e. the probability 
of any type-I error in a group of tests) grows rapidly if ef- 
fects are estimated and tested separately, but not if they are 
estimated simultaneously in a multilevel model. However, 
the error rates for the multilevel model effect estimates are 
slightly elevated—hovering between 0.05 and 0.075 through- 
out. There is good reason to believe that a fully Bayesian 
approach will improve these further (see, e.g., [21], p. 425). 


3.1 The Model for the CTA1 Posttest 

For the CTA1 RCT, we estimated a separate model for high 
school and middle school, but we combined outcome data 
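across the two years. Let $Y_{ij} = 1$ if student $i$ answered item
$j$ correctly, and let $\pi_{ij} = \Pr(Y_{ij} = 1)$. Then the multilevel
logistic model was:

$$
\begin{aligned}
\operatorname{logit}(\pi_{ij}) ={} & \beta_0 + \beta_1 \mathrm{Year2}_i + \beta_2 \mathrm{Trt}_i + \beta_3 \mathrm{Pretest}_i \\
& + \beta_4 \mathrm{Year2}_i \mathrm{Trt}_i + \beta_5 \mathrm{Year2}_i \mathrm{Pretest}_i \\
& + \gamma_{j0} + \gamma_{j1} \mathrm{Trt}_i + \gamma_{j2} \mathrm{Year2}_i + \gamma_{j3} \mathrm{Year2}_i \mathrm{Trt}_i \\
& + \delta_i + \eta_{\mathrm{cls}[i]} + \epsilon_{\mathrm{sch}[i]}
\end{aligned}
\tag{1}
$$

where $\mathrm{Year2}_i = 1$ if student $i$ was in the 2nd year of the
study and 0 otherwise, $\mathrm{Trt}_i = 1$ if student $i$ was in a school
assigned to treatment, and $\mathrm{Pretest}_i$ is $i$’s pretest score. The
coefficients $\beta_0$ through $\beta_5$ are “fixed effects,” that is, they are not
given any probability model. The coefficients $\gamma_{j0}$ through $\gamma_{j3}$ vary
with posttest item $j$, and are modeled jointly as multivariate normal:
$\gamma \sim \mathrm{MVN}(0, \Sigma)$, where $\Sigma$ is a $4 \times 4$ covariance matrix
for the $\gamma$ terms. Similarly, the random intercepts $\delta_i$, $\eta_{\mathrm{cls}[i]}$,
and $\epsilon_{\mathrm{sch}[i]}$, which vary at the student, classroom, and school
level, are each modeled as univariate normal with mean 0
and a standard deviation estimated from the data.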
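
As a check on how the terms of model (1) combine, the sketch below evaluates the linear predictor for a treated and an untreated student and takes the difference, which isolates the item-specific treatment effect in each year. The parameter values are hypothetical and chosen only for illustration; they are not estimates from the study.

```python
def linear_predictor(beta, gamma_j, year2, trt, pretest, intercepts=0.0):
    """Logit of a correct answer under model (1) for one student-item pair.

    beta holds beta_0..beta_5, gamma_j holds gamma_j0..gamma_j3, and
    intercepts is the sum of the student, classroom, and school intercepts.
    """
    fixed = (beta[0] + beta[1] * year2 + beta[2] * trt + beta[3] * pretest
             + beta[4] * year2 * trt + beta[5] * year2 * pretest)
    item = (gamma_j[0] + gamma_j[1] * trt + gamma_j[2] * year2
            + gamma_j[3] * year2 * trt)
    return fixed + item + intercepts


# Hypothetical parameter values.
beta = [-0.2, 0.1, 0.05, 0.8, 0.15, 0.0]
gamma_j = [0.3, -0.25, 0.0, 0.4]


def item_effect(year2: int) -> float:
    """Treated-minus-control difference on the logit scale for item j."""
    return (linear_predictor(beta, gamma_j, year2, 1, 0.0)
            - linear_predictor(beta, gamma_j, year2, 0, 0.0))


print(item_effect(0))  # equals beta_2 + gamma_j1
print(item_effect(1))  # equals beta_2 + beta_4 + gamma_j1 + gamma_j3
```

The pretest score and the random intercepts cancel in the difference, which is why the item-specific effect depends only on the treatment terms.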


CCSS                                                        Items
Expressions and Equations                                   17, 28
Functions (8G)                                              26, 27
Geometry                                                    12, 16, 19, 21, 23, 31
Make sense of problems and persevere in solving them (MP)   13
Ratios and Proportional Relationships                       22, 24, 25, 29
Reason abstractly and quantitatively (MP)                   15, 20
Statistics and Probability                                  10, 11, 32
The Number System                                           1, 2, 3, 4, 5, 6, 7, 8, 9, 14, 18

Table 2: Common Core State Standards (CCSS) for the 32 multiple choice TerraNova items, as identified by the ASSISTments
team. Standards are from grade 7 except where indicated: grade 8 (8G) or Mathematical Practice (MP).


Collecting like terms in model (1), note that for a student
in the first year of the study, the effect of assignment to
the CTA1 condition on item $j$ is $\beta_2 + \gamma_{j1}$ on the logit scale; in
other words, the effects of assignment to CTA1 in year 1 are
modeled as normal with a mean of $\beta_2$ and a variance of $\Sigma_{22}$.
The variance $\Sigma_{22}$ estimates the extent to which the effect of
assignment to the CTA1 condition varies from one problem
to another. If the effect were the same for every posttest
problem, we would have $\Sigma_{22} = 0$. For students in year 2, the
effect on problem $j$ is $\beta_2 + \beta_4 + \gamma_{j1} + \gamma_{j3}$ on the logit scale:
the effects are normally distributed with a mean of $\beta_2 + \beta_4$
and a variance of $\Sigma_{22} + \Sigma_{44} + 2\Sigma_{24}$. The $\Sigma$ matrix also
determines the covariance between the effects of the intervention
on items in year 1 and the effects on the same items in year
2, as

$$
\operatorname{Cov}(\gamma_{j1}, \gamma_{j1} + \gamma_{j3}) = \operatorname{Var}(\gamma_{j1}) + \operatorname{Cov}(\gamma_{j1}, \gamma_{j3}) = \Sigma_{22} + \Sigma_{24}
$$
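
These variance and covariance expressions are easy to evaluate once a covariance matrix is in hand. The sketch below does so for a hypothetical matrix (the values are not estimates from the study) and also reports the implied correlation between year-1 and year-2 item effects.

```python
import math


def effect_moments(sigma):
    """Variances, covariance, and correlation of year-1 and year-2 item effects.

    sigma is the 4 x 4 covariance matrix of (gamma_j0, gamma_j1, gamma_j2,
    gamma_j3). Python indexes from 0, so Sigma_22 is sigma[1][1], Sigma_44 is
    sigma[3][3], and Sigma_24 is sigma[1][3].
    """
    var_y1 = sigma[1][1]
    var_y2 = sigma[1][1] + sigma[3][3] + 2 * sigma[1][3]
    cov = sigma[1][1] + sigma[1][3]
    corr = cov / math.sqrt(var_y1 * var_y2)
    return var_y1, var_y2, cov, corr


# Hypothetical covariance matrix for the four item-level random effects.
sigma = [
    [0.50, 0.00, 0.00, 0.00],
    [0.00, 0.09, 0.00, -0.05],
    [0.00, 0.00, 0.20, 0.00],
    [0.00, -0.05, 0.00, 0.10],
]
var_y1, var_y2, cov, corr = effect_moments(sigma)
print(round(var_y1, 3), round(var_y2, 3), round(cov, 3), round(corr, 3))
# 0.09 0.09 0.04 0.444
```

A negative $\Sigma_{24}$, as in this example, means items with larger year-1 effects tend to gain less between years, which lowers the cross-year correlation.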


Likelihood ratio tests using the $\chi^2$ distribution can test the
null hypothesis that the variance of treatment effects is 0.
For simplicity, we did so using separate models for the two 
years, rather than the combined model (1). 


The treatment effects themselves are estimated using the 
BLUPs (best linear unbiased predictors) for the random
effects $\gamma$. In many contexts, random effects are considered
nuisance parameters, and primary interest is in the fixed
(unmodeled) effects $\beta$. However, there is a long tradition,
mostly in the Bayesian and empirical Bayes literature, of
using BLUPs for estimation of quantities of interest. The
models were fit in R [19] using the lme4 package [3], which
provides empirical Bayesian estimates of the conditional (or 
posterior) variance of the BLUPs, which we use (in combi- 
nation with the estimated standard errors for fixed effects) 
in constructing confidence intervals for item-specific effects. 
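
One simple way to combine the two sources of uncertainty is sketched below. It adds the squared standard error of the fixed effect to the conditional variance of the BLUP, which treats the two as uncorrelated; that is an assumption made here for illustration, and the exact combination rule used for the intervals in this paper is not spelled out. All input values are hypothetical.

```python
import math


def item_effect_interval(beta_trt, se_beta, blup, cond_var, z=1.96):
    """Approximate 95% interval for an item-specific effect on the logit scale.

    beta_trt, se_beta: fixed treatment effect and its standard error.
    blup, cond_var: the item's random-effect prediction and conditional variance.
    Assumes the fixed effect and the BLUP are uncorrelated.
    """
    estimate = beta_trt + blup
    se = math.sqrt(se_beta ** 2 + cond_var)
    return estimate, estimate - z * se, estimate + z * se


# Hypothetical values for a single item.
est, lower, upper = item_effect_interval(beta_trt=0.03, se_beta=0.10,
                                         blup=0.25, cond_var=0.0144)
print(f"{est:.2f} [{lower:.2f}, {upper:.2f}]")
```

If the fixed effect and the BLUP are in fact correlated, a covariance term would need to be added under the square root.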


3.2 The Model for the ASSISTments Posttest 
The model for estimating item-specific effects of ASSISTments
on TerraNova items was highly similar to model (1).
There were three important differences: first, there was only 
one year of data. Second, we did not have access to pretest 
scores, but we did include an indicator for special education 
status as a covariate. Lastly, the hierarchical variance struc- 
ture for student errors was somewhat different—we included 
an error term for teacher instead of classroom, and included 
random intercepts for randomization block.¹


¹ In linear models it is typically recommended to include
fixed effects for randomization block [4]. In logistic
regression, including a large number of fixed effects violates
the assumptions underlying the asymptotics [1]. We tried it both
ways and found that it made little difference.


All in all, the model was:

$$
\begin{aligned}
\operatorname{logit}(\pi_{ij}) ={} & \beta_0 + \beta_1 \mathrm{Trt}_i + \beta_2 \mathrm{SpEd}_i \\
& + \gamma_{j0} + \gamma_{j1} \mathrm{Trt}_i \\
& + \delta_i + \eta_{\mathrm{tch}[i]} + \epsilon_{\mathrm{sch}[i]} + \zeta_{\mathrm{pair}[i]}
\end{aligned}
\tag{2}
$$

where $\mathrm{SpEd}_i = 1$ if student $i$ is classified as needing special
education, $\eta_{\mathrm{tch}[i]}$ is a random intercept for $i$’s teacher, and
$\zeta_{\mathrm{pair}[i]}$ is a random intercept for $i$’s school’s randomization
block. The rest of the parameters and variables are defined
the same as in (1). The treatment effect on problem $j$ is
modeled as $\beta_1 + \gamma_{j1}$ for multiple choice items. The random
effects are modeled as $\gamma \sim \mathrm{N}(0, \Sigma)$, where $\Sigma$ is a $2 \times 2$ covariance matrix.


4. MAIN RESULTS: ON WHICH ITEMS DID ITSS BOOST PERFORMANCE?
4.1 CTA1 


Figure 1 gives the results from model (1) fit to the middle 
school and to the high school sample. Each point on the plot 
represents the estimated effect of assignment to the CTA1
condition on the log odds of a correct answer on one posttest 
item. The estimates are accompanied by approximate 95% 
confidence intervals. 


It is immediately clear that the effects of assignment to CTA1
vary between posttest items; indeed, the $\chi^2$ likelihood ratio
test rejects the null hypothesis of no treatment effect
variance with p < 0.001 in all four strata.


In the middle school sample, the average treatment effect 
across items was close to 0 for both years (-0.08 in year 1 
and 0.03 in year 2 on the logit scale), and not statistically 
significant. However, the standard deviation of treatment ef- 
fects between problems was much higher—0.31 in year 1 and 
0.29 in year 2, implying that assignment to CTA1 boosted 
performance on some problems and hurt performance on 
others. To interpret the standard deviation of effects on the 
probability scale, consider that for a marginal student, with 
a 1/2 probability of answering an item correctly, a difference 
of 0.3 between two treatment effects would correspond to a 
difference in the probability of a correct answer of about 
7.5 percentage points (using the “divide by 4 rule” of [8] p. 82). The
effects are also moderately correlated across the two years,
with $\rho \approx 0.4$: items that CTA1 impacted in year 1 were
somewhat likely to be similarly impacted in year 2.
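
The divide-by-4 approximation used above can be compared with the exact change in probability. The sketch centers a logit difference of 0.3 on a student with a 1/2 probability of a correct answer; the function name and the choice to center the gap symmetrically are illustrative.

```python
import math


def inv_logit(x: float) -> float:
    """Map a value on the logit scale to a probability."""
    return 1.0 / (1.0 + math.exp(-x))


gap = 0.3                                          # difference between two effects, in logits
exact = inv_logit(gap / 2) - inv_logit(-gap / 2)   # centered at probability 1/2
approx = gap / 4                                   # divide-by-4 rule
print(round(exact, 4), round(approx, 4))
```

The rule gives an upper bound, because the logistic curve is steepest at probability 1/2; for students far from 1/2 the same logit difference corresponds to a smaller change in probability.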


Many of the treatment effects in the upper pane of Figure 1
are estimated with too much noise to draw strong conclusions,
because the sample size was substantially smaller in
the middle school stratum than in the high school stratum.
However, some effects are discernible: in year 1, effects were
negative, and on the order of roughly 0.4 on the logit scale
(0.1 on the probability scale for a marginal student) on items
1, 2, 9, 10, 12, 19, 22, and 25, and on the order of
approximately 0.7 for item 17 (which asks students to match a linear
equation to its graph). There were similarly sized positive
effects on items 27, 30, and 32. In year 2 there were fewer
clearly negative effects (on items 1 and 7) and more positive
effects, such as on items 16, 18, 22, 29, and 32. There is a
striking difference between the year 1 and year 2 effects on item
22, which asks students to match a quadratic expression to
its graph: the effect was quite negative in year 1 and quite
positive in year 2.


Figure 1: Estimated treatment effects of CTA1 for each level (high school or middle school), implementation year, and posttest
item, with approximate 95% confidence intervals


In the high school sample, the average treatment effect across 
items was roughly -0.1 in year 1 and 0.13 in year 2, on the 
logit scale, neither statistically significant, though the
difference between the average effect in the two years was
significant (p < 0.001). The effects varied across items, though less
widely in high school than in middle school: in both years
the standard deviation of item-specific effects was roughly
0.17. Item-specific effects were more highly correlated across
years ($\rho \approx 0.69$); at some points in the lower pane of
Figure 1 it appears as though the curve from year 2 was simply
shifted up from year 1.


The item-specific effects in the high school sample were esti- 
mated with substantially more precision than in the middle 
school sample, due to a larger sample size. In year 1, there
were striking negative effects on items 2, 14, and 25, which
ask students to manipulate algebraic expressions, and on
item 12, which asks students to calculate the length of the
side of a triangle. In year 2, these negative effects
disappeared. Instead, there were positive effects, especially on
items 8 and 22, which both ask about graphs of algebraic
functions, and on a stretch of items from 15 to 22. The
difference in the estimated effects between years was positive for
all items and highest for problems 2, 20, and 25, which ask 
students to manipulate or interpret algebraic expressions, 
and 12, the triangle problem. In items 2, 12, and 25, the 
effect was significantly negative in year 1 and closer to zero 
in year 2, while for item 20 the effect was close to zero in 
year 1 and positive in year 2. 


Figure 2 plots the estimated effect on each posttest item as 
a function of the item’s objective in Table 1. Some patterns 
are notable. There was a wide variance in the effects on 
the four geometry problems for middle schoolers in year 1, 
but in year 2 all the effects on geometry items were posi- 
tive and roughly the same size. The geometry items in the 
high school sample follow a similar, if less extreme, pattern. 
Across both middle and high school, the largest positive ef- 
fects were for Functions and Graphs problems, especially 
item 22 for year 2; on items 23, 27, 31 and 32, middle 
schoolers—especially in year 2—saw positive effects while 
high schoolers saw effects near 0. 


Figure 2: Estimated treatment effects of CTA1 posttest items arranged by the group of skills each item is designed to test. See
Table 1 for more detail.


4.2 ASSISTments

Figure 3: Estimated treatment effects of ASSISTments for each multiple choice posttest item, with approximate 95% confidence
intervals

Figure 3 gives the results from model (2), plotting
item-specific effect estimates with approximate 95% confidence
intervals for each multiple choice TerraNova posttest item. 
The model estimated an average effect of 0.33, with a stan- 
dard error of 0.23, for multiple choice problems. The stan- 
dard deviation of item-specific effects was positive (p<0.001) 
but less than for the CTA1 items: it was estimated as 0.16 
on the logit scale. The confidence intervals in Figure 3 are 
also much wider than those for CTA1; we suspect that a 
large part of the reason is that we did not have access to 
pretest scores, an important covariate. 


The largest effects were on multiple choice items 28
and 17, which both required students to plug in values for
variables in algebraic expressions. The confidence intervals 
around the effects for items 26 and 32 also exclude 0. 


Figure 4 plots item-specific effects for multiple choice Ter- 
raNova items grouped according to their CCSS, as in Ta- 
ble 2, with the non-grade-7 standards grouped together as 
“Other.” Interestingly, the largest effects tended to be for 
items in this “Other” category—as did the smallest effect, 
for item 13. Effects for problems in the “Number System” 
and “Ratios and Proportional Relationships” categories had 
the most consistent effects, between 0.2 and 0.4 on the logit 
scale. 


5. EXPLORING HYPOTHESES ABOUT WHY ASSISTMENTS EFFECTS DIFFERED
Researchers on the ASSISTments team have built on the 
CCSS links of Table 2, linking TerraNova posttest items to 
data on student work within ASSISTments, for students in 
the treatment condition. This gives us an opportunity to 
use student work within ASSISTments to explain some of 
the variance in treatment effects. 


Like TerraNova items in Table 2, ASSISTments problems 
are linked with CCSS. By observing which problems treat- 
ment students worked on, and using this linkage, we could 
observe which Common Core standards they worked on the 
most within ASSISTments. We hypothesized that treat- 
ment effects might be largest for the TerraNova problems 
that were linked with the Common Core standards students 
spent the most time working on. In other words, we linked 
TerraNova items with worked ASSISTments problems via 
Common Core standards. The Common Core linkage we 
used in this segment was finer-grained than Table 2, so Ter- 
raNova items in the same category in Table 2 may not be 
linked with the same problems in this analysis. 


We examined our hypothesis in two ways: examining the 
relationships between treatment effects and the number of 
related ASSISTments problems students in the treatment 
group worked, and the number of related ASSISTments prob- 
lems students in the treatment group worked correctly. This 
analysis includes two important caveats: first, the linkages, 
both between TerraNova items and CCSS, and between AS- 
SISTments problems and CCSS, were subjective and error- 
prone, possibly undermining the linkage between TerraNova 
items and ASSISTments problems. Secondly, student work 
in ASSISTments is necessarily a post-treatment variable—it 
was affected by treatment assignment. If the treatment ran- 
domization had fallen out differently, different schools would 


have been assigned to the ASSISTments condition and dif- 
ferent ASSISTments problems would have been worked. In- 
cluding the number of worked or correct related problems 
as a predictor in a causal model risks undermining causal 
interpretations [14]. 


Figures 5 and 6 plot estimated item-specific effects for mul- 
tiple choice TerraNova items against the number of ASSIST- 
ments problems that students in the treatment arm worked 
or worked correctly, respectively, over the course of the RCT. 
The X-axis is on the square-root scale, and a loess curve is 
added for interpretation. Little, if any, relationship is appar- 
ent in either figure, suggesting either the lack of a relation- 
ship between specific ASSISTments work and posttest items, 
or issues with the linkage. This is hardly surprising, given 
both the difficulty in linking ASSISTments and TerraNova 
problems, and given the fact that topics in mathematics are 
inherently connected, so that improving one skill tends to 
improve others as well. 


6. CONCLUSIONS 


Education researchers are increasingly interested in “what 
works.” However, the effectiveness of an intervention is 
necessarily multifaceted and complex—effects differ between 
students, as a function of implementation [24], and, poten- 
tially, as a function of time and location. In this paper we 
explored a different sort of treatment effect heterogeneity— 
differences in effectiveness for different outcomes—specifically, 
different posttest items measuring different skills. Collaps- 
ing item-level posttest data into a single test score has the 
advantage of simplicity (which is nothing to scoff at, espe- 
cially in complex causal scenarios) but at a cost. Analysis 
using only summary test scores squanders a potentially rich 
source of variability and information about intervention ef- 
fectiveness that is already at our fingertips. There is little 
reason not to examine item-specific effects. 


In this paper, we showed how to estimate item-specific effects
using a Bayesian or empirical Bayesian multilevel modeling 
approach that, we argued, can improve estimation precision 
and avoid the need for multiplicity corrections. The esti- 
mates we provided here combine maximum likelihood esti- 
mation and empirical Bayesian inference; there is good rea- 
son to suppose that a fully Bayesian approach would provide 
greater validity, especially in standard error estimation and 
inference. However, fitting complex multilevel models using 
Markov Chain Monte Carlo methods is computationally ex- 
pensive, and can be very slow, even with the latest software. 
We hope to explore this option more fully in future work. 


While estimating item-specific effects is relatively straight- 
forward, interpreting them presents a significant challenge. 
This is due to a number of factors: first, when looking for 
trends in treatment effects by problem attributes, the sam- 
ple size is the number of exam items, not the number of 
students, so patterns can be hard to observe and verify. 
Secondly, there is a good deal of ambiguity and subjectiv- 
ity involved in defining and determining item attributes and 
features, which is exacerbated by the fact that standardized 
tests generally cannot be made publicly available. Lastly, 
since student ITS work over the course of a study is nec- 
essarily post-treatment assignment, careful causal modeling 
(such as principal stratification [24]) may be necessary.
Examining heterogeneity between item-specific treatment
effects may play a larger role in helping to generate hypotheses
about ITS effectiveness than in confirming hypotheses.


Figure 4: Estimated treatment effects of ASSISTments for each multiple choice posttest item, arranged according to CCSS, as
in Table 2. The “Other” category includes Functions and the two Mathematical Practice standards, “make sense of problems
and persevere in solving them” and “reason abstractly and quantitatively”.

Figure 5: Estimated effects on multiple-choice TerraNova items plotted against the number of related ASSISTments problems
that students in the treatment arm worked over the course of the study. The X-axis is plotted on the square-root scale, and a
non-parametric loess fit is added for interpretation.

Figure 6: Estimated effects on multiple-choice TerraNova items plotted against the number of related ASSISTments problems
that students in the treatment arm worked correctly over the course of the study. The X-axis is plotted on the square-root scale,
and a non-parametric loess fit is added for interpretation.


Despite those difficulties, the analysis here uncovered impor- 
tant information about the CTA1 and ASSISTments effects. 
First, the discovery that the effects vary between items is 
notable in itself. In our analysis of CTA1 we noticed that
some of the largest effects, and differences between first- and
second-year effects, were for posttest items involving manipulating
algebraic expressions and interpreting graphs. In
our analysis of ASSISTments, we discovered a large differ- 
ence between negative effects on open-ended questions and 
positive effects on multiple choice questions, and also that 
the largest effects were on problems requiring students to 
plug numbers into algebraic expressions. 


We hope that this research will serve as a proof-of-concept 
and spur further work delving deeper into data we already 
have.

APPENDIX
A. A SIMULATION STUDY OF MULTIPLE
COMPARISONS


We ran a small simulation study testing [9]’s assertion that 
multiplicity corrections are unnecessary when estimating dif- 
ferent effects from BLUPs in a multilevel model. [9] stated 
their case in terms of fully Bayesian models, whereas we used 
an empirical Bayesian approach that may differ somewhat.

Figure 7: United we stand: results from a simulation of familywise
error rate using separate t-tests for each experiment
or using multilevel modeling.


In each simulation run, we generated data on Nexpr
experiments, where Nexpr was a parameter we varied.
Each experiment had n = 500 simulated subjects, half
assigned to treatment and half to control. They were
given “outcome” data Y ~ N(0,1), with no treatment effect.


We analyzed the experiment data in two ways. First, we
estimated a p-value for each experiment separately, using
t-tests. This is the conventional approach. Then, we
estimated a multilevel model:


$$Y_{ij} = \beta_0 + \gamma_{1j} + \gamma_{2j}\,\mathrm{Trt}_i + \epsilon_{ij}$$


where $\beta_0$ is an intercept, $\gamma_{1j}$ are random intercepts for experiment,
$\gamma_{2j}$ is the treatment effect for experiment $j$, and $\epsilon_{ij}$ is
a normally-distributed error term. $\gamma \sim MVN(\{0, \gamma_{20}\}, \Sigma)$,
where $\gamma_{20}$ is the average effect across all experiments. The
number of experiments in each simulation run, Nexpr, was 
varied from 5 to 40, in increments of 5. In each case, we 
estimated the familywise error rate, the probability of at 
least one statistically significant effect estimate (at a = 0.05) 
across the Nexpr experiments. 
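
The separate-tests arm of this design can be sketched in a few lines of Python. This is only an illustrative sketch, not the code used for the study: the function names, the use of an unpooled two-sample t statistic, and the normal-approximation cutoff of 1.96 are all assumptions made here, and the multilevel arm is not reproduced. For independent tests with no true effects, the familywise error rate also has the closed form 1 - (1 - α)^Nexpr, which the sketch includes for comparison.

```python
import random

T_CRIT = 1.96  # assumption: normal approximation to the two-sided 5% cutoff


def mean_and_variance(values):
    """Sample mean and unbiased sample variance."""
    m = sum(values) / len(values)
    var = sum((v - m) ** 2 for v in values) / (len(values) - 1)
    return m, var


def t_statistic(treated, control):
    """Two-sample t statistic with unpooled variances."""
    m_t, v_t = mean_and_variance(treated)
    m_c, v_c = mean_and_variance(control)
    return (m_t - m_c) / (v_t / len(treated) + v_c / len(control)) ** 0.5


def any_rejection(n_expr, n, rng):
    """Simulate n_expr null experiments; report whether any test rejects."""
    rejected = False
    for _ in range(n_expr):
        y = [rng.gauss(0.0, 1.0) for _ in range(n)]  # Y ~ N(0, 1), no effect
        if abs(t_statistic(y[: n // 2], y[n // 2:])) > T_CRIT:
            rejected = True
    return rejected


def familywise_error_rate(n_expr, n=500, runs=200, seed=0):
    """Share of simulation runs with at least one significant estimate."""
    rng = random.Random(seed)
    return sum(any_rejection(n_expr, n, rng) for _ in range(runs)) / runs


def independent_fwer(n_expr, alpha=0.05):
    """Closed form for independent tests: 1 - (1 - alpha)^n_expr."""
    return 1.0 - (1.0 - alpha) ** n_expr
```

Under these assumptions, `independent_fwer(5)` is about 0.23 and `independent_fwer(40)` is about 0.87, which is the rapid growth that separate testing is expected to show.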


The results are in Figure 7. As expected, the familywise 
error rate increased rapidly when effects were estimated and 
tested separately in each of the Nexpr experiments. When 
effects were estimated jointly in a multilevel model, in a 
way analogous to the method described in Section 3, the 
familywise error rate remained roughly constant as Nexpr 
increased. However, the familywise error rate in the multi- 
level modeling approach was slightly elevated, ranging from 
roughly 0.05 to 0.075.

Math Operation Embeddings for Open-ended Solution
Analysis and Feedback

ABSTRACT


Feedback on student answers and even during intermediate 
steps in their solutions to open-ended questions is an im- 
portant element in math education. Such feedback can help 
students correct their errors and ultimately lead to improved 
learning outcomes. Most existing approaches for automated 
student solution analysis and feedback require manually con- 
structing cognitive models and anticipating student errors 
for each question. This process requires significant human
effort and does not scale to most questions used in homework
and practice, which do not come with this information.
In this paper, we analyze students’ step-by-step solution pro- 
cesses to equation solving questions in an attempt to scale 
up error diagnostics and feedback mechanisms developed for 
a small number of questions to a much larger number of 
questions. Leveraging a recent math expression encoding 
method, we represent each math operation applied in so- 
lution steps as a transition in the math embedding vector 
space. We use a dataset that contains student solution steps 
in the Cognitive Tutor system to learn implicit and explicit 
representations of math operations. We explore whether 
these representations can i) identify math operations a stu- 
dent intends to perform in each solution step, regardless of 
whether they did it correctly or not, and ii) select the ap- 
propriate feedback type for incorrect steps. Experimental 
results show that our learned math operation representa- 
tions generalize well across different data distributions. 


Keywords 
Embeddings, Feedback, Math expressions, Math operations 


1. INTRODUCTION 


Math education is of crucial importance to a competitive 
future science, technology, engineering, and mathematics 
(STEM) workforce since math knowledge and skills are required
in many STEM subjects [11]. One important way
to help struggling students improve in math is to diagnose
errors from student answers to math questions and deliver 
personalized support to help them correct these errors [1]. 
In short-answer questions, feedback of various types [39] can 
be deployed according to the specific incorrect final answers 
students submit, while in open-ended questions, feedback 
can be deployed at intermediate solution steps according to 
the specific actions they take and their outcomes [22]. In 
traditional educational settings, this feedback process relies 
on teachers going over student work, identifying errors, and 
providing feedback [15], which results in a labor-intensive 
process and a slow feedback cycle for students. Such a set- 
ting is even more limited as a result of the COVID-19 pan- 
demic, which introduced new barriers to face-to-face inter- 
actions between teachers and students. 


In intelligent tutoring systems, a more scalable approach to 
math feedback is to automatically deploy feedback based on 
students’ final answers or certain incorrect intermediate so- 
lution steps. For example, in ASSISTments [12], teachers 
can create hints and feedback messages for specific incorrect 
student answers to short-answer questions that they antici- 
pate [28], which the system can automatically deploy when 
students submit these incorrect answers. This crowdsourc- 
ing approach efficiently scales up teachers’ effort so that they 
can benefit a large number of students without putting in 
additional effort. In many other systems such as Cognitive 
Tutor [34] and Algebra Notepad [27], researchers use cogni- 
tive models to anticipate student errors as results of buggy 
production rules or insufficient knowledge on key math con- 
cepts [20, 24]. They then develop corresponding feedback 
for intermediate solution steps in multi-step questions (e.g., 
those on equation solving). This cognitive model-based approach
requires significant effort by domain experts and has
been shown to be highly effective in large-scale studies.


However, these approaches for student feedback are still limited
in their generalizability to many math questions deployed
in daily homework and practice. For the teacher
crowdsourcing approach, hint and feedback messages have 
to be written for each individual question (or group of ques- 
tions generated from the same template with different nu- 
merical values). For the cognitive model-based approach, a 
rigorous solution process has to be specified for each ques- 
tion with annotations on the math operations that should 
be applied at each solution step. However, questions used 
in many real-world educational settings do not come with 
such information; teachers simply adopt them from sources
such as textbooks and open education resources and assign
them to students without developing corresponding feedback 
mechanisms. Moreover, past research has shown that a large 
portion of incorrect student answers cannot be anticipated 
by cognitive models [43], teachers/domain experts [8], or nu- 
merical simulations [37]. Therefore, it may be hard for high- 
quality feedback developed for questions used in intelligent 
tutoring systems to generalize to questions in the wild. 


1.1 Contributions 

In this paper, we develop data-driven methods that enable 
us to analyze step-by-step solutions to open-ended math 
questions. In contrast to existing methods that rely on a 
top-down approach, i.e., defining the structure of the so- 
lution process and anticipating student errors, we propose 
a bottom-up approach, i.e., using learned representations of 
math expressions and math operations to predict i) math 
operations in student solution steps and ii) the appropriate 
feedback for incorrect solution steps. We restrict ourselves 
to the specific domain of equation solving where the solu- 
tion process consists of applying specific math operations 
between math expressions in consecutive steps; other sub- 
domains of math such as algebra word problems [45] and 
questions involving graphs and geometry [16] are left as fu- 
ture work. Specifically, our contributions are: 


• First, we characterize math operations by how they
transform math expressions in the math embedding 
space in each solution step. We leverage recent work 
on learning math symbol embeddings from large-scale 
scientific formula data [46] to encode math expressions 
in student solutions: each math expression is mapped 
to a point in the math embedding vector space. We use 
synthetically generated data as well as solution steps 
generated by real students to learn the representation 
of each math operation. We explore several meth- 
ods for learning both implicit and explicit math op- 
eration representations: a classification-based method 
that does not explicitly impose a structure on math op- 
erations, a linear model that assumes each operation is 
characterized by an additive vector in the embedding 
space, and a nonlinear model where math operations 
live in their own, interconnected embedding spaces. 


• Second, we apply these math operation representation
learning methods to a real-world student step-by-step
solution dataset collected while students learn equation
solving in an intelligent tutoring system, Cognitive Tu- 
tor [34]. We validate our math operation representa- 
tion learning methods via two tasks: i) predicting the 
specific math operation the student intended to ap- 
ply in a solution step from the math expressions be- 
fore and after the step and ii) predicting the appro- 
priate feedback deployed to students from the incor- 
rect math expressions they enter. Quantitative results 
show that tree embedding-based math expression en- 
coding methods outperform other encoding methods 
since they are able to explicitly capture the seman- 
tic and structural characteristics of math expressions. 
They also have better generalizability across different 
data distributions and remain effective across different 
question difficulty levels and even when student solution
steps contain errors.


[Figure 1 content: a "solve for x" equation question, its solution steps (7x + 9 = 7 - x; 7x + 9 + x = 7; 8x = -2; 8x/8 = -2/8; x = -2/8), and the predicted math operations and feedback: 1. COMBINE_ADD (100%); 2. COMBINE_ADD (60%), ADD_SIDE (40%); 3. DIV_SIDE (100%); 4. COMBINE_MUL (92%), COMBINE_ADD (8%).]

Figure 1. Demonstration of the generalizability of our math 
operation representations to other data sources for a solution 
process provided on Algebra.com. Our methods can success- 
fully predict the math operations applied in each step and 
the appropriate feedback type in an incorrect step. 


1.2 Use Case 


Before diving into the technical details, we first illustrate 
a potential use case for our math operation representation 
learning methods and corresponding operation/feedback 
classifiers. Our goal is to transfer expert designs in intel- 
ligent tutoring systems for math education to questions in 
the wild. Specifically, we apply the math operation rep- 
resentations learned from student solution steps and corre- 
sponding labels (step name, feedback message) in the highly 
structured Cognitive Tutor system to environments that are 
not highly structured. Figure 1 shows the solution process 
to an equation solving question on Algebra.com¹ and the
corresponding math operation and feedback predictions at 
each step. We see that our math operation representation 
learning methods can accurately predict the math opera- 
tions applied in solution steps 1, 3, and 4 using the opera- 
tion names provided in the Cognitive Tutor system. Even 
in step 2 where two different math operations are combined 
into a single step, i.e., 


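$$
\begin{array}{c}
7x + 9 = 7 - x \\
\downarrow \ \text{ADD } x \text{ TO BOTH SIDES} \\
7x + 9 + x = 7 - x + x \\
\downarrow \ \text{COMBINE TERMS ON RIGHT SIDE} \\
7x + 9 + x = 7,
\end{array}
$$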

despite training only on Cognitive Tutor steps that involve
a single math operation, the classifier is able to recognize
both of them, assigning each a high predictive probability. We
also change one of the solution steps, i.e., step 5, to make
it incorrect and test our feedback classifier. In this case,
the classifier is able to recognize the error in this step and
find the corresponding feedback types in Cognitive Tutor.
This potential use case demonstrates the utility of our math
operation representation learning methods: by transferring
knowledge learned in well-designed, highly structured systems
such as Cognitive Tutor, especially on what feedback
to deploy for each student error, to other domains such as
online math Q&A sites, we are scaling up the effort domain
experts put into the design of these feedback mechanisms.


¹The original question and the solution process can be found at
https://www.algebra.com/algebra/homework/equations/Equations.faq.question.4872.html.


2. RELATED WORK 


One related body of work in math education studies
student solution processes to identify student strategies and 
assess errors. Specifically, [33] uses inverse Bayesian plan- 
ning to learn solution strategies (i.e., policies) in equation 
solving and capture student misunderstandings in a Markov 
decision process framework. Our work focuses on a differ- 
ent aspect of the solution process: the representation of the 
math expressions at each solution step and the modeling 
of the transitions between different math expressions under 
math operations. [9] uses basic math operations to con- 
struct programs to understand errors that students make in 
their solutions to arithmetic questions. Our work focuses 
on equation solving, which is a more difficult problem in 
which student responses are more diverse and less
structured than arithmetic calculations. 


Another related body of work focuses on learning representations
of student answers to short-answer questions. [21] analyzes
incorrect student answers across multiple questions,
learns representations of errors, and generalizes misconception
feedback across questions. Our work analyzes the full math
expressions in intermediate solution steps while their work
represents short answers according to the frequency they 
occur in an answer pool. [8] uses trained word embeddings 
to represent short answers for automated grading purposes. 
Our work focuses on learning transitions of math expressions 
across solution steps instead of learning representations of 
only the final answer. 


In domains other than math education, there exist methods 
for automated feedback generation, including programming 
[30, 31, 40] and essays [35]. However, transferring these 
methods to math solutions is not trivial since i) open-ended 
math solutions are less structured than programming code 
and ii) data-driven representations of math symbols have not 
been developed until recently [46] whereas such representa- 
tions have been studied for a long time in natural language 
processing [6, 7, 26]. 


Another body of remotely-related work focuses on using 
computer vision techniques to identify math expressions 
from images for similar math expression retrieval [29], turning
hand-written math expressions into LaTeX [47], and automatically
identifying and correcting student errors [14].
These works often bypass the inherent structure of math 
expressions and directly use an end-to-end model for their 
tasks, which means that they cannot be used to analyze 
student knowledge. Nevertheless, these techniques can be 
used to build large-scale datasets containing hand-written 
student solutions which we can use in the future. 


3. BACKGROUND: EMBEDDING MATH 
EXPRESSIONS INTO VECTOR SPACES 


In this section, we provide an overview of a recent method 
that we developed to embed math expressions into a vec- 
tor space, i.e., a math embedding space. Doing so turns 
discrete, symbolic math expression representations into con- 
tinuous, distributed representations [2], which enables us to 
manipulate math expressions in a manner compatible with 
modern machine learning methodologies. 


Our embedding method is a tree-structured encoder illus- 
trated in Figure 2. The key observation is that any math 
expression has a corresponding symbolic tree-structured rep- 
resentation in the operator tree format. In the operator tree, 
the non-terminal (non-leaf) nodes are math operators, e.g.,
addition and subtraction, and terminal (leaf) nodes are numbers
or variables; see Figure 2 for an illustration. Thus, an
operator tree explicitly captures the semantic and structural 
properties of a math expression. A number of existing works 
have demonstrated the superior performance of using oper- 
ator tree representations of math expressions compared to 
other math expression representations in applications such 
as automatic math word problem solving [32, 48, 51] and 
math formulae retrieval [5, 25, 49, 50]. 


Therefore, we built a math expression encoder that lever- 
ages the operator tree representation of math expressions. 
Specifically, during the encoding process, it first converts a 
math expression into its corresponding tree format, using the 
parser introduced in [5]. It then linearizes the tree via depth-first
search, which enables us to process nodes as a sequence in
which each math symbol is associated with its own trainable 
embedding. Next, it leverages positional encoding, similar 
to [44, 38], to retain the relative position of each node in the 
tree. The output of our encoder is a fixed-dimensional em- 
bedding vector that represents the input math expression, 
which we will use to learn representations of math operations 
for the math operation classification and feedback prediction 
tasks. We pretrained the encoder on a large corpus of math
expressions extracted from Wikipedia and arXiv articles and 
demonstrated superior performance in reconstructing math 
expressions (and scientific formulae) and retrieving similar 
expressions. See the anonymized version of our work at [46]. 
We will refer to the trained encoder as the math expression 
encoding method in what follows. 
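
The first two encoder stages described above (building an operator tree, then linearizing it by depth-first search) can be illustrated with a small sketch. This is not the encoder or the parser from [5]: the `Node` class, the hand-built tree for x = 2x - 4, and the use of (depth, child index) as a stand-in for positional information are all assumptions made for illustration, and the trainable symbol embeddings are omitted.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    """One operator-tree node: an operator, a number, or a variable."""
    symbol: str
    children: list = field(default_factory=list)


def linearize(node, depth=0, position=0):
    """Pre-order depth-first traversal.

    Returns (symbol, depth, child_index) triples; the last two entries are a
    hypothetical stand-in for the positional encoding of each node.
    """
    sequence = [(node.symbol, depth, position)]
    for index, child in enumerate(node.children):
        sequence.extend(linearize(child, depth + 1, index))
    return sequence


# Hand-built operator tree for the equation x = 2x - 4.
tree = Node("=", [
    Node("x"),
    Node("-", [Node("*", [Node("2"), Node("x")]), Node("4")]),
])

symbols = [symbol for symbol, _, _ in linearize(tree)]
# symbols == ["=", "x", "-", "*", "2", "x", "4"]
```

In a full encoder, each symbol in the linearized sequence would be looked up in an embedding table and combined with its positional information before being reduced to a single fixed-dimensional vector.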


4. LEARNING REPRESENTATIONS OF 
MATH OPERATIONS 


In this section, we detail methods we use to learn both im- 
plicit and explicit math operation representations by study- 
ing how they transform math expressions in each solution 
step in the math embedding space. In these methods, we 
leverage the math expression encoding method developed in 
our prior work that we reviewed above to embed math ex- 
pressions into vectors and work with these embedding vec- 
tors. However, since these embeddings are trained on math 
expressions that are very different from those occurring in 
actual student solution steps, we use an additional train- 
able, fully-connected neural network to adapt these embed- 
dings to our dataset, following a popular approach in natural 
language processing [13]. Specifically, we have $\mathbf{e} = g_\eta(\mathbf{m})$,
where $\mathbf{m}$ and $\mathbf{e}$ are the embedding vectors of a math expression
in our dataset before and after the adaptation, respectively, and
$\eta$ denotes the set of parameters in the fully-connected network
that we will learn during the training process.


[Figure 2 panels: math expression, operator tree, linearized tree, math embedding.]

Figure 2. Illustration of the math expression encoding method that we employ in this work.


We define a step in a student’s solution to open-ended math
questions as a tuple $(\mathcal{E}_1, \mathcal{E}_2, z)$, where $z \in Z$ is the math
operation applied in this step, with $Z$ denoting the set of
possible math operations. $\mathcal{E}_1 \in E$ and $\mathcal{E}_2 \in E$ denote the
math expressions involved in this step before and after applying
this math operation, i.e., the step can be expressed as
$\mathcal{E}_1 \rightarrow \mathcal{E}_2$. $E$ denotes the set of all unique math expressions
(across all steps in a dataset). For simplicity, we assume that
only one math operation is applied in each step; an extension
to cases where multiple math operations are applied is trivial and will
be discussed in what follows. $\mathbf{e}_1 \in \mathbb{R}^D$ and $\mathbf{e}_2 \in \mathbb{R}^D$ are the
fine-tuned embedding vectors that correspond to math expressions
$\mathcal{E}_1$ and $\mathcal{E}_2$, respectively, where $D$ is the dimension
of the embedding.


4.1 Math Operation Classification
The first task we will study in this paper is to classify the
math operation applied in a solution step given the math
expression embeddings before and after applying it, $\mathbf{e}_1$ and
$\mathbf{e}_2$. The same notations and approaches also apply to our
second task, feedback classification. This task can simply be
solved using a supervised learning method, e.g., a regression
model where the predicted probability of a math
operation $z$ is given by

$$p(\hat{z} = z) = \mathrm{softmax}(\mathbf{v}_z [\mathbf{e}_1, \mathbf{e}_2]^T),$$

where softmax(·) is the softmax function for multi-class
classification [10]. $\mathbf{v}_z$ is a parameter vector associated with each
math operation $z$, which is used to compute an inner product
with the concatenation of $\mathbf{e}_1$ and $\mathbf{e}_2$ before being fed into the
softmax function. On a training dataset with given tuples
$(\mathbf{e}_1, \mathbf{e}_2, z)$, we can learn the parameters $\{\mathbf{v}_z\}$ by minimizing
the cross-entropy loss [10] between the predicted math
operation $\hat{z}$ and the actual math operation. This approach can
be seen as learning implicit representations of math operations
since they are captured by the classifier parameters.
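
A minimal sketch of this classifier, written in plain Python for illustration: each operation has one parameter vector, its score is the inner product with the concatenated before/after embeddings, and a softmax turns the scores into probabilities. The function names and the toy two-dimensional embeddings are assumptions, not the trained model, and parameter learning is left out.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of scores."""
    shift = max(scores)
    exps = [math.exp(s - shift) for s in scores]
    total = sum(exps)
    return [value / total for value in exps]


def operation_probabilities(e1, e2, weights):
    """weights maps an operation name to its parameter vector of length 2D."""
    features = list(e1) + list(e2)  # concatenation of the two embeddings
    names = sorted(weights)
    scores = [sum(w * f for w, f in zip(weights[z], features)) for z in names]
    return dict(zip(names, softmax(scores)))


def cross_entropy(probabilities, true_operation):
    """Per-step training loss: negative log-probability of the true label."""
    return -math.log(probabilities[true_operation])


# Toy example with D = 2 and two candidate operations (hypothetical values).
toy_weights = {"COMBINE_ADD": [1.0, 0.0, 0.0, 0.0], "DIV_SIDE": [0.0, 0.0, 0.0, 0.0]}
toy_probs = operation_probabilities([1.0, 0.0], [0.0, 0.0], toy_weights)
```

Training would adjust the parameter vectors to reduce the summed `cross_entropy` over all labeled steps, typically with a gradient-based optimizer.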


4.2 Learning Math Operation Representations

The classification approach we detailed above can help us 
classify the math operation applied in a solution step but 
falls short on learning explicit representations of math oper- 
ations. The latter is important, however, to help us under- 
stand students’ math solution processes and diagnose their 
errors. We now detail a series of methods for us to learn 
explicit representations of math operations. 


4.2.1 Translating embeddings 


[Figure 3 panels: TransE (shared embedding space, eSpace) and TransR (separate operation space, zSpace, reached through the projection ReLU(M_z e)). The encoder maps an input pair (E_1, E_2, z) to math expression embeddings and a math operation embedding. Example: (3x + 2x = 6, 5x = 6, COMBINE_ADD).]


Figure 3. Illustration of the TransE and TransR frameworks.
TransE puts the embeddings of equations $\mathbf{e}_1$, $\mathbf{e}_2$, and math
operation $z$ in the same embedding space, whereas TransR
puts them in their own embedding spaces.


We will leverage the translating embedding (TransE) framework
[3] that has found success in embedding entities
and characterizing relationships between entities in multi-relational
data. Our key assumption in this framework
is that math operations are linear and additive, i.e., the
embeddings of the math expressions before and after a math
operation satisfy


$$\mathbf{e}_2 \approx \mathbf{e}_1 + \mathbf{h}_z,$$


where $\mathbf{h}_z \in \mathbb{R}^D$ is the embedding of the math operation $z$. In
other words, we assume that the effect of a math operation
is characterized by the difference in the embedding vectors
between the math expressions before and after it in a single
step; adding it to the embedded vector of $\mathcal{E}_1$ results in the
embedded vector of $\mathcal{E}_2$ after the step.


To learn these math operation embeddings from data, we
use two loss functions. The first loss function promotes this
linear and additive relationship between embeddings of the
math expressions and operations on the training data. To
this end, we define a distance function as
$d(\mathbf{e}_1, \mathbf{e}_2, \mathbf{h}_z) = \|\mathbf{e}_1 + \mathbf{h}_z - \mathbf{e}_2\|_2^2$
and define the loss function as


$$L_1 = \sum_{(\mathcal{E}_1, \mathcal{E}_2, z)} d(\mathbf{e}_1, \mathbf{e}_2, \mathbf{h}_z).$$

The second loss function pushes counterfeit step tuples, which
are generated by replacing elements in an observed step tuple
with other ones in the dataset, to not satisfy the aforementioned
linear and additive relationship. To this end, we
minimize the pairwise marginal distance ranking-based loss
given by


$$L_2 = \sum_{(\mathcal{E}_1, \mathcal{E}_2, z) \in S} \; \sum_{(\mathcal{E}_1', \mathcal{E}_2', z') \in S'_{(\mathcal{E}_1, \mathcal{E}_2, z)}} [\gamma + d(\mathbf{e}_1, \mathbf{e}_2, \mathbf{h}_z) - d(\mathbf{e}_1', \mathbf{e}_2', \mathbf{h}_{z'})]_+,$$


where $[x]_+ = x$ when $x > 0$ and 0 otherwise, and $\gamma > 0$
is a hyper-parameter that controls the margin of the distance
ranking. $S$ denotes the set of steps in the dataset and
$S'_{(\mathcal{E}_1, \mathcal{E}_2, z)}$ is a set of counterfeit steps that are perturbed versions
of the actual step $(\mathcal{E}_1, \mathcal{E}_2, z)$, generated by randomly
replacing one of the triplet elements in the step by a different
math expression or math operation from another step,
i.e.,


$$S'_{(\mathcal{E}_1, \mathcal{E}_2, z)} \sim A \cup B \cup C,$$

where

$$A = \{(\mathcal{E}_1', \mathcal{E}_2, z) : \mathcal{E}_1' \neq \mathcal{E}_1, \ \mathcal{E}_1' \in E\},$$

$$B = \{(\mathcal{E}_1, \mathcal{E}_2', z) : \mathcal{E}_2' \neq \mathcal{E}_2, \ \mathcal{E}_2' \in E\},$$

$$C = \{(\mathcal{E}_1, \mathcal{E}_2, z') : z' \neq z, \ z' \in Z\}.$$

Intuitively speaking, our objective encourages the distance 
function calculated on an actual tuple in the dataset to be 
smaller than that calculated on a perturbed version of it. 
Figure 3 illustrates the whole process. 
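
The counterfeit-step construction and the margin-based ranking term can be sketched as follows. This is an illustrative sketch under stated assumptions, not the training code: expressions and operations are represented by string keys, `expr_vec` and `op_vec` are hypothetical lookup tables holding their embeddings, and at least two expressions and two operations are assumed to exist so that a replacement is always available.

```python
import random


def sq_dist(e1, e2, h):
    """Squared Euclidean norm of e1 + h - e2."""
    return sum((a + c - b) ** 2 for a, b, c in zip(e1, e2, h))


def corrupt(step, expressions, operations, rng):
    """Replace exactly one element of (expr_before, expr_after, op)."""
    before, after, op = step
    slot = rng.choice(("before", "after", "operation"))
    if slot == "before":
        before = rng.choice([x for x in expressions if x != before])
    elif slot == "after":
        after = rng.choice([x for x in expressions if x != after])
    else:
        op = rng.choice([z for z in operations if z != op])
    return (before, after, op)


def margin_loss(step, fake, expr_vec, op_vec, margin=1.0):
    """Hinge term: the real step should be closer than the fake by `margin`."""
    d_real = sq_dist(expr_vec[step[0]], expr_vec[step[1]], op_vec[step[2]])
    d_fake = sq_dist(expr_vec[fake[0]], expr_vec[fake[1]], op_vec[fake[2]])
    return max(0.0, margin + d_real - d_fake)


# Toy embeddings (hypothetical values chosen so the real step fits exactly).
expr_vec = {"3x+2x=6": [0.0, 0.0], "5x=6": [1.0, 0.0], "x=6/5": [1.0, 1.0]}
op_vec = {"COMBINE_ADD": [1.0, 0.0], "DIV_SIDE": [0.0, 1.0]}
real = ("3x+2x=6", "5x=6", "COMBINE_ADD")
fake = corrupt(real, list(expr_vec), list(op_vec), random.Random(0))
```

With these toy values the real step has distance 0, so the hinge term is zero whenever the counterfeit step is at least `margin` away and positive otherwise.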


The final loss function that we minimize is simply the combination
of these two loss functions, $L = L_1 + L_2$. Using the
learned embeddings of each math operation, we can classify
them from the math expressions $\mathcal{E}_1$ and $\mathcal{E}_2$ using the nearest
neighbor classifier, i.e., $\hat{z} = \arg\min_z d(\mathbf{e}_1, \mathbf{e}_2, \mathbf{h}_z)$.
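
The nearest-neighbor rule amounts to picking the operation whose embedding best explains the move from the first expression embedding to the second. A minimal sketch, with hypothetical function names and toy two-dimensional embeddings standing in for learned ones:

```python
def transe_distance(e1, e2, h):
    """Squared Euclidean norm of e1 + h - e2."""
    return sum((a + c - b) ** 2 for a, b, c in zip(e1, e2, h))


def classify_operation(e1, e2, op_vec):
    """Return the operation whose embedding gives the smallest distance."""
    return min(op_vec, key=lambda z: transe_distance(e1, e2, op_vec[z]))


# Toy operation embeddings (hypothetical values).
op_vec = {"COMBINE_ADD": [1.0, 0.0], "DIV_SIDE": [0.0, 1.0]}
predicted = classify_operation([0.0, 0.0], [1.0, 0.0], op_vec)
# predicted == "COMBINE_ADD", since [0, 0] + [1, 0] lands exactly on [1, 0]
```

Because the rule only compares distances, the per-operation distances can also be kept and normalized if a ranked list of candidate operations is wanted rather than a single label.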


4.2.2 Learning Entity and Relation Embeddings
Despite potentially exhibiting excellent interpretability,
TransE’s assumption that math operations are linear and
additive in the math expression embedding space may be
too restrictive. This assumption treats math operations as
vectors in the same latent space where similar math expressions
will be close to each other. However, different math
operations are fundamentally different and can transform
the same math expression into dramatically different math
expressions that are far apart in the embedding space. For
example, different math operations can focus on transforming
different parts of the same math expression. The steps
($3 + 5 + 2x = x + 1$, $8 + 2x = x + 1$, combine similar terms)
and ($3 + 5 + 2x = x + 1$, $3 + 5 + 2x - x = x + 1 - x$,
subtract from each side) have the same starting math expression
$\mathcal{E}_1$. In the first step, only similar terms on the left
hand side of the equation are combined, regardless of the
other side of the equation. In the second step, we subtract
$x$ from both sides of the equation, which is a consequence
of the equality symbol in the equation: what matters is that
the same term is subtracted from both sides, not what exactly
is on each side. Therefore, TransE’s linear
and additive assumption means that the resulting $\mathcal{E}_2$ in
these steps will be very different due to the different math
operations applied, which conflicts with the observation that
they are very similar. To address this limitation, we explore
the Learning Entity and Relation Embeddings (TransR) [23]
model, which models math expressions and math operations
in different spaces, i.e., there will be a shared embedding
space for all math expressions but separate relation spaces
for different math operations.


TransR learns the embeddings of math operations by projecting
math expressions to their corresponding relation spaces and then
learning translations between those projected expressions.
For each math operation $z$, we set a projection matrix
$\mathbf{M}_z \in \mathbb{R}^{D \times D}$ that projects a math expression to its relation
space. To make this projection nonlinear, we apply the
rectified linear unit (ReLU) activation function [10] to it and
define the corresponding distance function as


$$d_z(\mathbf{e}_1, \mathbf{e}_2, \mathbf{h}_z) = \|\mathrm{ReLU}(\mathbf{M}_z \mathbf{e}_1) + \mathbf{h}_z - \mathrm{ReLU}(\mathbf{M}_z \mathbf{e}_2)\|_2^2.$$


Correspondingly, the two loss functions in the TransR framework
are given by


$$L_1 = \sum_{(\mathcal{E}_1, \mathcal{E}_2, z)} d_z(\mathbf{e}_1, \mathbf{e}_2, \mathbf{h}_z),$$


$$L_2 = \sum_{(\mathcal{E}_1, \mathcal{E}_2, z) \in S} \; \sum_{(\mathcal{E}_1', \mathcal{E}_2', z') \in S'_{(\mathcal{E}_1, \mathcal{E}_2, z)}} [\gamma + d_z(\mathbf{e}_1, \mathbf{e}_2, \mathbf{h}_z) - d_{z'}(\mathbf{e}_1', \mathbf{e}_2', \mathbf{h}_{z'})]_+.$$


The projection matrices $\mathbf{M}_z, \forall z \in Z$, are included as part of
the trainable parameters. The rest of the training and resulting
math operation classification procedure remains unchanged
from the TransE framework.
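
The TransR-style distance differs from the TransE one only in that both expression embeddings are first projected by an operation-specific matrix and passed through a ReLU. A minimal sketch in plain Python; the helper names and the toy 2 x 2 matrices are assumptions made for illustration, and training of the matrices is not shown.

```python
def relu(vector):
    """Element-wise rectified linear unit."""
    return [max(0.0, x) for x in vector]


def matvec(matrix, vector):
    """Matrix-vector product for a matrix given as a list of rows."""
    return [sum(m * x for m, x in zip(row, vector)) for row in matrix]


def transr_distance(e1, e2, h, matrix):
    """Squared norm of ReLU(M e1) + h - ReLU(M e2)."""
    p1 = relu(matvec(matrix, e1))
    p2 = relu(matvec(matrix, e2))
    return sum((a + c - b) ** 2 for a, b, c in zip(p1, p2, h))


def classify_operation(e1, e2, params):
    """params maps an operation name to its (h, M) pair."""
    return min(params, key=lambda z: transr_distance(e1, e2, *params[z]))


# Toy parameters (hypothetical): an identity projection and a coordinate swap.
params = {
    "COMBINE_ADD": ([1.0, 0.0], [[1.0, 0.0], [0.0, 1.0]]),
    "DIV_SIDE": ([0.0, 1.0], [[0.0, 1.0], [1.0, 0.0]]),
}
```

Because each operation sees the expressions through its own projection, two operations can assign very different distances to the same pair of expression embeddings, which is the extra flexibility over the purely additive model.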


5. EXPERIMENTS 


We now detail a series of quantitative and qualitative exper- 
iments that we have conducted to validate the learned rep- 
resentations of math operations. Using the Cognitive Tutor 
2010 equation solving (CogTutor) dataset,² we focus on two
tasks: i) classifying the math operation a student applies in 
a solution step and ii) classifying the feedback category cor- 
responding to certain types of incorrect steps, from the math 
expressions the student enters before and after the step. 


5.1 Dataset 

We use the CogTutor dataset which we accessed via the 
PSLC DataShop [19]. The dataset contains detailed tu- 
tor logs generated as students in a school use the Cog- 
nitive Tutor system [34] for their Algebra I class. These 
logs contain the students’ step-by-step solutions to equa- 
tion solving problems, where each step is a tuple with 
three elements: a math expression $\mathcal{E}_1$ at the beginning of
the step, the step name $z$, i.e., the math operation the
student selected to apply to this math expression, and
the resulting math expression $\mathcal{E}_2$ after the step. Students
can select math operations from a built-in list in Cognitive
Tutor: COMBINE_ADD, COMBINE_MUL, ADD_SIDE,
SUB_SIDE, MUL_SIDE, DIV_SIDE, and DISTRIBUTE; see
Table 1 for an illustration of these operations and some
examples of the corresponding math expressions before and
after them in a step.


There are a total of 50,406 steps in this dataset that can be 
further divided into three subsets according to their out- 
comes: OK (43,413 steps), ERROR (6,377 steps), and BUG 
(5,744 steps). The OK subset contains steps that are cor- 
rect, i.e., the student both selected the correct math op- 
eration and arrived at the correct math expression. The 
BUG and ERROR subsets contain incorrect student steps, ei- 
ther because the operation they selected was incorrect or 


²https://pslcdatashop.web.cmu.edu/DatasetInfo?datasetId=660

Step (Math operation) | Description | Example
COMBINE_ADD | combine two similar terms with add/sub operator | 3x + 2x → 5x
COMBINE_MUL | combine two similar terms with multiply/divide operator | x * x → x^2
ADD_SIDE | add a math term on each side | x = 1 → x + 1 = 1 + 1
SUB_SIDE | subtract a math term on each side | x = 1 → x - 1 = 1 - 1
MUL_SIDE | multiply a math term on each side | x = 1 → x * 2 = 2
DIV_SIDE | divide a math term on each side | x = 1 → x/2 = 1/2
DISTRIBUTE | distribute (expand) the terms | (x + 1)x → x * x + x


Table 1. Detailed descriptions and examples for each math operation in the CogTutor dataset.


because they selected the correct operation but did not
apply it correctly, i.e., they arrived at an incorrect math
expression after the step. The difference between these two
subsets is that BUG contains steps that fit one of the
predefined error templates in the Cognitive Tutor system; in
this case, the system can automatically diagnose the error
and deploy predefined feedback. ERROR, on the other hand,
contains incorrect steps whose underlying error Cognitive
Tutor could not automatically diagnose. The OK subset can be
further split into six predefined difficulty levels (named
ES_01, ES_02, ES_03, ES_04, ES_05, and ES_07), with 2,068,
7,546, 8,183, 13,393, 5,484, and 2,801 steps, respectively.
We do not further split the BUG and ERROR subsets for the
math operation classification task due to their limited sizes.


To learn the representation of math operations, we need 
examples of how they transform one math expression into 
another. However, the CogTutor dataset may not contain 
enough data that is rich in both quantity and diversity for 
neural network-based models to learn from. Therefore, we 
designed a synthetic data generator stemming from the math 
question answering dataset created by DeepMind [36]. The 
generator can generate steps by first generating the initial 
math expression and then applying math operations listed 
in Table 1 to arrive at a resulting math expression. We 
have full control over the generated steps through the en- 
tropy, degree, and flip parameters. Increasing entropy intro- 
duces more complexity to the math expressions as numer- 
ical constants generated get larger. Increasing the degree 
parameter introduces monomials of higher degrees and also 
adds more terms in the math expression. Finally, the flip 
parameter allows us to control which side of an equation 
has a higher chance to be more complicated than the other. 
Tuning these parameters within this flexible synthetic data
generation method enables us to generate a large number of
steps that closely resemble those in the CogTutor dataset.
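To make the role of the three parameters more concrete, the following is a minimal sketch of a step generator, not the generator used in this work (which stems from the DeepMind dataset [36]). The function names and the exact way `entropy`, `degree`, and `flip` are interpreted are assumptions made for illustration, and only the four *_SIDE operations are covered.

```python
import random


def random_poly(rng, entropy, degree):
    """Random polynomial in x as a string. Larger `entropy` allows larger
    constants; larger `degree` adds higher powers and more terms."""
    terms = []
    for power in range(degree, -1, -1):
        coeff = rng.randint(1, 10 ** entropy)
        if power == 0:
            terms.append(str(coeff))
        elif power == 1:
            terms.append(f"{coeff}*x")
        else:
            terms.append(f"{coeff}*x^{power}")
    return "+".join(terms)


def generate_step(rng, operation, entropy=1, degree=1, flip=0.5):
    """Return (before, operation, after) for one of the four *_SIDE operations."""
    symbols = {"ADD_SIDE": "+", "SUB_SIDE": "-", "MUL_SIDE": "*", "DIV_SIDE": "/"}
    complex_side = random_poly(rng, entropy, degree)
    simple_side = str(rng.randint(1, 10 ** entropy))
    # With probability `flip`, the right-hand side is the more complex one.
    if rng.random() < flip:
        lhs, rhs = simple_side, complex_side
    else:
        lhs, rhs = complex_side, simple_side
    term = rng.randint(1, 10 ** entropy)  # term applied to both sides
    sym = symbols[operation]
    before = f"{lhs}={rhs}"
    after = f"({lhs}){sym}{term}=({rhs}){sym}{term}"
    return before, operation, after


rng = random.Random(0)
steps = [generate_step(rng, op, entropy=1, degree=2, flip=0.3)
         for op in ("ADD_SIDE", "SUB_SIDE", "MUL_SIDE", "DIV_SIDE")]
```

Because each step is produced by applying a known operation, the operation label comes for free, which is what makes synthetic data convenient as a training set.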


5.2 Methods


To fully evaluate the effectiveness of our math operation 
representations, we also experiment with two other ways of 
encoding math expressions commonly used in natural lan- 
guage processing tasks, in addition to the tree embedding- 
based and translation-based encoder that we introduced in 
Section 4.2. These two encoders are a gated recurrent
unit (GRU)-based encoder [4] and a convolutional neural
network (CNN)-based encoder [17]; we will use the output
of these encoders to replace $[\mathbf{e}_1^T, \mathbf{e}_2^T]^T$
as input to the classifier detailed in Section 4.1.


Specifically, these two encoders first concatenate the two
math expressions before and after the step, i.e.,
$\mathcal{E} = [\mathcal{E}_1, \mathcal{E}_2]$. For each
character $x_t$ in $\mathcal{E}$, we compute its embedding

$$\mathbf{x}_t = \mathbf{W}^T \mathrm{onehot}(x_t),$$

where $\mathbf{W}$ is a trainable embedding matrix. Using these
character embeddings, the GRU encoder computes

$$\mathbf{h}_t = \mathrm{GRU}_\theta(\mathbf{x}_t, \mathbf{h}_{t-1}),$$

where $\theta$ represents all the trainable parameters in the
GRU. We then replace $[\mathbf{e}_1^T, \mathbf{e}_2^T]^T$ with
$\mathbf{h}_T$ as input to the classifier, where $T$ is the
total number of characters in $\mathcal{E}$. Similarly, the CNN
encoder computes

$$\mathbf{h} = \mathrm{max\_pool}(\mathrm{CNN}_\phi([\mathbf{x}_1, \ldots, \mathbf{x}_T])),$$

where $\mathrm{CNN}_\phi$ represents a 2D CNN with parameters
$\phi$ and max_pool is a 1D max pooling operator. Combined, they
return a fixed dimensional feature vector $\mathbf{h}$ that
replaces $[\mathbf{e}_1^T, \mathbf{e}_2^T]^T$ as input to the
classifier. For each of these two models, we learn its
parameters jointly with the classification task using the
cross-entropy loss that we described in Section 4.1.
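The sketch below illustrates, in plain Python, why the CNN-style encoder returns a vector of fixed dimension regardless of the number of characters: each filter is slid over the character embeddings and only its maximum response is kept. It is a simplified stand-in under stated assumptions (randomly initialized weights, filters spanning the full embedding width, hypothetical helper names `embed` and `conv_max_pool`), not the architecture of [17].

```python
import random


def embed(text, vocab, W):
    """Map each character to its embedding row, i.e. W^T onehot(x_t)."""
    return [W[vocab[ch]] for ch in text]


def conv_max_pool(xs, filters, width):
    """Apply each filter to every window of `width` consecutive character
    embeddings and keep the maximum response (max pooling over positions)."""
    h = []
    for f in filters:  # each filter holds width * d weights
        responses = []
        for start in range(len(xs) - width + 1):
            window = [v for x in xs[start:start + width] for v in x]
            responses.append(sum(w * v for w, v in zip(f, window)))
        h.append(max(responses))
    return h


rng = random.Random(0)
chars = "0123456789x+-*/=()"
vocab = {ch: i for i, ch in enumerate(chars)}
d, width, n_filters = 4, 3, 8
W = [[rng.uniform(-1, 1) for _ in range(d)] for _ in chars]
filters = [[rng.uniform(-1, 1) for _ in range(width * d)]
           for _ in range(n_filters)]

# Concatenate the expressions before and after a (hypothetical) step.
short = conv_max_pool(embed("x+5=9" + "x=4", vocab, W), filters, width)
longer = conv_max_pool(embed("2*(x+1)=10" + "2*x+2=10", vocab, W), filters, width)
```

Both `short` and `longer` have `n_filters` entries, so either can be passed to the same downstream classifier. In practice the weights would be trained jointly with the classifier rather than sampled at random.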


Overall, we test five different methods for the math oper- 
ation classification and feedback classification tasks. The 
first three methods use different encoding methods in con- 
junction with a classifier: i) using the GRU encoder to en- 
code math expressions as input to the classifier, which we 
dub GRU+C, ii) using the tree embedding-based encoder in- 
stead, which we dub TE+C, and iii) using the CNN encoder 
instead, which we dub CNN+C. These methods do not learn
explicit representations of math operations. The next two 
methods use the TransE and TransR frameworks to learn 
these representations using tree embeddings: iv) using tree 
embedding-based encoder as input to the TransE framework 
in conjunction with a nearest neighbor classifier, which we 
dub TE+TransE, and v) using the TransR framework in- 
stead of the TransE framework to study math operations in 
multiple relation spaces, which we dub TE+TransR. 
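To illustrate how a nearest neighbor classifier can be used with translation-based embeddings, the sketch below assumes the standard TransE intuition that an operation embedding r should satisfy e1 + r ≈ e2, and predicts the operation whose embedding is closest to e2 - e1. The function name and the toy 2-D embeddings are hypothetical; the exact training objective is the one described in Section 4.2.

```python
import math


def predict_operation(e1, e2, op_embeddings):
    """Return the operation whose embedding r is closest to e2 - e1,
    i.e. the r that minimizes ||e1 + r - e2||."""
    diff = [b - a for a, b in zip(e1, e2)]
    return min(op_embeddings,
               key=lambda op: math.dist(diff, op_embeddings[op]))


# Toy operation embeddings (made up for illustration only).
op_embeddings = {
    "ADD_SIDE": [1.0, 0.0],
    "SUB_SIDE": [-1.0, 0.0],
    "DISTRIBUTE": [0.0, 1.0],
}
e1 = [0.2, 0.5]   # embedding of the expression before the step
e2 = [1.1, 0.6]   # embedding of the expression after the step
predicted = predict_operation(e1, e2, op_embeddings)
```

Under TransR, the same rule would be applied after projecting e1 and e2 into the relation space of each candidate operation.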


5.3 Experimental Setup

We first test our math operation representation learning 
methods on the OK subset via 5-fold cross-validation, i.e., 
training on 80% of steps in the subset to learn representa- 
tions of math operations and testing them on the remaining 
20%. We also test the generalizability of the learned
representations to incorrect steps, i.e., we replace the test
set with the ERROR and BUG subsets and check whether we can
still recognize the math operation a student applied in an
incorrect step. The results are detailed in Section 5.4.1.
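The 5-fold protocol above can be sketched as follows. The helper names are hypothetical, the shuffling seed is arbitrary, and the use of the sample standard deviation when summarizing fold accuracies is an assumption, since the text does not say which estimator was used.

```python
import random
import statistics


def k_fold_indices(n, k=5, seed=0):
    """Yield (train, test) index lists; each fold serves as the test set once."""
    idx = list(range(n))
    random.Random(seed).shuffle(idx)
    folds = [idx[i::k] for i in range(k)]
    for i in range(k):
        test = folds[i]
        train = [j for m, fold in enumerate(folds) if m != i for j in fold]
        yield train, test


def summarize(accuracies):
    """Mean and sample standard deviation of per-fold accuracies."""
    return statistics.mean(accuracies), statistics.stdev(accuracies)


# With 100 steps, each split trains on 80 steps and tests on the other 20.
splits = list(k_fold_indices(100, k=5))
```

For the generalization experiments, the model trained on each `train` split would additionally be evaluated on the full ERROR and BUG subsets instead of the held-out fold.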


Method    | OK           | ERROR        | BUG
GRU+C     | 99.18 ± 0.23 | 93.87 ± 0.66 | 95.89 ± 0.63
TE+C      | 99.82 ± 0.04 | 93.30 ± 0.65 | 95.38 ± 0.62
CNN+C     | 95.37 ± 0.44 | 86.82 ± 1.38 | 91.02 ± 0.59
TE+TransE | 96.27 ± 0.17 | 86.32 ± 1.23 | 84.21 ± 2.13
TE+TransR | 99.17 ± 0.21 | 91.28 ± 1.12 | 91.31 ± 1.87


Table 2. Math operation classification accuracy for all
methods trained on the OK subset of the CogTutor dataset and
tested on different data subsets. Accuracy is high across
the board, while GRU-based encoding and tree embedding-based
encoding in conjunction with a classifier result in the
best performance.


Since the distributions of math expressions in the OK, ERROR,
and BUG data subsets are mostly similar with minor
differences, the previous experiment does not give us a good
idea of the generalization ability of our math operation
representation learning methods. Therefore, we further divide
the OK subset into six smaller subsets, each corresponding to
a different difficulty level (with different structure and
complexity) according to the questions within it, and test the
generalizability of the learned math operation representations.
The results are detailed in Section 5.4.2. In practice,
solution step data generated by real students is often limited.
Therefore, we conduct two more experiments to test whether
synthetically generated steps can help us learn math operation
representations that generalize to real data. First, we
repeat the experiments above using synthetically generated
steps as the training set. This synthetic training set consists
of 1,000 steps for each math operation defined in Table 1
(adding up to a total of 7,000 across different difficulty
levels). The results are detailed in Section 5.4.3. Second, to
study the impact of synthetically generated data when real
data is limited, we pre-train the math operation
representations with synthetic data, fine-tune on a small
amount of real data from each difficulty level in the OK
subset, and test on the rest. The results are detailed in
Section 5.4.4.


To test the ability of our learned math operation
representations to recognize student errors, we use them to
classify feedback types provided by CogTutor in the BUG data
subset. Examples of such errors include cases where a student
calculated the wrong simplification result, used the wrong
sign in front of terms, or applied useless/illogical steps to
solve the problem. The results are detailed in Section 5.4.5.


We use the Adam optimizer [18] with learning rate 0.001 and
batch size 64, and run 10 training epochs for each experiment.
The math expression encoder outputs length-512 embedding
vectors for each math expression, which we adapt to
length-32 embedding vectors using a trainable
fully-connected neural network. All of our experiments were
conducted on a server with a single Nvidia RTX8000 GPU.


5.4 Results and Discussion 


5.4.1 Generalizing to incorrect steps 

Table 2 shows the averages and standard deviations of math
operation classification accuracy for every method we
experimented with using the OK subset as the training set. As
expected, testing on the ERROR and BUG subsets results in
slightly lower (5-10%) math operation classification
accuracy for all methods since the training set does not
contain incorrect steps. However, even on steps that are
incorrect, these methods can still effectively identify the
math operation a student intended to apply (with up to 95%
accuracy), suggesting that they may be applicable to fully
open-ended question solving solutions that are not highly
structured, unlike those in Cognitive Tutor, to provide
feedback to teachers on students' solution approaches.


We observe that using GRUs and tree embeddings as
representations for math expressions and applying a
classification method on top of these representations results
in similar performance; GRUs slightly outperform tree
embeddings when we use the ERROR and BUG subsets as the test
set, while tree embeddings slightly outperform GRUs when we
use a part of the OK subset as the test set.
Using CNNs to encode math expressions as input to a clas- 
sifier results in worse performance, suggesting that they do 
not capture the semantic and structural information in math 
expressions as well as GRUs and tree embeddings. As ex- 
pected, using tree embeddings under the TransE and TransR 
frameworks leads to worse performance than the first two 
methods, with TransE achieving low performance (especially 
on the BUG subset) and TransR achieving comparable per- 
formance to the classification-based methods on the OK sub- 
set but lower performance on the ERROR and BUG subsets. 
This result can be explained by the additional structural 
restriction that math operations are represented as linear 
and additive in some embedding space in the TransE frame- 
work, which makes it less robust against incorrect student 
solution steps. Using the TransR framework mitigates this 
problem due to its use of different relation spaces for each 
math operation. 


These methods perform similarly in the math operation
classification task on real data largely due to the limited
variation and complexity in the math expressions. The Cognitive
Tutor system limits the degrees of freedom in a student's
response by splitting an open-ended step into the separate
actions of selecting a single math operation and entering
the resulting math expression, which limits the variability
in the data. In the next experiment, we see that when we
control for different levels of complexity in these math
expressions and force these methods to generalize across
complexities, their performance varies significantly.


[Figure 4: (a) confusion matrix for math operation
classification; (b) Euclidean distance between learned math
operation embedding vectors. Both axes list COMBINE_ADD,
COMBINE_MUL, ADD_SIDE, MUL_SIDE, SUB_SIDE, DIV_SIDE, and
DISTRIBUTE.]


Figure 4. Details of TE+TransE for the math operation
classification task on the OK subset. These results match
our intuition on how these math operations are related.


Figure 4 visualizes the confusion matrix for math operation
classification on the OK subset and the pairwise Euclidean
distances between math operation embeddings learned via
the TransE framework using tree embeddings for math
expressions. Rows correspond to the true math operations
applied in steps and columns correspond to predicted ones.
Percentages in the confusion matrix (Figure 4a) are
normalized w.r.t. the number of appearances of each math
operation. We see that our math operation representation
learning method captures some meaning of these operations
(Figure 4b); the learned math operation embeddings capture
the structural changes in math expressions in ways that match
our intuition. For instance, both COMBINE_ADD and COMBINE_MUL
can be considered types of simplifications, so the Euclidean
distance between the learned embeddings for these two
operations is low. This observation is not surprising due to
the similar nature of these operations. Moreover, COMBINE_ADD,
COMBINE_MUL, and DISTRIBUTE are often confused with one
another. These results are also validated by a 2-D
visualization (using t-SNE [42] as a dimensionality reduction
method) of the learned math operation embeddings in Figure 5,
where different math operations are mostly well separated
except for COMBINE_ADD, COMBINE_MUL, and DISTRIBUTE. One
possible explanation is that these operations are all applied
to one side of the equation during a solution step, leaving
the other side of the equation unchanged, while the other
operations, i.e., ADD_SIDE, SUB_SIDE, MUL_SIDE, and DIV_SIDE,
are all applied to both sides of the equation. Therefore, this
result suggests that tree embeddings enable us to characterize
a math operation by the structural change in math expressions
before and after a solution step where it is applied.
Furthermore, the classification accuracy for the DISTRIBUTE
operation is significantly lower than that for other
operations. This result is likely due to the fact that the
number of steps with this operation is significantly lower
than that for other operations.
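The two quantities discussed above, a confusion matrix normalized by the number of appearances of each true operation and the pairwise Euclidean distances between operation embeddings, can be computed as in the following sketch. The function names and the toy labels in the usage lines are illustrative assumptions rather than the code used to produce Figure 4.

```python
import math
from collections import Counter


def row_normalized_confusion(y_true, y_pred, labels):
    """Percentage of steps with true label t that were predicted as p.
    Each row sums to 100 (or stays at 0 if the label never appears)."""
    counts = Counter(zip(y_true, y_pred))
    matrix = {}
    for t in labels:
        total = sum(counts[(t, p)] for p in labels)
        matrix[t] = {p: (100.0 * counts[(t, p)] / total if total else 0.0)
                     for p in labels}
    return matrix


def pairwise_distances(embeddings):
    """Euclidean distance between every pair of operation embeddings."""
    return {(a, b): math.dist(embeddings[a], embeddings[b])
            for a in embeddings for b in embeddings}


# Toy usage with two labels and made-up predictions and embeddings.
cm = row_normalized_confusion(["A", "A", "B", "B"], ["A", "B", "B", "B"], ["A", "B"])
dist = pairwise_distances({"A": [0.0, 0.0], "B": [3.0, 4.0]})
```

Normalizing per row keeps rare operations such as DISTRIBUTE visible in the matrix even though they contribute few steps overall.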


5.4.2 Generalizing to different difficulty levels 

Figure 5. Visualization of learned math expression change
for a randomly sampled subset of student solution steps in
2-D and corresponding operations (best viewed in color).


In this experiment, we test the ability of our learned math
operation representations to generalize to math expressions
with different levels of complexity in questions at
different levels of difficulty. Although they are all about
equation solving, questions at different difficulty levels in
Cognitive Tutor involve math expressions that look very
different. For example, in the easiest level (ES_01), the
equation that needs to be solved in a question looks like
x + 5 = 9, with only a single variable and without decimal
numbers. In contrast, in the hardest level (ES_07), a
question may contain coefficients with several decimal places
and multiple variables, such as solving for m in the equation
m(k - n) = gs. We only compare the GRU-based encoder
and the tree embedding-based encoder in conjunction with
a classifier since they are the best performing methods in
the previous experiment. Table 3 lists the math operation
classification accuracy for both methods after training on
steps at different difficulty levels in the OK subset and
testing on steps at other difficulty levels (including
incorrect ones). We see that TE+C outperforms GRU+C in almost
every case. These results suggest that tree embeddings are
effective at capturing the structural properties of a math
expression. As a result, math operation representations based
on tree embeddings excel at capturing the structural change
in math expressions before and after applying a math
operation, leading to better generalizability than GRU-based
encoding, which does not explicitly account for this change.


5.4.3 Generalizing to different data distributions 

In this experiment, we test the ability of our methods to
generalize from synthetically generated data to real student
data. We train different math operation classification
methods on the 7,000 synthetically generated steps and test
them on steps generated by real students in the CogTutor
dataset. Table 4 shows the mean and standard deviation for
each method on each real data subset. We see that TE+C
significantly outperforms GRU+C and CNN+C on all data
subsets, which is in stark contrast to the previous experiment
where the difference in performance across all methods is
much smaller. This observation suggests that tree embeddings
are more effective at capturing the semantic/structural
effect of math operations on math expressions, thus
generalizing better to different data distributions. Indeed,
although the synthetically generated steps and the real steps
have the same set of math operations, the distributions of
numbers (1, 0.5, -7, etc.) and variables (x, u, t, etc.)
differ, resulting in a mismatch between the data
distributions. Tree embedding-based methods benefit from the
tree-based representations of math expressions that can
effectively capture structural information, making it easy
for the learned embeddings of math expressions to generalize
to unseen data.


Trained on | Method | OK           | ERROR        | BUG
ES_01      | GRU+C  | 58.82 ± 1.12 | 63.74 ± 1.13 | 66.02 ± 1.12
           | TE+C   | 76.51 ± 0.62 | 84.24 ± 0.87 | 67.49 ± 1.10
ES_02      | GRU+C  | 71.05 ± 1.12 | 76.66 ± 1.11 | 69.01 ± 1.14
           | TE+C   | 87.89 ± 0.34 | 93.96 ± 0.72 | 80.44 ± 0.78
ES_03      | GRU+C  | 82.39 ± 3.93 | 79.24 ± 1.47 | 80.01 ± 1.67
           | TE+C   | 90.79 ± 1.12 | 93.83 ± 1.32 | 84.70 ± 1.54
ES_04      | GRU+C  | 76.72 ± 0.14 | 71.35 ± 6.12 | 83.32 ± 2.24
           | TE+C   | 94.65 ± 0.12 | 92.72 ± 1.32 | 90.99 ± 1.72
ES_05      | GRU+C  | 81.74 ± 0.33 | 73.36 ± 1.69 | 78.36 ± 1.07
           | TE+C   | 87.66 ± 0.25 | 80.00 ± 1.32 | 77.81 ± 0.99
ES_07      | GRU+C  | 76.25 ± 3.21 | 73.15 ± 3.42 | 67.35 ± 3.62
           | TE+C   | 79.44 ± 0.62 | 79.29 ± 0.72 | 72.53 ± 2.26


Table 3. Math operation classification accuracy after
training on steps with different difficulty levels and testing
on the OK, ERROR, and BUG subsets. Tree embedding-based
encoding outperforms GRU-based encoding.


Figure 6. Math operation classification accuracy for the
TE+C method when real data is limited. Using synthetically
generated steps as a starting point, we already start
with acceptable classification accuracy even with few real
steps generated by students. The performance steadily
improves after more real data becomes available.


5.4.4 Generalizing from synthetic data 

Ideally, if there is a large amount of training data, i.e., steps
generated by real students containing different types of math
expressions and detailed labels on these steps such as the
math operation(s) applied, the error(s) if a step is incorrect,
and corresponding feedback, we can simply use that data
to learn our math operation representations. However, in
practice, the amount of real data is often limited. Figure 6
plots the performance of TE+C on all subsets of the CogTutor
dataset when training on a portion of the steps in each subset
and testing on the rest. We see that the performance on math
operation classification suffers considerably when we only
have limited training data. Therefore, synthetically generated
data can play a vital role in improving performance in this
setting; the strategy of fine-tuning models trained on
synthetically generated data using a small amount of real data
can be effective. Specifically, we start with a math operation
classification model pre-trained on the 7,000 synthetically
generated steps and fine-tune it on a small number of real
steps by doing gradient descent on these steps for 10 epochs.
Figure 7 plots the improvement in math operation
classification accuracy for the fine-tuned model over the
model that trains on only real data of various amounts on all
data subsets. We see that the pre-trained models always
perform better, with significant improvement when the real
data is extremely limited. This result suggests that i)
effectively leveraging synthetically generated data can
mitigate the problem of limited real data and ii) our math
operation representation learning methods are capable of
generalizing across different data distributions
(synthetic → real).


Method    | OK           | ERROR        | BUG
GRU+C     | 62.89 ± 3.93 | 64.06 ± 4.70 | 62.94 ± 2.24
TE+C      | 83.79 ± 0.14 | 75.49 ± 0.90 | 75.16 ± 0.55
CNN+C     | 51.12 ± 1.64 | 45.52 ± 0.98 | 59.82 ± 1.68
TE+TransE | 80.17 ± 2.32 | 71.86 ± 3.24 | 72.32 ± 2.72
TE+TransR | 82.22 ± 2.88 | 73.83 ± 3.46 | 74.85 ± 3.23


Table 4. Math operation classification accuracy for all
methods trained on 7,000 synthetically generated steps and
tested on different subsets of the CogTutor dataset. Tree
embedding-based methods significantly outperform other
methods, showing better ability to generalize to different
data distributions.


Figure 7. Math classification accuracy (difference in
percentage) for TE+C, pre-training on synthetic data before
fine-tuning on real data versus training only on real data.
When real data is limited, pre-training on synthetic data
results in significantly better performance.


5.4.5 Feedback type classification 

In this experiment, we evaluate our math operation
representation learning methods on the feedback type
classification task. These feedback items were automatically
deployed by Cognitive Tutor for incorrect steps in the BUG
subset. We pre-processed these steps and grouped the detailed
feedback items according to the students' errors that each
feedback item addresses, narrowing them down to a total
of 24 types that occur multiple times. We perform 5-fold
cross-validation on this subset. Table 5 shows the averages
and standard deviations of feedback classification accuracy
for all methods on this task across the five folds. We see
that due to the limited size of the BUG subset (only 5,744
steps) and the high number of classes (24), all methods
perform worse than they do on the math operation
classification task. Specifically, we see that the tree
embedding-based encoder in conjunction with a classifier
performs best while GRU-based encoding also performs well.
This result shows that although tree embeddings are superior
at capturing the meaning of math expressions, their advantage
over simple encoding methods such as GRU-based encoding
decreases due to increased noise in the data; some math
expressions submitted by students in incorrect steps are
ill-posed and do not make sense. Using the TransE and TransR
frameworks results in slightly worse performance than
classifiers since these methods explicitly learn a
representation for each math operation, which limits their
performance on this task due to the shortage of training data.
However, since they capture the structural difference in math
expressions before and after the step, they can cancel out
some of the noise in erroneous steps, resulting in acceptable
performance.


Method    | Accuracy
GRU+C     | 75.35 ± 1.41
TE+C      | 78.71 ± 1.74
CNN+C     | 67.23 ± 1.54
TE+TransE | 69.15 ± 1.13
TE+TransR | 73.21 ± 1.63


Table 5. Feedback type classification accuracy for all
methods on the BUG subset. Tree embedding-based encoding
outperforms other encoding methods while the TransE and TransR
frameworks do not reach similar performance levels due to
shortage of training data.


5.5 Discussions 

Overall, we find that the GRU-based and tree embedding- 
based math expression encoders in conjunction with a classi- 
fier perform almost equally well in most situations, while the 
CNN-based encoder performs worse. The tree embedding- 
based encoder has stronger generalizability across different 
data distributions. We believe that as the math expressions 
and operations get more complicated, methods that lever- 
age the tree structure of math expressions would be more 
advantageous. We also observe that TransR outperforms
TransE most of the time, although in some experiments using
TransE and TransR to explicitly learn math operation
embeddings leads to slightly worse performance than
classifiers using implicit representations of math expressions.
However, TransE and TransR are much more powerful and
enable us to study more tasks, such as clustering solution
steps, identifying typical student errors, and learning
solution strategies; see Section 6 for a detailed discussion.


6. CONCLUSIONS AND FUTURE WORK 


In this paper, we developed a series of methods to learn
representations of math operations by observing how math
expressions change as a result of these operations in
step-by-step solutions to open-ended math questions. Our
methods leverage math expression encoding methods that map
tree-structured math expressions into a math embedding vector
space. We demonstrated the effectiveness of our methods
on a dataset containing detailed student solution steps to
equation solving questions in the Cognitive Tutor system on
two tasks: i) classifying the math operation applied in each
step and ii) classifying the feedback the system deploys for
each incorrect step. Results show that our learned math
operation representations are meaningful and can often
effectively generalize across different data distributions
such as questions with different difficulty levels.


However, the success of our methods heavily depends on the
availability of diverse large-scale training data. The
Cognitive Tutor dataset that we used in this work represents a
heavily restricted solution process since the list of math
operations a student can apply in a step is pre-defined.
Therefore, additional work has to be done to extend our
methods to truly open-ended step-by-step solution processes
that are less structured. Moreover, our methods are restricted
to a single solution step only and do not consider the
relationship across multiple steps, which is related to
another important aspect of solving open-ended math questions:
the overall solution strategy, i.e., which math operation to
apply next. Furthermore, in both classification tasks, using
tree embeddings to encode math expressions in conjunction with
a classifier outperforms explicitly learning vectorized
representations of math operations in the TransE and TransR
frameworks. However, these explicit representations may
enable us to perform other tasks, such as those noted in
Section 5.5. Nevertheless, our work provides a series of tools
to analyze the math expressions students write down in their
solutions by bridging the gap between symbolic math
representations and continuous representations in vector
spaces, enabling the use of state-of-the-art neural
network-based methods. We believe that this work can
potentially open up a new line of research that studies how to
automatically analyze student solutions for grading and
feedback purposes.


There are many avenues of future work. First, since most 
real-world open-ended solutions contain a mixture of math 
expressions and text, there is a need to learn a joint represen- 
tation of math expressions and text in a shared embedding 
space. Second, this joint representation will enable us to 
train automated feedback generation methods in an end-to- 
end manner, using sequence-to-sequence learning methods 
[41]. Third, using learned math expression representations 
as the states and learned math operation representations 
from the TransE and TransR frameworks as the state transi- 
tion model, we can apply reinforcement learning and inverse 
reinforcement learning methods to learn solution strategies, 
i.e., which math operation to apply in the next step. We can 
also study solution strategies employed by real students [33] 
and diagnose their errors and design corresponding feedback 
mechanisms to improve their learning outcomes. These future
work directions will enable us to tap into the full potential
of explicit math operation representations, which is not
fully demonstrated in this paper: on the CogTutor dataset,
the only relevant real-world dataset we found, we could only
evaluate these explicit representations on the math operation
and feedback prediction tasks, where they may not outperform
tree embedding-based classification methods.


ABSTRACT


Classifying educational forum posts is a longstanding task
in the research of Learning Analytics and Educational Data
Mining. Though this task has been tackled by applying
both traditional Machine Learning (ML) approaches (e.g.,
Logistic Regression and Random Forest) and up-to-date
Deep Learning (DL) approaches, a systematic examination of
these two types of approaches that portrays their performance
difference is lacking. To better guide researchers
and practitioners in selecting a model that best suits their
needs, this study aimed to systematically compare the
effectiveness of these two types of approaches for this
specific task. Specifically, we selected a total of six
representative models and explored their capabilities by
equipping them with either extensive input features that were
widely used in previous studies (traditional ML models)
or the state-of-the-art pre-trained language model BERT
(DL models). Through extensive experiments on two
real-world datasets (one is open-sourced), we demonstrated
that: (i) DL models uniformly achieved better classification
results than traditional ML models and the performance
difference ranges from 1.85% to 5.32% with respect to
different evaluation metrics; (ii) when applying traditional ML
models, different features should be explored and engineered
to tackle different classification tasks; (iii) when applying
DL models, it tends to be a promising approach to adapt
BERT to the specific classification task by fine-tuning its
model parameters. We have publicly released our code at
https://github.com/1sha49/LL_EDU_FORUM_CLASSIFIERS


Keywords 
Educational Forum Posts, Text Classification, Deep Neural 
Network, Pre-trained Language Models 


1. INTRODUCTION 


In the past two decades, researchers have developed a num- 
ber of online educational systems to support learning, e.g., 
Massive Open Online Courses, Moodle, and Google Class- 
room. Though being widely recognized as a more flexible 
option compared to campus-based education, these systems 
are often limited by their asynchronous mode of delivery 
that may hinder effective interaction between instructors 
and students and between students themselves [20]. As 
a remedy, the discussion forum component is often included 
to support communication between instructors and class- 
mates, so students can create posts for different purposes, 
e.g., to ask questions, express opinions, or seek technical 
help. Moreover, in certain cases, instructors rely heavily on 
the use of a discussion forum to promote peer-to-peer col- 
laboration, e.g., specifying a topic to spur discussions among 
students. 


In this context, the timeliness of an instructor's response to a
student post becomes critical. A group of studies has
demonstrated that students' learning performance and course
experience were greatly affected by the timeliness of the
responses they received from instructors [14]. It is,
therefore, critical that instructors monitor the discussion
forum to provide timely help to students who need it and
ensure the discussion unfolds in a way that benefits all
students. However, nowadays, up to tens of thousands of
students can enroll in an online course and create a variety of
posts that differ by importance, i.e., not all of them warrant
instructors' immediate attention. Therefore, it becomes
increasingly challenging for instructors to promptly identify
posts that require an urgent response or to understand how well
students collaborate in the discussion space.


To tackle this challenge, various computational approaches 
have been developed across different courses and domains to 
classify educational forum posts, e.g., to distinguish between
urgent and non-urgent posts [2] or to label posts for different
levels of cognitive presence [11]. Typically, these
approaches relied upon traditional Machine Learning (ML)
models, such as Logistic Regression, Support Vector Ma- 
chine (SVM), and Random Forest. These models yielded a 
high level of accuracy, most often due to the extensive efforts 
that domain experts made to engineer input features. For 
post classification tasks, such features are linguistic terms 
describing the post content (e.g., words that represent negative
emotions) and the post metadata (e.g., a creation timestamp).


In recent years, Deep Learning (DL) models have emerged 
as a powerful strand of modeling approaches to tackle data- 
intensive problems. Compared to traditional ML models, 
DL models no longer require the input of expert-engineered
features; instead, they are capable of implicitly extracting 
such features from data with a large number of computa- 
tional units (i.e., artificial neurons). Particularly, DL models 
have achieved great success in solving various Natural Lan- 
guage Processing (NLP) problems, e.g., machine translation 
[48], semantic parsing [22], and named entity recognition
[60]. Driven by this, a few studies have been conducted and
demonstrated the superiority of DL models over traditional 
ML models in classifying educational forum posts
[59]. For instance, Guo et al. showed that DL models
can outperform a decision tree based ML model proposed
in [2] by 0.1 (measured by F1 score) in terms of identifying
urgent posts, while another study demonstrated that, when
determining whether a post contains a question or not, the
performance difference between SVM and DL models was up to
0.68 (measured by Accuracy).


Though achieving high performance, DL models have not 
been justified as an always-more-preferable choice compared 
to traditional ML models. The reasons are threefold. Firstly, 
studies investigating the difference in performance between 
traditional ML and DL models have mostly harnessed a 
limited set of traditional ML models for comparison, with- 
out making extensive feature engineering efforts to empower 
those traditional ML models. As an example, one study compared
only SVM to a group of DL models, and the SVM model
in that study incorporated only one type of feature, i.e.,
the term frequency-inverse document frequency (TF-IDF)
score of the words in a post. This implies that the potential 
of the traditional ML models used in existing studies was 
not fully explored and the actual performance difference between
the two types of models might be smaller than
studies to date have reported. Secondly, researchers and
practitioners often need to deliberately trade off several rel- 
evant factors before determining which model they should 
use in practice, and classification performance is only one of 
these factors. Other important factors are the availability 
of human-annotated training data and computing resources 
[29]. For instance, compared to traditional ML models, DL 
models demand a much larger amount of human-annotated 
training data, whose creation can be a time-consuming and 
costly process. Besides, efficient training of DL models re- 
quires access to strong computing resources (e.g., a GPU 
server), which may be unaffordable to researchers and prac- 
titioners with a limited budget. Most traditional ML mod- 
els, on the other hand, can be easily trained on a laptop. 
Thirdly, the feature engineering required by traditional ML 
models plays an important role in contributing to a theoret- 
ical understanding of constructs that are not only useful for 
classification of forum posts, but are also informative about 
students’ discussion behaviors, offering instructors insights 
on whether their instructional approach works as expected.


To help researchers and educators select relevant models for
post classification, this study aims at providing a systematic 
evaluation of the mainstream ML and DL approaches com- 
monly used to classify educational forum posts. Throughout 
this evaluation, we advance research in the field by ensur- 
ing that: (i) sufficient effort is allocated to design as many 
meaningful features as possible to empower traditional ML 
models; (ii) an adequate number of representative ML and 
DL models is included; (iii) the effectiveness of selected mod- 
els is examined by using more than one dataset, thus adding 
to the robustness of our approach to different educational 
contexts; (iv) all models are compared in the same experimental
setting, e.g., with the same training/test data splits,
and performance is reported on widely-used evaluation metrics
to provide common ground for model comparison; and
(v) the coding schemes used for labeling discussion posts are
made publicly available to motivate the replication of our 
study. Formally, the evaluation was guided by the following 
two Research Questions: 


RQ1 To what extent can traditional ML models accu- 
rately classify educational forum posts? 


RQ2 What is the performance difference between tradi- 
tional ML models and DL models in classifying ed- 
ucational forum posts? 


To answer the RQs, we chose two human-annotated datasets 
collected at two educational institutions: Stanford Univer- 
sity and Monash University. We further conducted the 
evaluation as per the following two classification tasks: (i) 
whether a post requires an urgent response or not; and (ii) 
whether the post content is related to knowledge and skills 
taught in a course. Specifically, to answer RQ1, we first 
surveyed relevant studies that reported on applying tradi- 
tional ML models to classify educational forum posts. We 
hence selected four models that were commonly utilized, i.e.,
Logistic Regression, Naive Bayes, SVM, and Random Forest.
In particular, we collected features frequently employed
in the reviewed studies and incorporated them as an input
to empower the four traditional ML models in our experi- 
ment. Given that these features may play different roles in 
different classification tasks, we further conducted a feature 
selection analysis to shed light on the features that must be 
included in the future application of these models for similar 
classification tasks. 


To answer RQ2, we selected two widely adopted DL
models, Convolutional Neural Network coupled with Long 
Short-Term Memory (CNN-LSTM) and Bi-directional LSTM 
(Bi-LSTM), and compared them to the four selected tradi- 
tional ML models. Recent studies in DL suggested that the 
performance of a model adopted for solving an NLP task 
(CNN-LSTM or Bi-LSTM in our case, denoted as the task 
model for simplicity) can be greatly improved with the aid 
of state-of-the-art pre-trained language models like BERT 
in two ways. Firstly, BERT can be used to transform 
the raw text of a post into a set of semantically accurate 
vector-based representations (i.e., word embedding), which 
comprise the input information for the task model and en- 
able the model to distinguish among multiple characteristics 
of a post. Secondly, BERT can adapt itself to capture the 
unique data characteristics of the task at hand. To this end, 
BERT couples with the task model and learns the model 
parameters. In particular, such flexibility has been demon- 
strated as extremely helpful in the contexts where training 
data was not sufficient. Therefore, we explored the effective- 
ness of BERT in empowering the two DL models selected for 
the experiment. We provide details in Section 3.


The performance of the four traditional ML and two DL models
was examined by four evaluation metrics commonly used in
classification tasks, i.e., Accuracy, Cohen’s κ, Area Under
the ROC Curve (AUC), and F1 score. In summary, this
study contributed to the literature of the classification of 
educational forum posts with the following main findings: 


• Compared to other traditional ML models, Random
Forest is more robust in classifying educational forum
posts;


• Both textual and metadata features should be engineered
to empower traditional ML models;


• Different features should be designed when applying
traditional ML models for different classification tasks;


• DL models tend to outperform traditional ML models
and the performance difference ranges from 1.85% to
5.32% with respect to different evaluation metrics;


• Using the pre-trained language model BERT benefits
the performance of DL models.


2. RELATED WORK 


2.1 Content Analysis of Forum Posts 

Across disciplines, educators widely utilize online discussion 
forums to accomplish different instructional goals. For in- 
stance, instructors often provide an online discussion board 
as a platform for students to ask questions and get answers 
about course content [57], argue for/against a particular
issue and, in that way, engage deeply with course topics,
or work collaboratively on a course project [13].


In this process, instructors monitor student involvement by 
reading their posts. At the same time, instructors judge 
student contributions in the discussion task, e.g., whether 
students asked a question that relates to course content vs. 
a question about semester tuition; described their feelings 
about the discussed problem vs. just rephrased the prob- 
lem; or clearly communicated their ideas to classmates in 
a collaborative learning task. Upon identifying posts that 
do not contribute to the forum at the expected level, the 
instructor may intervene accordingly. Sometimes, such an 
intervention needs to be provided immediately (e.g., in the
case of a post pointing out an error in the practice exam
key).


With the increasing popularity of online discussion forums in 
the instructional context, educational researchers have be- 
come interested in conducting content analysis of students’ 
posts to find evidence and extent of learning processes that 
instructors aimed to elicit in online discussion. To this end, 
researchers utilize a coding scheme, a predefined protocol that
categorizes and describes participants’ behaviors representative
of the observed educational construct [37], e.g.,
knowledge building [35], critical thinking [35], argumentation [26],
social cues, cognitive/meta-cognitive skills and
depth of cognitive processing [25], and self-regulated
learning in collaborative learning settings [50]. As per the
analytical procedure, researchers read student postings and 
apply a code over a unit of analysis that can be determined 
physically (e.g., entire post), syntactically (e.g., paragraph, 
sentence) or semantically (e.g., meaningful unit of text) 
[47]. Content analysis clearly demonstrated a potential to 
capture relevant, fine-grained discussion behaviors and provide
researchers and educators with warranted inferences
made from coding data [28].


Manual content analysis is time-consuming [25], especially in 
high-enrollment courses with thousands of discussion posts 
that students create. To automate the process of content 
analysis and support monitoring of student discussion activ- 
ity, various computational approaches have been developed 
for post classification. These approaches relied upon tradi- 
tional ML models and DL models and handled four common 
types of post classification tasks: content, confusion, senti- 
ment, and urgency. Below, we expand upon the studies that 
reported on these tasks. 


2.2 Traditional Machine Learning Models 

Educational researchers have applied traditional ML mod- 
els to automate content analysis of online discussion posts 
for different instructional needs. The ML models we identified
in this review are predominantly based on the supervised
learning paradigm and can be categorized into four general
methodological approaches: regression-based (e.g., Logistic
Regression [36]), Bayes-based (e.g., Naive Bayes [36]),
kernel-based (e.g., SVM [40, 30]), and tree-based (e.g.,
Random Forest). These models were designed to predict
an outcome variable that represented the meaning of dis- 
cussion posts across different categories such as confusion, 
sentiment or urgency. For instance, one study created an SVM
classifier to differentiate between content-related and
non-content-related questions in a discussion thread to help
instructors more easily detect content-related discourse across
an extensive number of student posts in MOOCs, while another
implemented a Logistic Regression classifier to detect confusion
in students’ posts and automatically recommend task-relevant
learning resources to students who need them. A further
study applied SVM to detect student achievement emotions
in MOOC forums and studied the effects of those emotions
on student course engagement.


In recent years, researchers became increasingly interested 
in analyzing the expression of urgency (e.g., regarding course 
content, organization, policy) in a discussion post [2]. For
example, one study developed multiple ML classifiers to identify
posts that need prompt attention from course instructors.
While researchers mostly implemented supervised ML models,
here we also note a small group of studies that reported
on using unsupervised methods to classify forum posts, e.g.,
a lexicographical database of sentiments and minimizing
entropy [8].


Traditional ML models were built upon textual and non-textual
features extracted from students’ posts. Textual features
characterize the content of the discussion post, e.g., presence of
domain-specific words [44], presence of words reflective
of psychological processes, term frequency,
emotional and cognitive tone, presence of predefined
hashtags [21], text readability index, text cohesion metrics [19],
and measures of similarity between message text [19].
Non-textual features, on the other hand, include post metadata,
e.g., popularity (views, votes, and responses) [12, 45],
features of social network users, timestamp, type
(post vs. response), a variable that signals whether the
issue has been resolved or not, the relative position of
the most similar post, a variable that signals whether the
author of the post is also the initiator of the thread [51],
page rank value of the author of the current post, an indicator of
whether a message is the first or last in a thread [38, 31, 1, 19],
and the structure of the discussion thread.


Researchers computed a variety of evaluation metrics to assess
the performance of these models. Classification accuracy
was commonly applied in the studies that we reviewed.
Generally, models achieved classification accuracy
of 70% to 90% in classifying forum posts across different
levels of content identification, urgency, confusion, and sentiment.
We also note that some authors opted for different
or additional evaluation metrics, e.g., precision/recall [12, 1, 62],
AUC, F1 [1, 2], and kappa [1]. Across the
models, researchers utilized a wide range of different validation
strategies (e.g., cross validation, train/test split).


We identified two major challenges researchers should be 
aware of when using traditional machine learning approaches 
to detect relevant content, confusion, sentiment and/or ur- 
gency in a discussion forum. First, traditional machine 
learning approaches usually involve extensive feature engi- 
neering. In the context of post classification, a huge num- 
ber of textual and non-textual features of a post is practi- 
cally available to researchers. Features can be generated us- 
ing different text mining approaches (e.g., dictionary-based, 
rule-based) and can even be produced using other classifiers
(e.g., [1]). Researchers thus often face a challenge to
decide which feature subset to choose to best capture educa- 
tional problems (e.g., off-topic posting, misinterpreting the 
discussion task, unproductive interaction with peers) and/or 
learning process of interest (e.g., knowledge building, criti- 
cal thinking, argumentation). For this reason, domain and 
learning experts, including course instructors, learning sci- 
entists, and educational psychologists are often needed to 
define a feature space that aligns with the purpose of an 
online discussion. Second, prior works took a variety
of different approaches to validate the classifiers they
developed in terms of metrics, datasets, and training parameters,
which makes it hard to directly compare the
performance of these ML models.


2.3 Deep Learning Approaches 

To our knowledge, relatively few studies have attempted to explore
the effectiveness of DL approaches in classifying educational
forum posts [54, 59, 10, 24, 8, 3, 6]. The DL models
adopted by these studies typically relied on the use of CNN,
LSTM, or a combination of them. For instance, one study
developed a DL model called ConvL, which first used CNN to
capture the contextual features that are important to discern 
the type of a post, and then applied LSTM to further utilize 
the sequential relationships between these features to assign 
a label to the post. Through extensive experiments, ConvL 
was demonstrated to achieve about 81%~87% Accuracy in 
classifying discussion posts of different levels of urgency,
confusion, and sentiment. In a similar vein, another study proposed
to use Bi-LSTM to better make use of the sequential relationships
between different terms contained in a post (i.e., from
both the forward and backward directions). By comparing
with SVM and a few DL models, this study showed that
Bi-LSTM performed the best in determining whether a post 
contained a question or not (72%~75% Accuracy). 


It is worth noting that the success of DL models often depends
on the availability of a large amount of human-annotated
data for model training (typically tens of thousands at least).
This, undoubtedly, limits the applicability of DL models in 
tackling tasks with only a small amount of training data 
(e.g., a few thousand). Fortunately, with the aid of pre- 
trained language models like BERT [16], we can still exploit 
the power of DL models [10]. Pre-trained language mod- 
els aim to produce semantically meaningful vector-based 
representations of different words (i.e., word embeddings) 
by training on a large collection of corpora. For instance, 
BERT was trained on English Wikipedia articles and Book 
Corpus, which contain about 2,500 million and 800 million 
words, respectively. Two distinct benefits were brought by 
such pre-trained language models: (i) the word embeddings 
produced by them encode rich contextual and semantic
information of the text and can be well utilized by a task 
model (e.g., ConvL described above) to distinguish different 
types of input data; and (ii) a pre-trained language model 
can be adapted to a specific task by concatenating itself to 
the task model and further fine-tuning/learning their parameters
as a whole with a small amount of training data. For
example, one study showed that BERT was able to boost classification
Accuracy up to 83%~92% when distinguishing posts
of different levels of confusion, sentiment, and urgency. 


Though making impressive progress, the studies described
above were often limited in providing a systematic
comparison between the proposed DL models and existing
traditional ML models. In other words, these studies either
did not include traditional ML models for comparison
or compared DL models with only one or two traditional
ML models, whose potential might have been
suppressed by the limited effort
spent on feature engineering [59]. This necessitates
a systematic evaluation of the two strands of approaches so
as to better guide researchers and practitioners in selecting
models for classifying educational forum posts.


Table 1: The features used as input for traditional ML models. The features used to train models are denoted as Yes under
the column Included.

Category | Feature | Description | # features | Studies used this feature | Included
Textual | # unigrams and bigrams | Only the top 1000 most frequent unigrams/bigrams are included. | 2000 | [12, 57, 40, 2, 56, 62, 51] | Yes
Textual | Post length | # words contained in a post. | 1 | | Yes
Textual | TF-IDF | The term frequency-inverse document frequency (TF-IDF) of the top 1000 most frequent unigrams. | 1000 | | Yes
Textual | Automated readability index | A score ∈ [0, 100] specifying the post readability. | 1 | | Yes
Textual | LIWC | A set of features denoted as scores ∈ [0, 100] indicating the characteristics of a post from various textual categories including: language summary, affect, function words, relativity, cognitive process, time orientation, punctuation, personal concerns, perceptual process, grammar, social and drives. | 84 | [34, 31, 38, 19] | Yes
Textual | Word overlap | The fraction of words that appeared previously in the same post thread. | 1 | | Yes
Textual | # domain-specific words | Words selected by experts to characterize a specific subject, e.g., “equation” and “formula” for Math. | - | [61, 44] | No
Textual | LDA-identified words | Words that are specific to topics discovered by applying the topic modeling method Latent Dirichlet allocation. | - | [62, 4, 44] | No
Textual | Coh-Metrix | A set of features indicating text coherence (i.e., co-reference, referential, causal, spatial, temporal, and structural cohesion), linguistic complexity, text readability, and lexical category. | - | [31, 38] | No
Textual | LSA similarity | A score indicating the average sentence similarity within a message. | - | [31] | No
Textual | Hashtags | Hashtags pre-defined by instructors to characterize the type of a post, e.g., #help and #question for confusion detection. | - | [21] | No
Metadata | # views | The number of views that a post received. | 1 | | Yes
Metadata | Anonymous post | A binary label to indicate whether a post is anonymous to other students. | 1 | [45, 1, 18] | Yes
Metadata | Creation time | The day and the time when a post was made. | 2 | | Yes
Metadata | # votes | The number of votes that a post received. | 1 | | Yes
Metadata | Post type | A binary label to indicate whether a post is a response to another post. | 1 | [36] | Yes
Metadata | Response time | The amount of time before a post was responded to. | - | | No
Metadata | # responses | The number of responses that a post received. | - | | No
Metadata | Discussion status | A binary label to indicate whether the issue has been resolved or not. | - | [61] | No
Metadata | Comment depth | A number assigned to a post to indicate its chronological position within a discussion thread. | - | | No
Metadata | First and last post | A binary label to indicate whether the post is the first or the last in a discussion thread, respectively. | - | | No


3. METHODS 


We open this section by describing the datasets used in our 
study. Then, we introduce the representative traditional 
ML models, including the set of features we engineered to 
empower those models (RQ1), and then describe the two 
DL models we chose to compare to the four traditional ML 
models (RQ2).


3.1 Datasets


To ensure a robust comparison between traditional ML and 
DL models in classifying educational forum posts, we adopted 
two datasets in the evaluation, briefly described below.


Stanford-Urgency consists of 29,604 forum posts collected 
from eleven online courses at Stanford University. These 
courses mainly cover subjects like medicine, education, hu- 
manities, and sciences. To our knowledge, this dataset is one 
of the few open-sourced datasets for classifying educational 
forum posts and was widely used in previous studies 
[21]. In particular, Stanford-Urgency contains 
three types of human-annotated labels, including the degree 
of urgency of a post to be handled by an instructor, the 
degree of confusion expressed by a student in a post, and 
the sentiment polarity of a post. In line with the increasing 
research interest in detecting urgent posts [3], we
used Stanford-Urgency and focused on determining the lev- 
els of urgency of posts in this study. The count of urgent 
and non-urgent posts is 6,418 (22%) and 23,186 (78%), re- 
spectively. Originally, the urgency label was assigned on a 
Likert scale of [1, 7], with 1 denoting not urgent at all
and 7 denoting extremely urgent. Similar
to previous studies [2], we pre-processed the data by
treating posts with a value larger than or equal to 4 as urgent
and those with a value less than 4 as non-urgent, so that the
classification task became a binary classification problem.
It is worth pointing out two notable benefits of including 
Stanford-Urgency: (i) the large number of posts contained 
in Stanford-Urgency provided sufficient training data for DL 
models; and (ii) in addition to the text contained in a post, 
Stanford-Urgency contains rich metadata information about 
the post, e.g., the creation time of a post, whether the cre- 
ator of a post was anonymous to other students, the number 
of up-votes a post received, which enabled us to explore the 
predictive utility of different types of data. 
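
The binarization of the urgency ratings described above can be sketched as follows. This is a minimal illustration, not the authors' code: the function name `binarize_urgency` and the example ratings are hypothetical, and only the threshold of 4 on the 1-7 scale comes from the text.

```python
from typing import List

URGENCY_THRESHOLD = 4  # ratings >= 4 are treated as urgent, as stated in the text


def binarize_urgency(ratings: List[float],
                     threshold: float = URGENCY_THRESHOLD) -> List[int]:
    """Map 1-7 Likert urgency ratings to binary labels (1 = urgent, 0 = non-urgent)."""
    return [1 if rating >= threshold else 0 for rating in ratings]


# Hypothetical ratings, for illustration only (not taken from the dataset).
example_ratings = [1, 3.5, 4, 6.5, 7]
labels = binarize_urgency(example_ratings)
```

Placing the boundary at 4 means that the midpoint of the scale counts as urgent, which is the convention the paragraph above attributes to previous studies.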


Moodle-Content was collected by Monash University. The
dataset contains 3,703 forum posts that students generated
in the Learning Management System Moodle during their
coursework in courses like arts, design, business, economics,
computer science, and engineering. The posts were first
manually labelled by a junior teaching staff member and then
independently reviewed (and corrected if necessary) by two
additional senior teaching staff members to ensure the correctness of
the assigned labels. In contrast to Stanford-Urgency, this
dataset contains labels to indicate whether a post was related
to the knowledge and skills taught in a course or not,
e.g., “What is polynomial regression?” (relevant to course
content) vs. “When is the due date to submit the second as- 
signment?” (irrelevant). The count of content-relevant and 
content-irrelevant posts is 2,339 (63%) and 1,364 (37%), re- 
spectively. Therefore, similar to the adoption of Stanford- 
Urgency, we also tackled a binary classification problem 
here. However, it should be noted that, compared to Stanford- 
Urgency, the metadata of posts were not available in Moodle- 
Content. 


3.2 Traditional Machine Learning Models 

Model Selection. To ensure our evaluation is systematic, 
we included representative models that emerged in previous
studies. As summarized in Section 2.2, the traditional
ML models commonly investigated to date can be roughly
grouped into four categories, i.e., regression-based, Bayes-based,
kernel-based, and tree-based. Therefore, we selected
one model from each group and explored their capabilities
in classifying educational forum posts, namely Logistic
Regression, Naive Bayes, SVM, and Random Forest.


Feature Engineering. Different from previous studies
[5], we argued that traditional ML models should involve an
extensive set of meaningful features to fully unleash their
predictive potential before being compared to DL models.
Specifically, we expected that ML models would demonstrate
improved performance when utilising more features. Therefore,
we surveyed studies that reported on applying traditional
ML models to classify educational forum posts, engineered
features following previous studies, and incorporated
those features into the four traditional ML models, as summarized
in Table 1. These features can be classified into
two broad categories: (i) textual features that are extracted 
from the raw text of a post with the aid of NLP techniques; 
and (ii) metadata features about a post. As the metadata 
of posts was not available in Moodle-Content, only textual 
features were engineered for this dataset, while both tex- 
tual and metadata features were engineered for Stanford- 
Urgency. We excluded several types of features from the 
evaluation, mainly due to the unavailability of the data re- 
quired to engineer those features, e.g., # domain-specific 
words and Hashtags. As for LDA-identified words, Coh-Metrix,
and LSA similarity, we have left these features to
be explored in our future work.


Feature Importance Analysis. Previous studies 
have demonstrated the benefits of feature importance anal- 
ysis in providing a theoretical understanding of the underly- 
ing constructs that are useful to classify educational forum 
posts, e.g., identifying features that are useful across differ- 
ent classification tasks. Therefore, we adopted the following 
approach to identify the top k most important features of 
an ML model: 


1. the Chi-squared statistics between engineered features 
and the target classification labels were computed; 


2. each time, the feature of the highest Chi-squared statis- 
tic was fed into the model and the feature was kept in 
the set of input features only if the classification per- 
formance had increased; 


3. we repeated (2) until k most important features were 
identified. 
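

The three steps above can be sketched as a greedy loop. This is an illustrative reading of the procedure under stated assumptions, not the authors' implementation: the Chi-squared statistics are assumed to be precomputed and passed in as `chi2_scores`, the hypothetical `evaluate` callback stands in for training a model on a feature subset and returning its classification performance, and the first candidate is compared against a baseline of negative infinity (so it is always kept).

```python
from typing import Callable, Dict, List, Sequence


def select_top_k(chi2_scores: Dict[str, float],
                 evaluate: Callable[[Sequence[str]], float],
                 k: int) -> List[str]:
    """Greedy selection: try features in descending Chi-squared order and
    keep a feature only if it improves the evaluation score."""
    # Step 1: rank features by their (precomputed) Chi-squared statistic.
    ranked = sorted(chi2_scores, key=chi2_scores.get, reverse=True)
    selected: List[str] = []
    best_score = float("-inf")  # assumption: no baseline score is given
    for feature in ranked:
        if len(selected) == k:  # Step 3: stop once k features are kept
            break
        # Step 2: add the next-highest feature and re-evaluate the model.
        score = evaluate(selected + [feature])
        if score > best_score:
            selected.append(feature)
            best_score = score
    return selected


# Toy stand-ins (hypothetical values): each feature adds a fixed gain.
toy_chi2 = {"a": 9.0, "b": 5.0, "c": 3.0, "d": 1.0}
toy_gain = {"a": 0.5, "b": 0.0, "c": 0.1, "d": 0.2}
top_two = select_top_k(toy_chi2, lambda feats: sum(toy_gain[f] for f in feats), k=2)
```

In the toy run, feature `b` ranks second by Chi-squared statistic but is skipped because it does not raise the score, which is the behaviour step 2 describes.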


3.3 Deep Learning Models

Existing studies on developing DL models to characterize 
different types of forum posts, typically, involved the use of 
CNN or LSTM, which motivated us to include the following 
two DL models in our evaluation:


• CNN-LSTM [59]. This model consists of: (i)
an input layer, which learns an embedding representation
for each word contained in the input text; (ii) a
CNN layer, which performs a one-dimensional convolution
operation on the embedding representation produced
by the input layer and captures the contextual
information related to each word; (iii) an LSTM layer,
which takes the output of the CNN layer to make use
of the sequential information of the words; and (iv) a
classification layer, which is a fully-connected layer taking
the output of the LSTM layer as input to assign a
label to the input text.


• Bi-LSTM [6]. Though LSTM has been demonstrated
as somewhat effective in utilizing the sequential
information of long input text, it is limited to
using only the previous words to predict the later words in
the input text. Therefore, Bi-directional LSTM was
proposed, which consists of two LSTM layers, one representing
text information in the forward direction and
the other in the backward direction, to better capture
the sequential information between different words. Formally,
this model consists of: (i) an input layer (same
as CNN-LSTM); (ii) a Bi-LSTM layer; and (iii) a classification
layer (same as CNN-LSTM).
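

The bi-directional encoding described above can be written compactly. The following is the standard textbook formulation rather than anything stated in the paper; the symbols $x_t$ (the embedding of the $t$-th word), $T$ (the number of words in a post), and the hidden states $\overrightarrow{h}_t$ and $\overleftarrow{h}_t$ are notation assumed here for illustration.

```latex
\begin{aligned}
\overrightarrow{h}_t &= \mathrm{LSTM}_{\mathrm{fwd}}\bigl(x_t,\ \overrightarrow{h}_{t-1}\bigr), & t &= 1, \dots, T \\
\overleftarrow{h}_t &= \mathrm{LSTM}_{\mathrm{bwd}}\bigl(x_t,\ \overleftarrow{h}_{t+1}\bigr), & t &= T, \dots, 1 \\
h_t &= \bigl[\, \overrightarrow{h}_t \,;\, \overleftarrow{h}_t \,\bigr]
\end{aligned}
```

Each $h_t$ therefore combines context from the words before and after position $t$. How the classification layer consumes these states (for example, the final states or a pooled summary) is not specified in the text and is left open here.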


Both CNN-LSTM and Bi-LSTM use an input layer to learn 
the representation of the input text, i.e., embeddings of the 
words in a post. Instead of learning word embeddings dur- 
ing training, previous studies suggested that pre- 
trained language models like BERT can be used to initialize 
embeddings. Such embedding initialization has been demon- 
strated as an effective way to facilitate a task model to ac- 
quire better performance. Therefore, we adopted BERT to 
initialize the input layer of both CNN-LSTM and Bi-LSTM 
and, correspondingly, the implemented models are denoted 
as Emb-CNN-LSTM and Emb-Bi-LSTM, respectively.


In addition to word embeddings initialization, as suggested 
in recent studies in the field of NLP [33], we can fur- 
ther couple BERT with a task model (i.e., CNN-LSTM or 
Bi-LSTM) and adapt BERT to suit the unique character- 
istics of a task by training BERT and the task model as 
a whole. In other words, the task model is often concatenated
on top of BERT’s output for the [CLS] token, which is a
special token used in BERT to encode the information of
the whole input text. The co-training of BERT and the
task model enables BERT to fine-tune its parameters to 
produce task-specific word embeddings for the input text, 
which further facilitates the task model to determine a suit- 
able label for the input. In fact, this fine-tuning strategy, 
compared to being used for embedding initialization, has 
been demonstrated as a more promising approach to make 
use of BERT. For instance, prior work showed that, even by simply 
coupling with a classification layer (i.e., the last layer 
of CNN-LSTM and Bi-LSTM), BERT was capable of accurately 
classifying 92% of forum posts. Most importantly, it 
should be noted that the parameters of the coupled model 
can be well fine-tuned/learned with only a few thousand 
data samples. That means, this fine-tuning strategy enables 
CNN-LSTM and Bi-LSTM to be also applicable to tasks 
that deal with only a small amount of data, e.g., Moodle- 
Content in our case. In summary, we fine-tuned BERT after 
coupling it with CNN-LSTM (CNN-LSTM-Tuned) and Bi- 
LSTM (Bi-LSTM-Tuned), respectively. Besides, to gain a 
clear understanding of the effectiveness of this fine-tuning 
strategy, we coupled BERT with only a single classifica- 
tion layer (denoted as SCL-Tuned) and compared it with 
CNN-LSTM-Tuned and Bi-LSTM-Tuned. Table 2 provides 
a summary of the DL models implemented in this study. 


Table 2: The DL models used in this study. Here, SCL 
denotes Single Classification Layer. 


                  Usage of BERT                            Task Model 
Models            Embedding initialization   Fine-tuning | CNN-LSTM   Bi-LSTM   SCL 
Emb-CNN-LSTM      ✓                                      | ✓ 
Emb-Bi-LSTM       ✓                                      |            ✓ 
CNN-LSTM-Tuned                               ✓           | ✓ 
Bi-LSTM-Tuned                                ✓           |            ✓ 
SCL-Tuned                                    ✓           |                      ✓ 


3.4 Experiment Setup 


Data pre-processing. Training and testing data were ran- 
domly split in the ratio of 8:2. The Python package NLTK 
was applied to perform lower casing and stemming on the 
raw text of a post after removing the stop words. 
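
The random 8:2 split described above can be sketched with the Python standard library alone. This is a minimal illustration, not the authors' code: the function name `split_train_test`, the fixed seed, and the toy posts are assumptions introduced here, and the NLTK steps (stop-word removal, lower casing, stemming) are not reproduced.

```python
import random


def split_train_test(posts, train_ratio=0.8, seed=0):
    """Randomly split a list of posts into training and testing subsets.

    `seed` is a hypothetical choice that only makes the sketch reproducible.
    """
    shuffled = list(posts)                  # copy so the caller's list is untouched
    random.Random(seed).shuffle(shuffled)   # random permutation of the posts
    cut = int(round(len(shuffled) * train_ratio))
    return shuffled[:cut], shuffled[cut:]


# Toy data standing in for forum posts.
posts = [f"post {i}" for i in range(10)]
train, test = split_train_test(posts)
```

With ten toy posts, eight land in the training subset and two in the testing subset, and no post appears in both.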


Evaluation metrics. In line with previous works in classify- 
ing educational forum posts, we adopted the following four 
metrics, i.e., Accuracy, Cohen’s κ, AUC, and F1 score, to 
examine model performance. We ran each model three times 
and reported the averaged results. 
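
As a hedged illustration of three of these metrics, the sketch below computes Accuracy, Cohen's κ, and the F1 score for binary labels from a confusion matrix. It assumes that label 1 is the positive class (the text does not state this), the toy labels are invented, and AUC is omitted because it needs predicted scores rather than hard labels.

```python
def confusion_counts(y_true, y_pred):
    """Return (tp, tn, fp, fn) for binary labels, treating 1 as positive."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    return tp, tn, fp, fn


def accuracy(y_true, y_pred):
    tp, tn, fp, fn = confusion_counts(y_true, y_pred)
    return (tp + tn) / (tp + tn + fp + fn)


def f1_score(y_true, y_pred):
    tp, tn, fp, fn = confusion_counts(y_true, y_pred)
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


def cohens_kappa(y_true, y_pred):
    """Observed agreement corrected for the agreement expected by chance."""
    tp, tn, fp, fn = confusion_counts(y_true, y_pred)
    n = tp + tn + fp + fn
    p_observed = (tp + tn) / n
    p_expected = ((tp + fn) * (tp + fp) + (tn + fp) * (tn + fn)) / (n * n)
    if p_expected == 1:
        return 0.0
    return (p_observed - p_expected) / (1 - p_expected)


# Invented toy labels: 3 true positives, 3 true negatives, 1 FP, 1 FN.
y_true = [1, 1, 1, 0, 0, 0, 1, 0]
y_pred = [1, 1, 0, 0, 0, 1, 1, 0]
```

On these toy labels, accuracy and F1 are both 0.75 while κ is 0.5, which shows how κ discounts the agreement a classifier would reach by chance on balanced classes.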


Model implementation and training. The traditional ML 
models (i.e., Logistic Regression, Naive Bayes, SVM, and 
Random Forest) were implemented with the aid of the Python 
package scikit-learn, and their parameters were determined 
by applying grid search, with the grid fitted on the training 
data. Note that all model hyper-parameters will be documented 
in the released GitHub repository. The ML models were 
trained with textual and metadata features for the Stanford- 
Urgency dataset, and trained with textual features for the 
Moodle-Content dataset. When applying the method detailed 
in Section 3.2 to perform feature importance analysis, we used 
the F1 score as the metric to measure the change in model 
performance. For both CNN-LSTM and Bi-LSTM, the model 
parameters were selected to be comparable with similar 
previous works [10]. To this end, the size of 
the BERT embeddings used in the input layer was 768 and 
the number of hidden units used in the final classification 
layer was 1. We used the activation function sigmoid and 
L2 regularizer. In CNN-LSTM, the CNN layer was set to 
have 128 convolution filters with filter width of 5, while the 
LSTM layer was set to have 128 hidden states and 128 cell 
states. In Bi-LSTM, the number of the hidden states and 
cell states in the LSTM cells was both set to 128. For all DL 
models, (i) 10% of the training data was randomly selected 
as the validation data; (ii) the batch size was set to 32 and 
the maximum length of the input text was set to 512; (iii) 
the optimization algorithm Adam was used; (iv) the learning 
rate was set by applying the one cycle policy with maximum 
learning rate of 2e-05; (v) the dropout probability was set 
to 0.5; and (vi) the maximum number of training epochs 
was 50 and early stopping mechanisms were used when the 
model performance on the validation data starts to decrease, 
and data shuffling was performed at the end of each epoch. 
The best model was selected based on validation error. For 
BERT, we used the service provided by Bert-as-service.¹ 
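
The early-stopping rule and validation-based model selection described above can be sketched as follows. This is an illustration under stated assumptions rather than the authors' training loop: the function name `select_best_epoch`, the `patience` value of 1, and the simulated validation errors are hypothetical, and only the 50-epoch cap comes from the text.

```python
def select_best_epoch(val_errors, max_epochs=50, patience=1):
    """Scan per-epoch validation errors and stop once they stop improving.

    Returns (best_epoch, best_error, epochs_run), with epochs counted from 1.
    `patience` (hypothetical) is how many non-improving epochs are tolerated.
    """
    best_epoch, best_error = None, float("inf")
    epochs_without_improvement = 0
    epochs_run = 0
    for epoch, error in enumerate(val_errors[:max_epochs], start=1):
        epochs_run = epoch
        if error < best_error:
            # New best model: remember it and reset the counter.
            best_epoch, best_error = epoch, error
            epochs_without_improvement = 0
        else:
            epochs_without_improvement += 1
            if epochs_without_improvement >= patience:
                break  # validation performance started to degrade
    return best_epoch, best_error, epochs_run


# Simulated validation errors: improvement for three epochs, then a rise.
result = select_best_epoch([0.30, 0.22, 0.19, 0.20, 0.17])
```

In this simulated run, training stops after the fourth epoch and the model from the third epoch is kept, so the later, lower error is never reached. That is the usual trade-off of a small patience value.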


¹ https://github.com/hanxiao/bert-as-service 

Table 3: The performance of traditional ML models. The results in bold represent the best performance in each task. 


                     Stanford-Urgency                      | Moodle-Content 
Methods              Accuracy  Cohen’s κ  AUC     F1       | Accuracy  Cohen’s κ  AUC     F1 
Naive Bayes          0.7536    0.5071     0.7762  0.7844   | 0.7183    0.4736     0.7210  0.6870 
SVM                  0.8627    0.7347     0.8630  0.8185   | 0.7536    0.5900     0.7536  0.7530 
Random Forest        0.8915    0.7892     0.8916  0.8918   | 0.7544    0.5927     0.7551  0.7661 
Logistic Regression  0.8068    0.6287     0.8068  0.7638   | 0.7339    0.5251     0.7357  0.7547 


Table 4: The performance of Random Forest on Stanford-Urgency when using different types of features as input. The results 
in bold represent the best performance. 


Types of Features   Accuracy  Cohen’s κ  AUC     F1 
Textual             0.8639    0.7368     0.8642  0.8652 
Metadata            0.8150    0.6442     0.8152  0.8136 
Textual + Metadata  0.8915    0.7892     0.8916  0.8918 


Table 5: The performance of Random Forest when only using the top-10 most important features (Table 6) as input. The 
fractions within brackets indicate the decreased performance compared to those with all available features as input (Table 3). 


Dataset           Accuracy         Cohen’s κ        AUC              F1 
Stanford-Urgency  0.8610 (-3.42%)  0.7315 (-7.31%)  0.8617 (-3.35%)  0.8628 (-3.25%) 


4. RESULTS 


Results on RQ1. The performance of the four traditional ML 
models is presented in Table 3. Across both classification 
tasks, Random Forest achieved the best performance, as per 
the calculated evaluation metrics, followed by SVM and Logistic 
Regression. Naive Bayes, on the other hand, achieved 
the lowest performance. Specifically, Random Forest was capable 
of accurately classifying almost 90% of the forum posts 
in Stanford-Urgency, and reached an AUC and F1 score of 
0.8916 and 0.8918, respectively. Besides, the Cohen’s κ score 
achieved by Random Forest for the same dataset was 0.7892, 
which indicates a substantial (close to almost perfect) classification 
performance. In terms of classifying Moodle-Content, 
we noticed the overall performance of all models was lower 
than in Stanford-Urgency. This may be attributed to the 
lack of metadata features and significantly fewer posts in 
Moodle-Content than in Stanford-Urgency, making it harder 
for the models to reveal characteristics of different types of 
posts in Moodle-Content. Still, Random Forest achieved an 
overall accuracy, AUC, and F1 score of 0.7544, 0.7551, and 
0.7661, respectively, and the Cohen’s κ score was very close to 
0.6, which indicates an almost substantial classification 
performance. 
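
The labels "substantial" and "almost perfect" used above match the commonly used Landis and Koch agreement bands for κ. The sketch below assumes that scale, since the text does not name the scale it relies on; the function name `interpret_kappa` is hypothetical.

```python
def interpret_kappa(kappa):
    """Map a Cohen's kappa value to a Landis and Koch agreement band.

    Assumption: the paper's wording follows this scale; the text does not say so.
    """
    if kappa < 0:
        return "poor"
    bands = [
        (0.20, "slight"),
        (0.40, "fair"),
        (0.60, "moderate"),
        (0.80, "substantial"),
        (1.00, "almost perfect"),
    ]
    for upper_bound, label in bands:
        if kappa <= upper_bound:
            return label
    raise ValueError("kappa cannot exceed 1.0")


# Random Forest values reported in Table 3.
stanford_band = interpret_kappa(0.7892)
moodle_band = interpret_kappa(0.5927)
```

Under this scale, 0.7892 falls in the "substantial" band just below the "almost perfect" threshold of 0.80, and 0.5927 falls in the "moderate" band just below "substantial", which is consistent with the wording in the paragraph above.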


Before delving into the identification of the most predictive 
features, we submitted each group of the textual and meta- 
data features to the best-performing ML model (i.e., Ran- 
dom Forest) to depict their overall predictive power. The 
results are given in Table 4, derived only from Stanford- 
Urgency due to the unavailability of the metadata features 
in Moodle-Content. We observe that both textual and meta- 
data features were useful in boosting classification perfor- 
mance, and textual features seem to have had a stronger ca- 
pacity in distinguishing urgent from non-urgent posts. For 
instance, when only taking textual features into consideration, 
the AUC score was 0.8642, which is about 6% higher 
than that of metadata features (0.8152) and only about 3% lower 
than that when considering both textual features and metadata 
features (0.8916). 


Table 5 (continued): Moodle-Content. 

Dataset           Accuracy         Cohen’s κ        AUC              F1 
Moodle-Content    0.7175 (-4.89%)  0.5577 (-5.91%)  0.7186 (-4.83%)  0.7358 (-3.96%) 


To gain a deeper understanding of the predictive power of 
different features, we further applied the method described 
in Section 3.2 to select the top 10 most important features 
in both Stanford-Urgency and Moodle-Content, described in 
Table 6. Here, several interesting observations can be made. 


Firstly, almost all of the identified features were textual fea- 
tures, with only one exception observed in Stanford-Urgency, 
i.e., the metadata feature # views. This is in line with the 
findings we observed in Table 4, i.e., compared to metadata 
features, textual features tended to make a larger contribution 
in classifying forum posts. Among those textual 
features, we should also notice that most of them were extracted 
with the aid of LIWC. This corroborates the 
findings presented in previous studies [31, 38, 19], i.e., LIWC 
is a useful tool in identifying meaningful features for char- 
acterizing educational forum posts. 


Secondly, there is little overlap regarding the top ten most 
important features in the two tasks (only two shared features, 
i.e., LIWC: pronoun and LIWC: posemo). In particular, 
we note that the top features were highly related to 
the context of a classification task. In the Stanford-Urgency 
case, a number of top features were associated with a sense of 
stimulation (e.g., anxiety, affect, drive), which represents a 
subjective representation of urgency. In the Moodle-Content 
case, features were more associated with a sense of investi- 
gation (e.g., Analytic and Understand). This shows that dif- 
ferent classification tasks (i.e., Urgency vs. Content-related) 
require task-specific features to best capture the task-specific 
information (i.e., whether the post expressed a sense of urgency). 


Table 6: The top 10 most important features used in Random Forest. Features shared by the two tasks are in bold. 


Stanford-Urgency 

Features            Description 
Metadata: # views   The number of views that a post received. 
LIWC: pronoun       # of the occurrence of all pronouns (e.g., personal and impersonal pronouns) 
Unigram: they       # of the occurrence of the word “they” 
LIWC: number        # of the occurrence of the digital numbers 
LIWC: affect        A score indicating the overall emotion (positive and negative) of a post 
LIWC: posemo        A score indicating the positive emotion of a post 
LIWC: drives        A score indicating [...] a post (e.g., references to success and failure) 
LIWC: power         A score indicating the power of a post (e.g., a reference to dominance) 
LIWC: anx           A score indicating the anxiety conveyed in a post 
LIWC: QMark         # of the occurrence of question mark 


Moodle-Content 

Features            Description 
Post length         # words contained in a post. 
LIWC: Analytic      A score indicating the formal, logical, and hierarchical thinking patterns in a post 
LIWC: Tone          A score indicating the emotional tone conveyed in a post 
LIWC: pronoun       # of the occurrence of all pronouns (e.g., personal and impersonal pronouns) 
LIWC: ppron         # of the occurrence of all personal pronouns (e.g., he, she, me) in a post 
Unigram: I          # of the occurrence of the word “I” 
LIWC: posemo        A score indicating the positive emotion of a post 
TF-IDF: understand  The TF-IDF score of the word “understand” 
LIWC: affiliation   A score indicating the [...] for enjoying close, harmonious relationships conveyed in a post 
LIWC: Exclam        # of the occurrence of exclamation mark 


Table 7: The performance of DL models. The results in bold represent the best performance in each task. The fractions 
within brackets indicate the increased performance compared to the best performance achieved by Random Forest (Table 3). 


Dataset           Models             Accuracy        Cohen’s κ       AUC             F1 
Stanford-Urgency  1. Emb-CNN-LSTM    0.9203 (3.23%)  0.8192 (3.80%)  0.9201 (3.20%)  0.9203 (3.20%) 
                  2. Emb-Bi-LSTM     0.9159 (2.73%)  0.8051 (2.01%)  0.9153 (2.66%)  0.9159 (2.71%) 
                  3. CNN-LSTM-Tuned  0.9211 (3.32%)  0.8210 (4.02%)  0.9221 (3.42%)  0.9221 (3.40%) 
                  4. Bi-LSTM-Tuned   0.9210 (3.30%)  0.8196 (3.85%)  0.9208 (3.28%)  0.9210 (3.27%) 
                  5. SCL-Tuned       0.9210 (3.31%)  0.8206 (3.98%)  0.9215 (3.35%)  0.9219 (3.38%) 
Moodle-Content    6. CNN-LSTM-Tuned  0.7934 (5.17%)  0.6230 (5.11%)  0.7952 (5.32%)  0.7993 (4.33%) 
                  7. Bi-LSTM-Tuned   0.7854 (4.11%)  0.6220 (4.93%)  0.7901 (4.64%)  0.7913 (3.29%) 
                  8. SCL-Tuned       0.7716 (2.29%)  0.6092 (2.77%)  0.7733 (2.42%)  0.7803 (1.85%) 


Moreover, when solely using the top 10 features as an input, 
the performance of Random Forest was 3.25% to 7.31% lower 
than the performance obtained after incorporating all available 
features (Table 5). This finding hence confirms that 
while the traditional ML models can achieve good classifica- 
tion performance using only the top 10 best features, there 
is still potential for improvement when using more features. 
Hence, researchers should attempt to apply more features to 
fully unleash traditional ML models’ capability. 
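
The bracketed percentages in Table 5 appear to be relative changes against the all-features results in Table 3. The short check below reproduces the Stanford-Urgency figures under that reading; the helper name `relative_change_pct` is hypothetical, and the input values are copied from Tables 3 and 5.

```python
def relative_change_pct(new_value, baseline):
    """Relative change of new_value against baseline, in percent."""
    return (new_value - baseline) / baseline * 100.0


# Random Forest on Stanford-Urgency: all features (Table 3) vs. top-10 features (Table 5).
all_features = {"Accuracy": 0.8915, "Cohen's kappa": 0.7892, "AUC": 0.8916, "F1": 0.8918}
top10_features = {"Accuracy": 0.8610, "Cohen's kappa": 0.7315, "AUC": 0.8617, "F1": 0.8628}

changes = {
    metric: round(relative_change_pct(top10_features[metric], all_features[metric]), 2)
    for metric in all_features
}
```

The resulting values are -3.42, -7.31, -3.35, and -3.25 percent, matching the brackets in Table 5. The same formula with the Random Forest baseline gives the bracketed gains reported for the DL models in Table 7.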


Results on RQ2. The performance of the implemented DL 
models is presented in Table 7. As Moodle-Content contained 
only 3,703 labeled posts, that was likely to be insufficient 
to support the training of CNN-LSTM or Bi-LSTM 
from scratch. Therefore, we only implemented the fine-tuned 
models, i.e., CNN-LSTM-Tuned, Bi-LSTM-Tuned, and 
SCL-Tuned on Moodle-Content. Several observations can be 
derived based on the results in Table 7. 


Firstly and unsurprisingly, DL models uniformly achieved a 
better performance than traditional ML models. This corroborates 
findings reported in [24]. DL models 
are, therefore, superior to traditional ML models in terms 
of capturing the characteristics of a dataset and obtaining 
better classification results. However, we should note that 
the performance difference between traditional ML models 
and DL models was not that large. Specifically, on Stanford-Urgency, 
the best-performing model CNN-LSTM-Tuned achieved an improvement 
of only 3.32% in Accuracy, 4.02% in Cohen’s κ, 3.42% 
in AUC, and 3.40% in F1 score. In particular, the Cohen’s 
κ score was 0.8210, which suggests an almost perfect classification 
performance. 


Secondly, in contrast to previously reported findings, we found that 
CNN-LSTM slightly outperformed Bi-LSTM in most cases (i.e., 
Row 1 vs. Row 2, Row 3 vs. Row 4, and Row 6 vs. Row 7 in 
Table 7). Thirdly, instead of using BERT for embedding initialization, 
the classification model achieved better performance 
when BERT was fine-tuned by coupling it with the task 
model and training the coupled model as a whole (i.e., Rows 
1-2 vs. Rows 3-4 in Table 7), though the improvement was 
rather limited, e.g., less than 1% when comparing 
Emb-CNN-LSTM with CNN-LSTM-Tuned on Stanford-Urgency. 
Fourthly, we showed that in Stanford-Urgency, 
by simply coupling BERT with a single classification layer 
(SCL-Tuned, Row 5 in Table 7), the classification performance 
was almost as good as that derived by coupling 
BERT with more complex DL models like CNN-LSTM and 
Bi-LSTM (Rows 3-4 in Table 7). This implies that BERT 
can capture the rich semantic information hidden behind a 
post, which can be used to deliver adequate classification 
performance even by employing a single classification layer. 


5. DISCUSSION AND CONCLUSION 


The classification of educational forum posts has been a 
longstanding task in the research of Learning Analytics and 
Educational Data Mining. Although a number of previous studies 
have been conducted to explore the applicability and effectiveness 
of traditional ML models and DL models in solving 
this task, a systematic comparison between these two 
types of approaches has not been conducted to date. Therefore, 
this study set out to provide such an evaluation, with 
the aim of paving the way for researchers and practitioners 
to select appropriate predictive models when tackling this 
task. Specifically, we compared the performance of four representative 
traditional ML models (i.e., Logistic Regression, 
Naive Bayes, SVM, and Random Forest) and two commonly 
applied DL models (i.e., CNN-LSTM and Bi-LSTM) on two 
datasets. We further elaborate on several implications that 
our work may have on the development of classifiers for edu- 
cational forum posts. We also list limitations to be addressed 
in future studies. 


Implications. Firstly, the performance difference between 
traditional ML models and DL models was not as large as 
reported by previous studies (e.g., [59]). More specifically, 
we showed that traditional ML models were inferior 
to DL models by only 1.85% to 5.32% in 
classification performance, as measured by Accuracy, Cohen’s 
κ, AUC, and F1 score. This finding implies that, when researchers 
and practitioners have no access to strong computing 
resources and, for this reason, cannot utilize DL models, 
they can still achieve acceptable classification performance 
by using traditional ML models, as long as those ML models 
incorporate carefully-crafted features. 


Secondly, our results demonstrate that the performance of 
Random Forest classifier is more robust compared to other 
traditional ML models. This implies that other more advanced 
tree-based ML models (e.g., Gradient Tree Boosting 
[9]) might be worth exploring to achieve even higher classification 
performance. Besides, given that the most important 
feature in Stanford-Urgency was # views (Table 6) 
and the models’ performance in Moodle-Content might be 
suppressed due to the unavailability of metadata features, 
it may be worth paying special attention to acquiring and 
using metadata features when applying traditional ML mod- 
els. Another finding suggests that little overlap was detected 
between the top 10 most important features selected in each 
of the two classification tasks (Table 6). This implies that, when 
tackling a classification task, features should be designed to 
suit the unique characteristics of the task and fit the theo- 
retical model utilized to annotate data (e.g., with predefined 
coding scheme). This aligns with findings presented in 
[38, 19], where, in different phases of cognitive presence, different 
importance scores were obtained for the same features. Lastly, 
researchers and practitioners may wish to take advantage 
of pre-trained language models like BERT when developing 
DL models. Our experiment showed that BERT can be 
effectively used in two ways, i.e., (i) to initialize the word 
embeddings of the post text as the input for a task model; 
or (ii) to suit the needs of the specific classification task 
by coupling itself with the task model and then fine-tuning 
model parameters. Particularly, the second way enables DL 
models to be applicable to tasks that deal with only a small 
amount of human-annotated data (as in Moodle-Content). 


Limitations. Firstly, the evaluation presented in this study 
focused on only two classification tasks, i.e., Stanford-Urgency 
and Moodle-Content. To further increase the reliability of 
the presented findings, more tasks should be included and 
investigated, e.g., determining the level of confusion that 
a student expressed in a forum post or whether the senti- 
ment contained in the post is positive or negative [54]. 
Secondly, a few types of features were not included when ex- 
ploring the capabilities of traditional ML models in our eval- 
uation, e.g., # domain-specific words and LDA-identified 
words. To accurately depict the upper bound of the performance 
of traditional ML models in classifying educational 
forum posts, it would be worthwhile to recruit domain experts 
to further engineer and make use of these features. Thirdly, 
we should notice that the DL models used in our evaluation 
(i.e., CNN-LSTM and Bi-LSTM) only utilized the raw text 
of a post as input and left the metadata features untapped. 
Given that metadata features have been demonstrated to be of 
great importance in the application of traditional ML models, 
future research efforts should also be allocated to designing 
more advanced DL models that are capable of using both 
the raw text of a post and the metadata of the post for 
classification. 


Lastly, we acknowledge that, due to the scope of this study, 
we did not attempt to investigate the reasons causing the 
performance difference between traditional ML models and 
DL models, e.g., whether the two categories of models mis- 
classified the same types of messages. In the future, we 
will further investigate whether the performance difference 
between traditional ML models and DL models can be at- 
tributed to their model structures and explore potential meth- 
ods to boost their classification performance, e.g., collecting 
additional forum posts to continue the pre-training of BERT 
before coupling it with a downstream classification model. 


ABSTRACT 


Engaged and disengaged behaviors have been studied across 
a variety of educational contexts. However, tools to ana- 
lyze engagement typically require custom-coding and cali- 
bration for a system. This limits engagement detection to 
systems where experts are available to study patterns and 
build detectors. This work studies a new approach to clas- 
sify engagement patterns without expert input, by using a 
play persona methodology where labeled archetype data is 
generated by novice testers acting out different engagement 
patterns in a system. Domain-agnostic task features (e.g., 
response time to an activity, scores/correctness, task diffi- 
culty) are extracted from standardized data logs for both 
archetype and authentic user sessions. A semi-supervised 
methodology was used to label engagement; bottom-up clus- 
ters were combined with archetype data to build a classi- 
fier. This approach was analyzed with a focus on cold-start 
performance on small samples, using two metrics: consis- 
tency with larger full-sample cluster assignments and sta- 
bility of points staying in the same cluster once assigned. 
These were compared against a baseline of clustering with- 
out an incrementally trained classifier. Findings on a data 
set from a branching multiple-choice scenario-based tutoring 
system indicated that approximately 52 unlabeled samples 
and 51 play-test labeled samples were sufficient to classify 
holdout sessions at 85% consistency with a full set of 145 un- 
supervised samples. Additionally, alignment to play persona 
samples for the full set matched expert labels for clusters. 
Use-cases and limitations of this approach are discussed. 


1. INTRODUCTION 


Engagement represents a necessary (though not sufficient) 
condition for learning. Engagement has been shown to im- 
pact learning [4] and persistence [9]. Research has also found 
that engagement is actionable and can be increased [25]. 
This is a particularly important topic for computer-based 
learning: unlike in a classroom, where engagement can be 
assessed and acted on by an instructor in real-time, patterns 
of engagement are often not visible [10]. 


However, building engagement analytics for a new system 
is time consuming. Custom metrics are typically developed 
and then require substantial data to identify patterns (i.e., 
the cold-start problem). Worse, the extensive effort to de- 
sign such analytics is buried in application-specific code. 
While heuristics are available to infer disengagement, such as 
response times under 3 seconds [6], applying these to differ- 
ent systems requires benchmarking and calibrating detectors 
for the content and system. Efforts to analyze engagement 
often start almost from scratch. This is unfortunate, since 
research on behavioral engagement has identified patterns 
which appear to generalize across systems [3, 4, 6, 14, 15]. 
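
As a hedged illustration of the kind of heuristic mentioned above, the sketch below flags responses faster than 3 seconds as possible disengagement. Only the 3-second threshold comes from the text; the function names and the sample response times are invented, and, as the paragraph notes, such a threshold would still need calibration for a given system and content.

```python
RAPID_RESPONSE_SECONDS = 3.0  # heuristic threshold mentioned in the text


def flag_rapid_responses(response_times, threshold=RAPID_RESPONSE_SECONDS):
    """Return one boolean per response: True if it was faster than the threshold."""
    return [seconds < threshold for seconds in response_times]


def rapid_response_rate(response_times, threshold=RAPID_RESPONSE_SECONDS):
    """Fraction of responses in a session that fall under the threshold."""
    flags = flag_rapid_responses(response_times, threshold)
    return sum(flags) / len(flags) if flags else 0.0


# Invented response times (in seconds) for one session.
session_times = [1.2, 8.5, 2.9, 15.0, 3.0, 0.8]
flags = flag_rapid_responses(session_times)
rate = rapid_response_rate(session_times)
```

Here three of the six responses are flagged, so the session-level rate is 0.5. A response of exactly 3.0 seconds is not flagged because the comparison is strict.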


To address this gap, we are researching a service for analyz- 
ing and classifying engagement that relies on a standards- 
based learning record store [1]. This effort is called the Ser- 
vice for Measurement and Adaptation to Real-Time Engage- 
ment (SMART-E). Rather than being optimized to analyze 
a specific system or data set, SMART-E targets three high- 
level goals: 1) Cold-Start Calibration: ability to identify and 
benchmark engagement behaviors, which does not require 
large data sets or in-depth expert analysis; 2) Re-Usability: 
reliance on standards and data available from most learning 
environments; and 3) Actionability: generation of action- 
able insights, which an instructor or adaptive system could 
leverage or investigate further. 


SMART-E is influenced by two techniques: 1) semi-supervised 
learning, which trains with a small set of labeled data and 
a larger set of unlabeled data, and 2) play persona, behavioral 
archetypes commonly used for testing and analysis of 
video games [7, 32]. Our paper describes the process and 
findings from applying this approach to a data set from a 
scenario-based tutoring system for training counseling skills. 
Contributions from this work include a) reviewing features 
that generalized engagement analytics should consider, b) 
developing a pipeline for analyzing engagement which does 
not require expert labeling or application-specific feature en- 
gineering, and c) demonstrating the effectiveness of a semi- 
supervised approach with reasonable data requirements (e.g., 
about 50 samples each of labeled and unlabeled data) to ap- 
proximate inferences that experts might make given similar 
data. As such, this research represents a step toward a gen- 
eralized framework for diagnosing learner engagement that 
does not require an expert researcher analyzing data or ob- 
serving subjects. 


2. BACKGROUND AND THEORY 


Across the learning science community, engagement is de- 
fined and measured in vastly different ways, ranging from 
split-second physiological responses (e.g., eye tracking, fa- 
cial affect) to long-term trends lasting months or years (e.g., 
returning to a system, building social ties) [2, 13]. The re- 
search in this paper targets behavioral engagement at the 
task level (e.g., time spent working through a problem) and 
session level (e.g., sustained effort to improve performance 
and learning). 


A key reason for this focus is data availability and data in- 
terpretability. Most systems collect data logs at these levels 
and, as described next, substantial research has also iden- 
tified common behavioral patterns. Research on lower-level 
affective cues (e.g., facial affect) has found certain action- 
able events that generalize (e.g., gaze inattention [15]), but 
other patterns are not trivial to generalize due to differences 
between individuals or across contexts [28]. Moreover, facial 
data is often unavailable due to the privacy issues involved 
with recording learning. Larger time scales are not the focus 
of this work because engagement levels over those time scales 
would require longitudinal data and also are more likely to 
be visible to instructors (e.g., absences). 


2.1 Patterns of Behavioral Engagement 

Behavioral engagement analysis from log files has shown re- 
peated evidence of useful, actionable patterns, such as re- 
sponse time, response time vs. accuracy/correctness inter- 
actions, approach vs. avoidance behaviors relative to prob- 
lem difficulty (e.g., skipping hard problems), and noisiness 
of answer quality (e.g., carelessness) [5]. Response time, par- 
ticularly very fast response time, is one of the most obvious 
features linked to behaviors associated with disengagement 
(e.g., guessing, skipping, straight-lining). For scored tasks, 
the interaction between response time and correctness has 
been extensively researched in the study of basic cognition 
as well as authentic learning tasks [6]. The relationship be- 
tween correctness and time is frequently a logistic relation- 
ship (assuming that time does not directly impact scoring): 
with very fast responses, correctness is approximately ran- 
dom, increasing rapidly to better than chance for more ordi- 
nary response time, and approaching an individual skill-level 
asymptote as time increases. At very large times, answer 
quality may once again decrease, either due to distraction 
(e.g., multi-tasking) or difficulty selecting a final answer [27]. 
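
The logistic relationship between response time and correctness described above can be sketched as a simple curve. All parameter values here (a chance level of 0.25, a skill asymptote of 0.9, a midpoint of 5 seconds, and the steepness) are hypothetical choices for illustration, and the sketch does not model the possible decline at very large response times.

```python
import math


def expected_correctness(response_time, chance=0.25, skill=0.9,
                         midpoint=5.0, steepness=1.0):
    """Logistic speed-accuracy curve.

    Very fast answers sit near `chance`; slower answers approach the
    individual `skill` asymptote. All defaults are illustrative assumptions.
    """
    logistic = 1.0 / (1.0 + math.exp(-steepness * (response_time - midpoint)))
    return chance + (skill - chance) * logistic


# Evaluate the curve at a few illustrative response times (seconds).
curve = [expected_correctness(t) for t in (0.0, 2.5, 5.0, 10.0, 30.0)]
```

With these assumed parameters, correctness starts just above the chance level for near-instant responses, passes the halfway point between chance and skill at the midpoint, and flattens close to the skill asymptote for long response times.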


More complex interactions often require understanding the 
relative problem difficulty. Research indicates that students 
with poor learning outcomes tend to avoid or abuse hints 
on problems that they find difficult [5]. Conversely, self- 
regulated learners may be more likely to skip or “game” 
through problems that are easier relative to their skill 
level but dedicate more time to harder problems [33]. While 
not yet investigated, this might also imply that more self- 
regulated learners may be less likely to demonstrate wheel- 
spinning [18] since they are more actively monitoring the 
usefulness of tasks. 


Estimates of answer correctness versus expected correctness 
have also been used, though these are likely most clear when 
the learner is close to mastery. Of these, carelessness and 
“slips” are the most well-established mechanisms [12]. More 
generally, there may be value in investigating any situation 
where correctness appears decoupled from traditional fac- 
tors (e.g., little correlation between time and answer quality, 
little correlation between expected mastery and later performance). 
However, such decoupling could be due to poor 
task design (e.g., item response issues [20]) or problems un- 
related to engagement (e.g., attention or memory problems), 
so additional context may be needed to interpret this. 


2.2 Archetypes for Behavioral Engagement 
When considering these different patterns of behavioral en- 
gagement and disengagement, we posit that engagement has 
at least two dimensions: a) passiveness vs. activeness and 
b) avoidance vs. approach. For example, passive avoidance 
represents disengagement commonly associated with bore- 
dom such as distraction or skipping through material. By 
comparison, other learners employ short-cut strategies to 
cheat or cherry-pick tasks to minimize effort while still pro- 
viding acceptable performance (active avoidance). A simi- 
lar division exists for engaged learners, in that some study 
almost exclusively on assigned content (passive approach) 
while others monitor and self-regulate their effort to focus 
their learning (active approach). 


These latent engagement factors may be evident through 
different observed patterns. For example, while distraction 
and racing through material both represent disengagement, 
their data patterns will look very different. In considering 
these patterns, we developed the following candidates which 
may be evident across a variety of systems: 


• Diligent (Active Engagement): Spends somewhat more 
time on tasks and shows correspondingly better performance, 
and is more likely to complete optional tasks. 


• Self-Regulated (Active Engagement): Seeks out and 
spends greater time on harder tasks, but may skip or 
disengage on easier tasks [22, 33]. 


• Cherry Picking (Active Disengagement): Seeks out 
easier tasks or abuses features to make tasks easier 
(e.g., hint abuse), and avoids harder tasks [3]. 


• Nominal Engagement (Passive Engagement): Completes 
tasks as recommended or assigned, with ordinary 
time-on-task and performance. 


• Expert/Recall (Passive Engagement): Regardless of 
difficulty level, completes tasks very rapidly and with 
high performance. Possibly an expert on the content, 
but might also be shallow recall or lookup. 


• Racing/Guessing (Passive Disengagement): Rapidly 
answers (potentially multiple times) despite relatively 
poor performance [26]. 


• Distracted/Slow (Passive Disengagement): Uncommonly 
delayed or irregular answers, particularly when extra 
time does not appear to improve performance [27]. 
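
The candidate archetypes above can be held in a small lookup structure that records, for each one, the label given in parentheses in the list. This is only an illustrative data structure: the names `Archetype`, `ARCHETYPES`, and `group_by_quadrant` are hypothetical, and nothing here reflects how the described system actually stores or detects these patterns.

```python
from collections import namedtuple

# Each archetype carries the two labels given in parentheses in the list above.
Archetype = namedtuple("Archetype", ["name", "activeness", "engagement"])

ARCHETYPES = [
    Archetype("Diligent", "active", "engaged"),
    Archetype("Self-Regulated", "active", "engaged"),
    Archetype("Cherry Picking", "active", "disengaged"),
    Archetype("Nominal Engagement", "passive", "engaged"),
    Archetype("Expert/Recall", "passive", "engaged"),
    Archetype("Racing/Guessing", "passive", "disengaged"),
    Archetype("Distracted/Slow", "passive", "disengaged"),
]


def group_by_quadrant(archetypes):
    """Group archetype names by their (activeness, engagement) pair."""
    groups = {}
    for archetype in archetypes:
        key = (archetype.activeness, archetype.engagement)
        groups.setdefault(key, []).append(archetype.name)
    return groups


quadrants = group_by_quadrant(ARCHETYPES)
```

Grouping this way makes it easy to see that two quite different observable patterns, racing and distraction, share the same passive disengagement label, which is why the text stresses that their data patterns will look different.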


As with prior research on engagement, we do not assume 
that these archetypes are necessarily stable for a specific user 
across all content, but that they represent modes of interac- 
tion during learning. Additionally, these candidate patterns 
are not exhaustive and the specific evidence for each pattern 
may not be identical: while racing through material might 
involve rapid guessing in one system, in another it might in- 
volve skipping material entirely. Historically, this has meant 
that detectors are tuned using expert-labeled observations 
and/or expert feature engineering. 


2.3 Play Persona as a Labeling Methodology
This work applies a new approach to generating engagement 
labels for user sessions. While substantial research has been 
conducted on engagement, existing methods for determining 
engagement during computer-based learning are challenging
to scale. Our research is intended to complement three
methods currently in use: expert observers, sensor-based
affect detection, and self-report [13].


Expert observers can be trained on a specific coding manual 
until they reach high levels of agreement. Using techniques 
such as BROMP [29], a trained observer can monitor and 
label engagement events for multiple students. The primary 
barriers to collecting this data are the number of trained ob- 
servers required and issues of privacy and technology (e.g., 
observing students in online courses). Automated affect
detection (e.g., automated facial affect detection) has also been
used to analyze engagement [28, 19]. While in principle
facial affect scales to a large number of learners, engagement is
hard to interpret without also analyzing behavioral patterns 
(e.g., screen recordings, log files). As with human observers, 
privacy issues may prevent the necessary recording of data. 
Moreover, for both human and automated labeling, while 
learner states may be recorded, they do not include any in- 
terpretation about what strategies a learner is using (e.g., 
focusing on hard vs. easy problems). Self-report offers a 
different type of engagement label. Users can report their 
overall engagement and may also be able to describe the 
learning strategies that they are using [13]. However, self- 
reported engagement can be affected by reporting bias (e.g., 
claiming to be more engaged) or subjectivity of engagement 
ratings. 


To address these limitations, we identified play persona as 
a way to generate labeled engagement data. Play persona 
are behavioral archetypes often used for testing and analy- 
sis of video games, that reflect different goals and behavior 
patterns [35, 32]. For example, in a strategy game there are
recognized archetypes such as the Builder (invest in
long-term expansion) versus Greedy-Optimizer (take quick wins)
[34]. Likewise, research on Massive Multiplayer Games (e.g., 
[36]) has identified behavior archetypes such as competitors 
who focus on head-to-head tasks and explorers who focus on 
exploring the world. Artificial game players can be crafted 
to mimic these play persona for procedural play-testing [21]. 


We hypothesize that play persona methods can also be use- 
ful to identify and label engagement patterns with the mod- 
ification that human testers will act out these roles (e.g., 
diligent) which would be difficult to simulate artificially. If 
this approach is useful, it has at least three advantages over 
existing methods. First, it ensures rapid data collection of 
labels, since rather than having unbalanced labels (i.e., 80% 
of real users might be in one bin), testers can be directed 
to act out a variety of roles. Second, play-test labels should 
be interpretable since the intent of the learner is known, as 
opposed to purely bottom-up patterns or self-reported la- 
bels, which require experts to infer underlying strategies. 
Finally, despite some constraints (e.g., difficulty in faking 
more or less knowledge), dedicated testers may be able to 
play out multiple archetypes and do so repeatedly, reducing 
the need to recruit new testers. 


3. RESEARCH QUESTIONS 


This work investigates techniques to leverage play-testing 
data for detecting engagement patterns. However, this ap- 
proach will only be feasible if testers reasonably approximate 
the behavior of real users. It also relies on the assump- 
tion that while systems may differ, the main engagement 
archetypes will be fairly predictable (e.g., some users will 
be highly invested in learning every piece of content, others 
will be trying to get through as fast as possible). In this 
work, we examine the feasibility of play-testing to help clas- 
sify engagement patterns, and in particular investigate the 
following questions: 


• Q1 (Distinctiveness): Are the data patterns for a set
of play-tester archetypes distinct (different testers act
similarly, given similar instructions)?


• Q2 (Alignment): Will play-test archetypes align with
unsupervised clusters, producing labeled clusters similar
to how experts would label them?


• Q3 (Semi-Supervised Comparison): Will a semi-supervised
approach that builds a classifier from play-test and
aligned data label individual learners more consistently
than relying only on bottom-up clusters?


• Q4 (Basic Features): Will average response time and
scores, in simple systems, be sufficient for reasonable
engagement labels?


• Q5 (Expanding Features): Will increasing the number
of features to include task difficulty and feature
interactions lead to greater consistency in fewer samples?


These questions investigate the strengths and limitations of
the approach. Specifically, Q1 and Q2 focus on the reliability
of play-test labels to label unsupervised data, as compared
to human ratings. Q3 examines if building a semi-supervised
classifier is useful as opposed to simply using archetypes to
label bottom-up clusters. Q4 and Q5 query the effectiveness
of feature sets identified in the literature for classifying
engagement, starting with a very minimal set (response time
and scores) and then analyzing an expanded set of features
for their impact on cold-start performance.


Figure 1: SMART-E Analytics Pipeline Phases


4. METHODS 


To examine these questions, an analytics pipeline was devel- 
oped and then applied to a data set from a scenario-based 
intelligent tutoring system. This section will briefly describe 
the pipeline, then the learning system that produced the 
data set, and finally the techniques used to investigate each 
research question. 


4.1 Engagement Analytics Pipeline 

While this paper focuses on a specific data set, the tech- 
niques applied here are designed to be generalizable and re- 
usable as part of the SMART-E pipeline, shown in Fig. 1. 


This pipeline starts by standardizing the data available, recording
data from an arbitrary learning system as learning records
that meet the xAPI standard [1]. This “Raw xAPI” data may 
either be sent directly by the system (e.g., through an API 
for logging) or generated by running a converter on system 
logs after-the-fact. Raw xAPI data logs are then cleaned by 
a script (partially system-specific) which corrects common 
data problems, such as sessions that terminated improperly 
or missing data fields that can be inferred from other data. 
This ensures that the Canonical xAPI data store does not 
have missing data. 


All xAPI records contain metadata which allow them to be 
structured into an activity tree, representing both sequen- 
tial and parallel tasks. While the tree can be nested ar- 
bitrarily, four levels are analyzed to generate raw metrics 
tables: steps, tasks, lessons, and sessions. Raw metrics pri- 
marily record time-based information (e.g., duration of a 
task, response time for first step), score-based information 
(e.g., numerical score and/or correctness), and support used 
(e.g., hint counts, retrying a problem). 


Metrics related to task skills are not calculated, since the majority
of systems do not tag their tasks with a consistent ontology
of knowledge components. Intermediate metrics are
generated using feature construction calculations based on
raw metrics, without analyzing the xAPI logs. For this work,
the most important intermediate metrics are averages across
attempts (e.g., average scores, average task duration), the
average difficulty for each task (inferred from first-attempt
scores), z-scores for task metrics (e.g., time-on-task for the
learner relative to other users) and a Laplace-smoothed logarithm
of each task duration (i.e., ln(t + 1)). Additional
metrics can be added fairly rapidly, if they rely on raw metrics.
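As a minimal sketch of the intermediate metrics just described, the following Python functions compute a task difficulty, per-task z-scores, and the smoothed log duration. The function names are hypothetical, and two details are assumptions rather than statements about SMART-E: difficulty is taken as one minus the mean first-attempt score, and z-scores use the population standard deviation.

```python
import math
from statistics import mean, pstdev


def task_difficulty(first_attempt_scores):
    """Assumed convention: difficulty = 1 - mean first-attempt score (scores in [0, 1])."""
    return 1.0 - mean(first_attempt_scores)


def z_scores(values):
    """Z-score of each learner's value relative to all learners on the same task."""
    mu, sigma = mean(values), pstdev(values)
    if sigma == 0:
        # All learners have the same value, so nobody deviates from the mean.
        return [0.0 for _ in values]
    return [(v - mu) / sigma for v in values]


def log_duration(seconds):
    """Smoothed logarithm ln(t + 1), which is defined even when t = 0."""
    return math.log1p(seconds)


durations = [4.0, 8.0, 12.0]          # hypothetical task durations in seconds
first_scores = [1.0, 0.5, 0.0, 1.0]   # hypothetical first-attempt scores
print(z_scores(durations))
print(task_difficulty(first_scores))
print([log_duration(t) for t in durations])
```

Because each function depends only on raw metric values, further intermediate metrics of this kind can be added without revisiting the xAPI logs.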


Based on these metrics, feature vectors are generated that 
represent each learner’s performance in the system. In the 
current work, these vectors rely on all of the learner task 
data for a session, though one could generate similar fea- 
tures for specific tasks, across multiple sessions, or for re- 
cent tasks in a session (i.e., any collection of tasks). First, 
two simple features were calculated: average response time 
across tasks (Avg. RT) and average task performance (Avg. 
Score). These were considered the minimal information to 
potentially infer engagement. 


Next, a more complex feature set was developed to model 
interactions between task response time (RT), task scores, 
and task difficulty. Based on z-score cutoffs, the value of 
each variable was placed into one of three categories (low, 
medium or high) when possible, and into the most cate- 
gories available when not (i.e., only medium if all values 
equal; only low and high if only two types of values). This 
was done based on a one-dimensional Gaussian distribution, 
with cutoff values at <33% (low), 33-66% (medium), and 
>66% (high). Further we ensured that each variable had at 
least 4 corresponding data-points in order to arrive at ro- 
bust cutoffs (i.e., each unique task had been attempted by 
at least 4 different learners, to judge its difficulty, score and 
time distribution). Each scored task increments a bin associated
with its three variables (e.g., RT=fast, score=high,
difficulty=high will increment exactly one out of 27 possible
bins). This binning approach is fairly general, and can be 
inferred using only standard logging data. 
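The binning step can be sketched as follows, assuming each task already has z-scores for response time, score, and difficulty. The cutoffs are the 33% and 66% quantiles of a standard Gaussian; the names `categorize` and `bin_tasks` are hypothetical, and this sketch does not handle the degenerate cases described above (variables with only one or two distinct values).

```python
from collections import Counter
from statistics import NormalDist

# z-score cutoffs at the 33% and 66% quantiles of a standard Gaussian.
LOW_CUT = NormalDist().inv_cdf(0.33)
HIGH_CUT = NormalDist().inv_cdf(0.66)


def categorize(z):
    """Map a z-score to one of three categories."""
    if z < LOW_CUT:
        return "low"
    if z > HIGH_CUT:
        return "high"
    return "medium"


def bin_tasks(tasks):
    """Count tasks per (RT, score, difficulty) category triple.

    tasks: iterable of (rt_z, score_z, difficulty_z) tuples, one per scored task.
    Each task increments exactly one of the 3 * 3 * 3 = 27 possible bins.
    """
    bins = Counter()
    for rt_z, score_z, difficulty_z in tasks:
        key = (categorize(rt_z), categorize(score_z), categorize(difficulty_z))
        bins[key] += 1
    return bins


# Hypothetical session with three scored tasks.
session = [(-1.0, 1.0, 1.0), (-1.0, 1.0, 1.0), (0.0, 0.0, -1.0)]
print(bin_tasks(session))
```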


Since 27 bins will often be fairly sparse for an individual
learner, these were aggregated to form 7 bins which align
to behavioral engagement patterns from the literature: Ex- 
pert, Cherry Picking, Engagement/Diligent, Self-Regulated, 
Distracted, Racing, and Careless. These bins roughly cor- 
respond to the patterns we introduced earlier except we 
omitted Nominal Engagement, roughly equivalent to av- 
erage, and we added a Careless bin focused on errors on 
easier tasks. The Expert bin was increased whenever high 
scores were obtained for difficult problems with only a nor- 
mal or low delay or for high scores on ordinary problems 
done quickly. The Cherry Picking bin incremented for high 
scores with a low delay (regardless of problem difficulty). 
Engagement/Diligent was incremented when difficult problems
were completed after a high delay. Self-Regulated was
incremented when the amount of time spent on the problem 
was at least as high as the difficulty level, even if the score 
was not high. Distracted was triggered in the opposite case, 
where the time to respond was overly long for the difficulty 
of the problem. Racing was incremented for fast responses,
either with low scores or with medium/high scores on easier
problems. The Careless bin included only low scores on
easy problems or low scores on medium difficulty problems 
when completing them quickly. 


These bins were not mutually exclusive, since more than 
one behavior might explain a given interaction. Addition- 
ally, they are not validated and should be thought of as 
noisy constructs to bin low-level features, rather than neces- 
sarily predictive of their given labels. However, since these 
aggregation patterns are derived from the literature, these 
features are candidates that may be relevant across different 
systems, users, or data sets. 
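One possible reading of these aggregation rules is sketched below. The prose leaves some boundaries open, so the exact conditions are assumptions made for illustration: "at least as high as the difficulty level" is read as an ordinal comparison of categories, and "overly long" as a response-time category strictly above the difficulty category. The function name `engagement_bins` is hypothetical.

```python
LEVEL = {"low": 0, "medium": 1, "high": 2}


def engagement_bins(rt, score, difficulty):
    """Return every aggregated bin a single task increments (not mutually exclusive).

    rt, score, difficulty: each one of "low", "medium", "high".
    """
    bins = set()
    if score == "high" and (
        (difficulty == "high" and rt in ("low", "medium"))
        or (difficulty == "medium" and rt == "low")
    ):
        bins.add("Expert")
    if score == "high" and rt == "low":
        bins.add("Cherry Picking")
    if difficulty == "high" and rt == "high":
        bins.add("Engagement/Diligent")
    if LEVEL[rt] >= LEVEL[difficulty]:  # assumed ordinal reading
        bins.add("Self-Regulated")
    if LEVEL[rt] > LEVEL[difficulty]:   # assumed reading of "overly long"
        bins.add("Distracted")
    if rt == "low" and (score == "low" or difficulty == "low"):
        bins.add("Racing")
    if score == "low" and (
        difficulty == "low" or (difficulty == "medium" and rt == "low")
    ):
        bins.add("Careless")
    return bins


print(engagement_bins("low", "high", "high"))
print(engagement_bins("high", "low", "high"))
```

Under this reading a single task can increment several bins at once, which matches the statement that the bins are not mutually exclusive.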


4.2 User Data: ELITE Scenarios 

We use data from ELITE Lite Counseling, a system designed
for U.S. Army officers in training to learn leadership
counseling skills, such as active listening, checking for un- 
derlying causes, and responding with a course of action [11]. 
Learners select what to say to virtual subordinates from a 
menu leading to different points in a branching graph repre- 
senting the possible conversations. The virtual subordinates 
speak using pre-recorded audio and act via 3D animations. 


Each learner choice can have both positive and negative an- 
notations. Positive annotations correspond to correctly ap- 
plying a skill such as active listening, and negative annota- 
tions correspond to omissions or misconceptions. Based on 
these annotations, a choice can be fully correct (only positive 
annotations) or two forms of incorrect: fully incorrect (only 
negative annotations) or mixed (both positive and negative 
annotations). For the pipeline, this was converted to two 
forms: a correctness category and a numerical score in which 
mixed answers were given partial credit (0.5) compared to 
correct answers (1.0) and incorrect answers (0.0). 
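The conversion from annotations to a correctness category and score can be sketched as below. The function names are hypothetical, and raising an error for a choice with no annotations is an assumption, since that case is not described here. The second function mirrors the After Action Review scoring described in the next paragraph.

```python
def score_choice(n_positive, n_negative):
    """Map annotation counts for one dialogue choice to (category, score)."""
    if n_positive == 0 and n_negative == 0:
        # Assumption: every scored choice carries at least one annotation.
        raise ValueError("choice has no annotations")
    if n_negative == 0:
        return ("correct", 1.0)
    if n_positive == 0:
        return ("incorrect", 0.0)
    return ("mixed", 0.5)


def score_aar_question(first_attempt_correct):
    """AAR questions earn 1 only when the first attempt is correct."""
    return 1.0 if first_attempt_correct else 0.0


print(score_choice(2, 0))
print(score_choice(1, 1))
print(score_choice(0, 3))
```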


Each simulated conversation is also followed by an After 
Action Review (AAR) in which learners are asked multiple- 
choice questions about all of their dialogue choices that were 
mixed or incorrect. For these AAR questions, if the first 
attempt to answer was successful the learner earned a score 
of 1; otherwise, the learner earned a score of 0 but had to 
keep trying until they selected the correct response. 


The ELITE data set for this research included a corpus of 
145 subjects from experiments described in [17], which we
consider user data. Each “user” completed three scenarios: 
Scenario 1, Scenario 1 (Repeated), and Scenario 2. Due to 
the dialog trees, users did not all see the same decision tasks 
when completing the same scenarios. However, substantial 
overlap was observed for tasks and a majority of tasks were 
attempted by a significant number of users. For the pur- 
pose of estimating task difficulty, a threshold of 5 attempts 
was used, below which the difficulty and metrics relying on 
difficulty (e.g., binning) could not be calculated. 


4.3 Play Persona Data 

Play-testers were students and employees of the lab who 
volunteered their time, and generally were not familiar with 
the scenario content. For the ELITE system, only 5 play- 
test archetypes were reasonable to classify: Expert, Diligent, 
Nominal Engagement, Racing, and Distracted. Play-testers 
followed the same protocol (i.e., scenario 1, scenario 1 re- 
peated, scenario 2) used to collect the user data. Thus, they 
had no direct control over the tasks they encountered, and 
so some patterns were unlikely to be observed (e.g., Self- 
Regulated and Cherry Picking). 


Each play-tester was able to generate data for up to three 
archetypes, by attempting them in a specific sequence. First, 
they could play as either Diligent or Distracted. These roles 
could only be played at the beginning of testing to simulate 
a novice seeing the system for the first time. Next, a Racing 
run was completed; fast response times meant testers would 
still make errors despite their previous practice. Expert runs 
were collected in two ways: either an actual expert gener- 
ated the data (2 sessions), or a tester carefully reviewed the 
correct answers (e.g., in the AAR) and/or was coached by 
an expert (13 sessions). These different methods for “ex- 
pert” data produced similar results, though actual experts 
were slightly faster. In the unlabeled data, an archetype for 
Nominal Engagement was generated by extracting five clus- 
ters and assigning Nominal Engagement to the cluster not 
aligning with the other four archetypes. 


Instructions for each play-test archetype were as follows. 
Diligent: spend as much time as you need on each choice 
to try to get the best answer, including reading carefully, 
and double-checking answers. Distracted: engage in one 
or more competing activities, including checking email and 
responding when relevant, browsing social media, engaging 
in a conversation, and eating. Racing: pretend you don’t 
care much about the content, so you are doing the bare mini- 
mum and are fine with a so-so score to get done quickly. Ex- 
pert: review content in-depth immediately ahead of time, 
and approach it with as many answers memorized or quickly- 
available as possible (e.g., in notes, from an expert) so you 
can answer well quickly. Of these, all except Distracted were 
easily understood by testers. Due to the lack of standard- 
ization for Distracted, some testers struggled to find a com- 
peting distraction task (e.g., did not use much social media, 
did not have high email volume, already ate lunch). In this 
case, a member of the research team asked the tester questions
or made other requests to distract them. A total of 51 archetype
sessions were collected, which may be more than necessary, 
since preliminary analyses found similar results with about 
25 points balanced across classes.


4.4 Cluster Alignment Testing

As shown earlier in Fig. 1, both the real user data and the 
play-tester archetype data were processed by SMART-E to 
generate feature vectors that represent each individual. A 
number of techniques were then applied to generate labeled 
clusters. This cluster-labelling process allowed us to classify
a learner's engagement coarsely on the basis of the cluster
that they were assigned to.


First, user feature data was clustered bottom-up into five 
distinct clusters using k-means and Gaussian mixture mod- 
els (GMM) methodologies as implemented in the scikit-learn 
package [30]. The number of clusters was verified through an 
elbow-curve analysis of variance explained (elbow at k=5). 
Exploration of k=4 and k=6 found both to be less stable; 
cluster assignments were often very different for subsamples 
of data points, with k=4 being particularly unstable. 


For this analysis of the full sample, we associated each user
cluster with a unique archetype (i.e., alignment of the smaller
archetype clusters with the user clusters). The alignment
was determined using the Hungarian Method (Kuhn-Munkres
algorithm) [24], a global optimal-matching algorithm
that minimized the sum of the Euclidean distances
between the user cluster centroids and archetype centroids.
As noted previously, the Nominal cluster was determined
as the cluster remaining after all archetype groups were
matched. As a result, each of the user clusters (and consequently
the points within that cluster) had an associated
unique archetype which additionally served as its label. When
this cluster alignment process is used as the only technique
to label points, it will be referred to as Clustering Alone.
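The alignment step can be sketched with SciPy's `linear_sum_assignment`, which solves the same assignment problem as the Hungarian Method. This is an illustration under stated assumptions, not the pipeline's code: the function name `align_clusters` is hypothetical, and the centroids are stand-ins computed as ln(mean RT + 1) from the Table 1 means, which only approximates the Log-RT centroids actually used for clustering.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment
from scipy.spatial.distance import cdist


def align_clusters(archetype_centroids, cluster_centroids, archetype_names,
                   leftover="Nominal"):
    """Match each archetype to a distinct cluster, minimizing total Euclidean distance.

    Returns {cluster index: label}; unmatched clusters receive the leftover label.
    """
    cost = cdist(archetype_centroids, cluster_centroids)  # Euclidean by default
    rows, cols = linear_sum_assignment(cost)
    labels = {int(c): archetype_names[r] for r, c in zip(rows, cols)}
    for c in range(len(cluster_centroids)):
        labels.setdefault(c, leftover)
    return labels


# Stand-in centroids derived from Table 1 (RT in seconds, score in [0, 1]).
names = ["Expert", "Diligent", "Distracted", "Racing"]
arch = np.column_stack([np.log1p([8.53, 13.15, 22.27, 7.18]),
                        [0.95, 0.89, 0.77, 0.56]])
clus = np.column_stack([np.log1p([8.10, 11.06, 8.63, 15.81, 7.98]),
                        [0.93, 0.90, 0.82, 0.83, 0.55]])

labels = align_clusters(arch, clus, names)
print(labels)  # indices 0..4 correspond to Clusters 1..5
```

With these stand-in centroids the matching pairs Clusters 1, 2, 4 and 5 with Expert, Diligent, Distracted and Racing, leaving Cluster 3 as Nominal, which agrees with the pairing shown in Table 1.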


4.5 Semi-Supervised Classification 

This technique of clustering alignment was compared against 
a semi-supervised approach that built a classifier using the 
play-test and user clusters. The high level concept of this 
semi-supervised classifier is shown in Fig. 2. The first two 
steps of the semi-supervised approach are the same as Clustering
Alone. This generates a pool of weakly-labeled candidate
points. The points in this pool can either be
taken as a full set to train a classifier model such as an SVM, or
be sampled to incrementally train a classifier using
active learning techniques (e.g., entropy sampling) until
a stopping rule is hit.
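The full-set variant of this approach can be sketched end to end with scikit-learn and SciPy. Everything below is illustrative: the data are synthetic blobs rather than ELITE data, the centers and noise level are made up, and the linear kernel with feature scaling is an assumption, since the kernel is not specified here. Active-learning sampling is omitted.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment
from scipy.spatial.distance import cdist
from sklearn.mixture import GaussianMixture
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

rng = np.random.default_rng(0)
# Synthetic (Log-RT, score) centers; illustrative only, not the paper's data.
centers = {"Expert": (2.0, 0.95), "Diligent": (2.6, 0.90), "Nominal": (2.3, 0.75),
           "Distracted": (3.1, 0.80), "Racing": (2.0, 0.50)}


def blob(center, n):
    """Draw n noisy points around a center."""
    return rng.normal(loc=center, scale=0.03, size=(n, 2))


# Unlabeled "user" sessions drawn from all five groups.
X_user = np.vstack([blob(c, 30) for c in centers.values()])
# Labeled play-test sessions; Nominal has no play-test archetype.
arch_names = ["Expert", "Diligent", "Distracted", "Racing"]
X_arch = np.vstack([blob(centers[a], 10) for a in arch_names])
y_arch = np.repeat(arch_names, 10)

# Step 1: bottom-up clustering of the user data.
gmm = GaussianMixture(n_components=5, n_init=10, random_state=0).fit(X_user)
user_cluster = gmm.predict(X_user)

# Step 2: align archetype centroids to cluster centroids; the leftover is Nominal.
arch_centroids = np.array([X_arch[y_arch == a].mean(axis=0) for a in arch_names])
rows, cols = linear_sum_assignment(cdist(arch_centroids, gmm.means_))
cluster_label = {int(c): arch_names[r] for r, c in zip(rows, cols)}
weak_labels = np.array([cluster_label.get(int(k), "Nominal") for k in user_cluster])

# Step 3: train one classifier on play-test data plus weakly labeled user data.
clf = make_pipeline(StandardScaler(), SVC(kernel="linear", C=10.0))
clf.fit(np.vstack([X_arch, X_user]), np.concatenate([y_arch, weak_labels]))

print(clf.predict([[2.0, 0.50], [2.3, 0.75]]))
```

The play-test points keep their known labels throughout, so they anchor the classifier even when a user cluster is small or shifts as data arrive.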


To compare the classifier against cluster-level labels based on 
archetype alignment alone, we calculated two quality met- 
rics for the labels given to user sessions, which we will term 
consistency and stickiness. Consistency refers to the frac- 
tion of sessions that are labeled with the same engagement 
archetype which they would receive when the full data set 
is available. This is important because as data gets larger, 
unsupervised clusters are more likely to reflect the true dis- 
tributions. 


Stickiness refers to the likelihood that a user session retains 
the same engagement label after a batch of new data is added 
(similar to intra-rater reliability). This is important for ac- 
tionable engagement metrics: if Student A is classified as 
Diligent, it will be confusing if Student B who completes an 
identical run is classified differently due to data that arrived 
in between. While this cannot be fully avoided, approaches 
that tend to keep the same label for an identical session will 
appear more fair and reliable, so that an instructor could be 
more confident in using the classifications. 


That said, neither consistency nor stickiness alone is sufficient
for useful classification. For example, always assigning
all users to the same category maximizes both metrics.
However, assuming clusters for the full data set are reliable, 
then these measures help to identify how quickly and reli- 
ably labels approximate the final labels. This is important 
for addressing the cold start problem, so that engagement 
patterns can be quickly identified in a new system. 
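Both metrics reduce to an agreement fraction between two label lists over the same sessions; they differ only in the reference labels. A minimal sketch, with hypothetical function names and made-up labels:

```python
def agreement(labels_a, labels_b):
    """Fraction of sessions that receive the same label in both lists."""
    if len(labels_a) != len(labels_b):
        raise ValueError("label lists must cover the same sessions")
    return sum(a == b for a, b in zip(labels_a, labels_b)) / len(labels_a)


def consistency(labels_at_n, full_data_labels):
    """Agreement with the labels obtained when the full data set is available."""
    return agreement(labels_at_n, full_data_labels)


def stickiness(labels_at_n, labels_before_batch):
    """Agreement with the labels assigned before the latest batch was added."""
    return agreement(labels_at_n, labels_before_batch)


full = ["Diligent", "Expert", "Racing", "Diligent"]
before = ["Diligent", "Nominal", "Diligent", "Diligent"]
now = ["Diligent", "Expert", "Diligent", "Diligent"]
print(consistency(now, full), stickiness(now, before))

# Degenerate case: one constant label maximizes stickiness while telling us nothing.
constant = ["Diligent"] * 4
print(stickiness(constant, constant))
```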


To calculate the number of samples to reach a given level of 
cold start performance, random splits were made of the user 
data set into train-test subsets (115 train, 30 test). For each 
random split, the classifier was trained using the archetype 
data set (51 samples) and increasingly larger subsets of the 
user training data in increments of 5. When evaluating
cold-start performance, a consistency of 85% was considered a
reasonable target threshold for reliability against the full 
sample. While the actual consistency required will depend 
on the specific application, this cutoff should give some in- 
sight into how quickly different approaches converge toward 
their larger-sample performance. 
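Reading the cold-start point off a learning curve can be sketched as follows. The curve values are hypothetical, and treating the first crossing of the threshold as the cold-start point (rather than requiring the curve to stay above it) is an assumption.

```python
from statistics import mean


def samples_to_reach(curve, threshold=0.85):
    """Smallest training-set size whose mean consistency meets the threshold.

    curve: {number of user samples: [consistency for each random split]}.
    Returns None if the threshold is never reached.
    """
    for n in sorted(curve):
        if mean(curve[n]) >= threshold:
            return n
    return None


# Hypothetical consistency values for two random splits at each size.
curve = {40: [0.78, 0.80], 45: [0.82, 0.85], 50: [0.86, 0.88], 55: [0.90, 0.89]}
print(samples_to_reach(curve))
```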


Since the pipeline parameterizes the specific algorithms, follow-up
exploratory analyses were conducted with different types
of clustering algorithms (e.g., k-means, GMM), classifica- 
tion algorithms (e.g., logistic regression, support vector ma- 
chines), and semi-supervised sampling algorithms (e.g., full 
sampling, margin sampling with stopping rules to exclude 
certain unsupervised samples). Different combinations of 
these algorithms did not show qualitatively different end- 
results on these metrics, and any differences were not conclu- 
sive (e.g., GMM clusters appeared slightly more stable than 
k-means as data was added, but within random variation). 
As a result, this paper presents results for the GMM clus- 
tering with a Support Vector Classifier, where these results 
are representative for the different approaches explored. 


5. RESULTS 


Focusing on GMM clustering, we revisit the alignment of the 
five user clusters with the play-tester archetype groups. The 
clusters were generated using the average of the logarithms 
for task response time (Log-RT) and average of task scores 
(Scores). Fig. 3 plots the real user data with unsupervised 
clusters. Table 1 shows feature means and standard de- 
viations for each archetype, above its most closely-aligned 
bottom-up cluster. Note that while Log-RT was used for 
clustering, the actual time in seconds is given in the table 
and figure for easier interpretation. 


Despite being generated independently, the play-test data 
closely resembles the real bottom-up clusters. As a trend, 
the play-tester archetypes tend to be more extreme (i.e., far- 
ther from the average user) than the clusters they align to. 
This is likely due to play-testers acting out more exagger- 
ated or consistent patterns than real users. However, this 
may actually be an advantage, since play-test archetype data 
points may be more likely to be outliers in the vector space 
and good anchors for distinct clusters. The results from Ta- 
ble 1 support research question Q1, in that play-testers were 
able to act out similar patterns as real users and that the 
play-tester data showed fairly distinct groupings (as evident 
in the standard deviation values). One exception was the 
Distracted archetype, which had a very high variance for 
time compared to real users in the corresponding cluster. 
However, despite the high variance, the Distracted archetype 
data remained distinct from other archetypes’ data. 


5.1 Reliability vs. Expert Labels 

The validity of this alignment on the full data set was eval- 
uated by surveying a set of external engagement experts 
(N=5) to label the same bottom-up clusters obtained from 
the user feature data, based on the descriptions of the en- 
gagement archetypes. Selection criteria for experts required 
a Ph.D. in a relevant area, publishing at least one substan- 
tial paper researching learner engagement, and having no 
prior experience with the data set. 


Group             |  N | Avg. RT (s)   | Avg. Score
Expert (Arch)     | 15 |  8.53 ± 2.43  | 0.95 ± 0.04
Cluster 1         | 25 |  8.10 ± 1.00  | 0.93 ± 0.03
Diligent (Arch)   | 14 | 13.15 ± 3.83  | 0.89 ± 0.07
Cluster 2         | 75 | 11.06 ± 1.61  | 0.90 ± 0.03
Nominal (Arch)    |  - |      -        |      -
Cluster 3         | 13 |  8.63 ± 1.11  | 0.82 ± 0.02
Distracted (Arch) | 12 | 22.27 ± 13.80 | 0.77 ± 0.17
Cluster 4         | 28 | 15.81 ± 3.43  | 0.83 ± 0.07
Racing (Arch)     | 10 |  7.18 ± 2.47  | 0.56 ± 0.17
Cluster 5         |  4 |  7.98 ± 1.08  | 0.55 ± 0.09


Table 1: Cluster vs. Archetype Centers (μ ± σ)


Figure 3: GMM User Clusters for Response Time and Score
Features


Experts labeled cluster graphs (e.g., Fig. 3) generated by
both k-means and GMM, and maintained quite similar labels
across each (76% agreement). Since the clusters and labels
for both GMM and k-means were very similar, all labels
were treated as examples from the same task. Inter-rater
reliability metrics were moderate between experts: 55% Agreement;
Fleiss’ kappa = 0.44; Krippendorff’s alpha = 0.45.
Expert raters had very high reliability for Expert and Racing
labels, but approximately half of experts demonstrated
a consistently different interpretation for Diligent, Nominal
(phrased as “Average” in the survey), and Distracted. Based
on open response comments, this may have been the result of
interpreting minor wording differences in the prompts (e.g.,
“novice learners” for Diligent vs. “learners” in Distracted).
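For readers unfamiliar with the agreement statistics used in this subsection, Fleiss' kappa can be computed from a table of rating counts as sketched below. The function name and the small example tables are hypothetical and are not the expert ratings from this study.

```python
def fleiss_kappa(counts):
    """Fleiss' kappa for a table where counts[i][j] is the number of raters
    who assigned subject i to category j (same number of raters per subject)."""
    n_subjects = len(counts)
    n_raters = sum(counts[0])
    if any(sum(row) != n_raters for row in counts):
        raise ValueError("every subject needs the same number of ratings")
    # Observed agreement: mean share of agreeing rater pairs per subject.
    p_i = [(sum(c * c for c in row) - n_raters) / (n_raters * (n_raters - 1))
           for row in counts]
    p_bar = sum(p_i) / n_subjects
    # Chance agreement from the marginal category proportions.
    n_categories = len(counts[0])
    p_j = [sum(row[j] for row in counts) / (n_subjects * n_raters)
           for j in range(n_categories)]
    p_e = sum(p * p for p in p_j)
    return (p_bar - p_e) / (1 - p_e)


# Five raters, two categories: full agreement, then near-chance disagreement.
print(fleiss_kappa([[5, 0], [0, 5], [5, 0]]))
print(fleiss_kappa([[3, 2], [2, 3]]))
```

Kappa equals 1 under full agreement and falls to zero or below when raters agree no more often than chance would predict, which is why it is reported alongside raw percent agreement.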


The human labels for clusters were then compared pair- 
wise against the automated alignments, resulting in Agree- 
ment, Fleiss’ Kappa and Krippendorff’s alpha metrics which 
were higher than within-experts though still in the moderate 
range: 66% Agreement; Fleiss’ kappa = 0.57; Krippendorff’s 
alpha = 0.58. Given these results and expert sensitivity to 
the wording of archetype descriptions, we conclude that the 
automatic alignment appears to be at least as useful as ex- 
pert consensus ratings for labeling engagement clusters. We 
anticipate automatic alignment to be even more advanta- 
geous when the feature space expands beyond 3 dimensions, 
making it difficult for human experts to visualize or evaluate.


(c) Consistency: Clustering Alone
(d) Label Stickiness: Clustering Alone


Figure 4: Consistency and Stickiness for Semi-Supervised vs. Clustering Alone 


5.2 Consistency of Semi-Supervised vs. Clustering Alone

To evaluate how play-test data can be used to classify new 
user sessions, a semi-supervised approach was explored which 
trained a Support Vector Machine (SVM) classifier using 
both the play-test archetype data and the data from the 
bottom-up cluster that best aligned to each archetype, with 
test-set labels determined by the classifier. For the cluster- 
ing alone comparison case, bottom-up clusters were directly 
aligned against archetype data to determine their labels and 
test-set labels were determined based on their closest clus- 
ter. 20 random splits were made of the user data set into 
train-test subsets (115 train, 30 test). For each random split, 
the classifier was trained using the archetype data set (51 
samples) and increasingly larger subsets of the training data 
set in increments of 5, and then evaluated. 


Consistency was calculated against the test set of 30 samples.
Stickiness of labels was calculated for each set of labels
against the prior set (e.g., model trained on N samples
vs. N-5 samples). Due to the higher level of noise
for clustering alignment alone, 100 runs were conducted in- 
stead of 20 for a smoother average. These results indicated 
that training a classifier which combined both types of data 
produced higher consistency and less variation. Specifically, 
on the basic features (avg. RT and performance only), the 
semi-supervised SVM reached 85% average consistency at 
52 samples (Fig. 4a), while aligned clusters alone required 
95 samples to reach this level (Fig. 4c). Clustering alone 
was more consistent with the full-data cluster labels until 
approximately 25 samples (i.e., when the user data reached 
approximately half of the archetype data). 


Likewise, the stickiness of labels as data increased reached 
an average of 85% by 45 samples for the semi-supervised 
classifier (Fig. 4b). Clustering alone never reached 85% and 
remained less than 70% on average (Fig. 4d). For both
metrics, the variance (blue bars) was larger for clustering
alone. One reason for the greater variability of clustering alone
is the sparse data in certain cluster regions (e.g., Racing,
with only 4 real users), so alignment alone may try to
align a non-existent cluster given limited data. However,
the semi-supervised classifier appears to mitigate this issue 
since training is anchored by play-test data points. 


These analyses were performed using both the basic features 
and the expanded feature set (e.g., bins that count instances 
of engagement behavior patterns based on response time, 
score, and difficulty categories). Both feature sets required 
a similar number of samples to reach the same level of consis- 
tency (e.g., about 85% consistency after 50 samples). While 
it is possible that the expanded feature set might produce 
more valid labels for an instructor (e.g., better reflecting the 
categories of users who an instructor might follow-up with), 
this will not be due to improved cold-start performance. 


5.3 Semi-Supervised Class-wise Consistency

An analysis of the consistency for labels in individual clus- 
ters (Fig. 5) shows similar insights to the overall clustering 
label consistency. Points in larger clusters (e.g., Diligent, 
Expert) are consistent fairly quickly. However, small clusters 
(Racing) may have few/zero examples even when consider- 
ing as many as 40 data points, and even with 100 data points 
have poor consistency. As such, classes with few examples 
might only be useful for a smaller set of use-cases (e.g., suf- 
ficient to share with an instructor, but possibly not reli- 
able enough to take an automated action confidently). We 
also note that based upon the stickiness analysis (Fig. 4d), 
performance may be limited by the instability of clustering 
(points moving between clusters even with nearly full data).


Figure 5: Class-wise Consistency for Semi-Supervised SVM


5.4 Semi-Supervised vs. Final Clusters 

The semi-supervised results were compared for their agree- 
ment with the labels obtained via alignment with the final 
clusters generated using the full data set. This final-clusters 
reference point (see Fig. 3) was used to calculate average ac- 
curacy, precision, recall and F-scores (Fig. 6), as a function 
of the increasing dataset size. While final clusters are not
a perfect reference, this comparison shows that accuracy versus
final clusters increases fairly rapidly, but that precision, recall and
F-scores are consistently lower.

Figure 6: Agreement of Semi-Supervised SVM vs. Final
Cluster Labels
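
As an illustration of why accuracy can rise quickly while precision, recall and F-scores stay lower, the agreement measures can be sketched as follows. This is a minimal sketch that assumes macro-averaging over classes (the averaging scheme is not stated in the text); the function name `agreement_scores` and the toy labels are hypothetical and are not data from the study.

```python
def agreement_scores(reference, predicted):
    """Accuracy plus macro-averaged precision, recall and F1 between two label lists."""
    assert len(reference) == len(predicted)
    pairs = list(zip(reference, predicted))
    labels = sorted(set(reference) | set(predicted))
    accuracy = sum(r == p for r, p in pairs) / len(pairs)

    precisions, recalls, f_scores = [], [], []
    for label in labels:
        tp = sum(r == label and p == label for r, p in pairs)
        fp = sum(r != label and p == label for r, p in pairs)
        fn = sum(r == label and p != label for r, p in pairs)
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        f1 = (2 * precision * recall / (precision + recall)
              if precision + recall else 0.0)
        precisions.append(precision)
        recalls.append(recall)
        f_scores.append(f1)

    n = len(labels)
    return {
        "accuracy": accuracy,
        "precision": sum(precisions) / n,
        "recall": sum(recalls) / n,
        "f1": sum(f_scores) / n,
    }


# Toy example (hypothetical): a rare class that is never predicted.
reference = ["Diligent"] * 8 + ["Racing"] * 2
predicted = ["Diligent"] * 10
scores = agreement_scores(reference, predicted)
print(scores)
```

In this toy case the large class dominates accuracy, while the missed rare class pulls down every macro-averaged measure, which is one plausible reading of the gap reported in Fig. 6.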


6. DISCUSSION 


Based on the results presented, this work demonstrates the 
feasibility of using a play-testing methodology for detect- 
ing behavioral patterns of engagement. Moreover, this work 
also found that a classifier could be developed using this 
approach without engineering application-specific features. 
The classifier also offered reasonable cold-start performance 
and labeled engagement data fairly consistently for 5 cate- 
gories after 52 unlabeled samples and 51 archetype samples. 


Of the five research questions investigated, there was posi- 
tive support for four answers, with one left indeterminate. 
For Q1, play-tester data was distinctive and archetype data 
followed coherent patterns on features (e.g., response time, 
correctness). Archetype data did not show substantial over- 
lap between archetypes, even though play-testers received 
only limited instructions. This may be due to the limited 
degrees of freedom for the task. In a more complex or open- 
ended system, increased variation might lead to less coherent 
archetype data. With that said, many systems have simi- 
lar characteristics to the ELITE scenarios studied here (i.e., 
sequential linear or branching choice tasks, mixed with pas- 
sive content such as videos or animations). Moreover, these 
kinds of systems are often problematic for engagement, such 
as mandatory corporate training modules. 


For Q2, it was demonstrated that automated matching of 
play-test archetypes against pre-defined clusters performed 
comparably to expert labels for the same clusters. While 
refining instructions might improve inter-rater reliability on 
this specific task, the features presented to experts were al- 
ready chosen to be simple and visualizable so this repre- 
sented an optimistic scenario for expert cluster labeling. On 
more complex feature sets or systems, expert analysis might 
not even be possible. The broader question not explored 
in this work is the machine vs. human play-test agreement 
if they were not given pre-defined clusters (i.e., a data ex- 
ploration task). However, this would be challenging to con- 
duct: it requires a deep analysis by each expert researcher 
and the types of engagement categories might be highly uneven.
Alternatively, archetypes might be determined from
already-analyzed data sets (e.g., for hint abuse), to
see how effectively traces of play-test disengagement might
match authentic disengagement patterns.


For Q3, it was established that training a classifier with 
both play-test data and unsupervised cluster data showed 
advantages over simply re-clustering with new unsupervised 
data and then aligning clusters to archetype data. In some 
respects, this is not surprising: while the consistency met- 
ric used for evaluation is based on the unsupervised results 
from the full data set, the classifier is able to train with 
more data up-front (as much as double initially). More im- 
portantly, since all key archetypes are present in the play- 
test data, no category will start unrepresented. This par- 
ticularly helps for classifying points from relatively rare but 
distinctive categories (e.g., Racing). However, despite this 
advantage, points in small classes remained substantially less 
consistent than those in larger classes. 


In the long term, the best way to mix these data remains an
open question. Neither data source represents ground
truth. The archetype data demonstrates coherent engage- 
ment patterns, but these patterns might not reflect the ways 
real users experience the system (e.g., in the current re- 
search, they were exaggerated/overly extreme). The real 
user data is authentic, but may slowly wash out the classi- 
fier with unremarkable samples (e.g., overly ordinary). Ex- 
ploratory work was conducted where stopping rules were 
applied to balance the number of archetype vs. authentic 
samples (derived from active learning techniques, such as 
margin sampling and entropy sampling), but this has not 
yet produced obvious improvements. Similarly, techniques 
for weighting samples might be applied. However, the ideal 
balance between these data sources probably depends on 
the target use-case for the classifier. A recommender sys- 
tem may want a classifier that acts on labels regardless of 
their confidence scores. By comparison, a human instructor 
might prefer a narrowly-scoped but highly-actionable classi- 
fier, which might detect clear outliers but allow the majority 
of user sessions to be in a nondescript “Nominal” category
or not confidently classified. 
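
The two active learning criteria mentioned above rely on simple uncertainty scores, sketched below. This is only an illustration of margin and entropy scores computed from class probabilities; how such scores were turned into stopping rules in the exploratory work is not described here, and the function names and probability vectors are hypothetical.

```python
import math


def margin_score(probabilities):
    """Gap between the two most probable classes (a small gap means uncertain)."""
    top, second = sorted(probabilities, reverse=True)[:2]
    return top - second


def entropy_score(probabilities):
    """Shannon entropy in nats (a large value means uncertain)."""
    return -sum(p * math.log(p) for p in probabilities if p > 0)


# Hypothetical class probabilities for two user sessions.
confident = [0.90, 0.05, 0.05]
uncertain = [0.40, 0.35, 0.25]

for name, probs in [("confident", confident), ("uncertain", uncertain)]:
    print(name, round(margin_score(probs), 3), round(entropy_score(probs), 3))
```

A use-case that only acts on clear outliers would keep samples with a large margin (or low entropy) and leave the rest unclassified.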


On questions about the features required to classify engagement,
we found that basic features for the log of response times and
scores were sufficient in this case (Q4), and that the expanded
feature set including task difficulty and feature interactions did
not improve classification (Q5). These features helped to detect
engagement behavior that matched patterns observed from
play-testing: Expert/Recall, Diligent, Racing, and Distracted
as well as Nominal (i.e., matched by exclusion). However, 
both k-means and GMM tended to split up the mass of 
points in the region of Expert, Diligent, and Nominal despite 
these clusters being adjacent to each other. The cluster- 
alignment approach used in this work was selected primar- 
ily for the ability to interpret cold-start trends, while more 
advanced methods should further improve performance. It 
might be preferred to investigate techniques such as anomaly 
detection, which would favor a larger central cluster and 
smaller outliers which could correspond to atypical behavior 
which is actionable. Alternatively, other semi-supervised
techniques are available, such as applying specialized
semi-supervised support vector machines (which optimize margins
for both labeled and unlabeled data) [8, 31] or more
advanced techniques for integrating cluster data [16]. While
expanded features did not improve consistency or stickiness 
metrics (Q5), other systems may still benefit from expanded 
features. However, additional features also increase the required
data and may lead to overfitting, require attenuating or filtering
features during clustering, or introduce other trade-offs.
As such, further research is needed on this problem. 


7. CONCLUSIONS AND FUTURE WORK 


Based on these findings, this work contributes a number of 
novel approaches to analyzing engagement. First, this re- 
search demonstrates the utility of play persona data gath- 
ered during professional or quality assurance testing for train- 
ing useful data mining algorithms. Since there is no defini- 
tive metric for engagement, play-test data offers an addi- 
tional distinct data source to help recognize engagement and 
disengagement. To our knowledge, this approach has not 
been applied to analyzing engagement in learning. 


Second, this approach offers advantages over current approaches
for cold-start labels. Since the behavioral intentions
of the play-test users are known with confidence, these
labels offer a good data set to help overcome cold start prob- 
lems. As compared to traditional approaches such as train- 
ing observers or collecting in-the-moment self-reported en- 
gagement [13, 29], play persona data can be collected prior 
to real system users. This approach also allows balanced 
sampling for important but lower-frequency engagement be- 
haviors (such as racing, in this analysis). 


Third, we have demonstrated that semi-supervised classi- 
fiers trained based on a combination of play-test labels and 
unlabeled data offer more consistent labels than relying on 
clustering alone, which has been used to analyze engagement 
behaviors [23]. Moreover, as shown by agreement with ex- 
pert labels at the cluster level, the alignment approach can 
provide similar insights without manually interpreting clus- 
ters. While expert interpretation is still ideal, this allows 
immediate insights without waiting for an expert analysis. 


This approach is also pragmatic: System developers should 
already test and perform quality assurance on their soft- 
ware and content [35]. Behavioral archetype data can be 
collected during this process, by having testers play out en- 
gagement styles in a prescribed order based on their ex- 
pected learning. Moreover, this work is not unique to spe- 
cific archetypes: if learners are expected to engage in dif- 
ferent patterns, play-testers may be able to produce those 
patterns instead. However, not all archetypes may be re- 
alistically playable by testers. For example, experts cannot 
typically generate novice answers. As such, this approach 
may be most effective when testers are similar to authentic 
users. Future work will therefore explore how expert observer
labels and self-report data might complement this play persona
data.


ABSTRACT


Improving the pedagogical effectiveness of programming training
platforms is a hot topic that requires the construction
of fine-grained and exploitable representations of learners’ programs.
This article presents a new approach for learning program
embeddings. Starting from the hypothesis that the function
of a program, but also its “style”, can be captured by analyzing
its execution traces, the code2aes2vec method proceeds
in two steps. A first step generates abstract execution se- 
quences (AES) from both predefined test cases and abstract 
syntax trees (AST) of the submitted programs. The doc2vec 
method is then used to learn condensed vector representations
(embeddings) of the programs from these AESs. Experiments
performed on real data sets show that the embeddings
generated by code2aes2vec efficiently capture both the
semantics and the style of the programs. Finally, we show
the relevance of the program embeddings thus generated on 
the task of automatic feedback propagation as a proof of 
concept. 


Keywords 

Representation Learning, Program Embeddings, Neural Net- 
works, Educational Data Mining, Computer Science Educa- 
tion, doc2vec. 


1. INTRODUCTION 


Increasingly, programming is being learned through the use 
of online training platforms. Typically, learners submit their 
code(s) and the platform returns any syntax errors or func- 
tional errors (typically based on test cases defined by the 
teacher). The exploitation of these data opens up new perspectives
to monitor and help beginners in learning programming.
They can be used, for example, to identify students
who are dropping out, to target bad practices or to propagate
teacher feedback. These functionalities would allow
the learner to be more autonomous during their learning, and
the teacher to be more reactive and efficient in their interventions.
However, this exploitation requires a detailed analysis
of the submitted programs. These training platforms must
go beyond a simple syntactic analysis of the script, and al- 
low the associated semantics to be considered. For this pur- 
pose, learning program embedding has recently emerged as 
a promising area of research [15, 13, 17, 7, 2, 3]. A natural 
way to generate such vectorial and condensed representa- 
tions is to consider a computer program as a text and to 
exploit methodologies inspired by Text Mining. 


Text mining has attracted a lot of interest in recent years. 
The representation of texts as vectors of real numbers, also
called “embedding”, has been at the heart of many recent
works. These representations make it possible to project (or
“embed”) a whole vocabulary into a low-dimensional space.
Moreover, such a representation of words makes it possible to
exploit a wide variety of numerical processing methods (neural
networks, SVM, clustering, etc.). At that stage, one of the
challenges is to capture in these representations the underly- 
ing semantic relationships (e.g. similarities, analogies). The 
work of [11] based on the use of neural networks has been 
a precursor in this area. Their word2vec method is one of 
the most referenced in the field. Its principle is based on 
the relation between a word and its context (words appear- 
ing before and after). To do this, they propose some simple 
and efficient architectures to learn word embeddings from a 
corpus of texts. For example, the CBOW (Continuous Bag- 
Of-Words) architecture trains a neural network to predict 
each word in a text given its context. Their results show the
ability of this approach to extract complex semantic relations
(analogies) from simple operations on the projections $v(\cdot)$,
such as $v(\text{``king''}) - v(\text{``man''}) + v(\text{``woman''}) \approx v(\text{``queen''})$ or
$v(\text{``Paris''}) - v(\text{``France''}) + v(\text{``Italy''}) \approx v(\text{``Rome''})$.
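
The analogy operation can be illustrated with a few lines of Python. This is a toy sketch only: the two-dimensional vectors below are hand-made for the illustration (first coordinate "royalty", second coordinate "gender") and are not embeddings learned by word2vec; the helper names `cosine` and `analogy` are hypothetical.

```python
import math

# Hand-made toy vectors (NOT learned embeddings): [royalty, gender].
toy_vectors = {
    "king": [1.0, 1.0],
    "queen": [1.0, -1.0],
    "man": [0.0, 1.0],
    "woman": [0.0, -1.0],
    "apple": [-1.0, 0.0],
}


def cosine(a, b):
    """Cosine similarity between two vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def analogy(a, b, c, vectors):
    """Return the word whose vector is closest to v(a) - v(b) + v(c)."""
    target = [x - y + z for x, y, z in zip(vectors[a], vectors[b], vectors[c])]
    candidates = [w for w in vectors if w not in (a, b, c)]
    return max(candidates, key=lambda w: cosine(vectors[w], target))


print(analogy("king", "man", "woman", toy_vectors))
```

With learned embeddings the equality is only approximate, which is why the nearest neighbour (here by cosine similarity) is returned rather than an exact match.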


The transposition of these approaches to computer programs 
is not straightforward. The code has certain specificities that 
need to be integrated to have such rich representations [1]. 
Unlike texts, codes are runnable, and small modifications 
can have significant impacts on their executions. A pro- 
gram can also call other programs that can themselves call 
other programs. The context in which an instruction is used 
is also particularly important in deducing its role. Finally, 
unlike texts, program syntax trees are usually deeper and 
composed of repeating substructures (loops). Existing ap- 
proaches for building program embeddings only partially in- 
tegrate these specificities. They independently exploit the 
instructions [3], the inputs/outputs [15], part of the execution
traces [17] or the abstract syntax tree (AST) [2]. They
focus more on the function of the program (what it does)
than on its style (how it does it). Moreover, most of these
approaches are supervised and build embeddings for a specific
task (e.g. predict errors, predict functionality, etc.).


Figure 1: General scheme of the code2aes2vec method for learning program embeddings.


In view of these limitations, we propose the code2aes2vec 
method, exploiting instructions, code structure and execu- 
tion traces of programs, in order to build finer program 
embeddings. Figure 1 outlines the overall code2aes2vec
approach we propose. The first step of this method
consists in generating Abstract Execution Sequences (AES)
from traces obtained on test case executions and program
ASTs. The second step uses the doc2vec¹ neural network [8]
to learn program embeddings from AESs. Contrary to ex- 
isting approaches, we therefore propose a generic and unsu- 
pervised method that learns program embeddings by using 
functional, stylistic and execution elements. This aspect is 
crucial in our application to be able to differentiate programs 
answering the same exercise (i.e. implementing the same
functionality) but in different ways (in terms of strategy or
efficiency). Our approach is validated on two real data sets,
composed of several thousand Python programs from
educational platforms. On these datasets, we show that embeddings
generated with code2aes2vec make it possible to efficiently
detect the function and the style of a program. In addition,
we present a proof of concept of the use of such embeddings
to propagate teacher feedback.


To summarize, the main contributions of this paper are:


1. the definition of a new (intermediate) program representation,
called Abstract Execution Sequences (AES),
which captures more semantics,


2. the exploitation of these (intermediate) representations with
doc2vec to build program embeddings in an unsupervised way,


3. the release to the community of two enhanced
datasets from educational platforms in computer science,


4. a proof of concept on the use of such program embeddings
for feedback propagation for educational purposes.


¹ doc2vec is a method derived from word2vec that learns a
document embedding from its words.


The next section details existing work in the field and
the originality of our approach in relation to it. Section 3
presents our two-step method: the construction of abstract
execution sequences (AES) and the learning of embeddings 
from them. Section 4 is devoted to the qualitative and 
quantitative evaluations of the learned representations be- 
fore drawing up the many perspectives of this work (sec- 
tion 5). 


2. RELATED WORKS 


Learning representations from programs is at the heart of 
many recent works. They aim to embed this data in a se- 
mantic space from which further analysis can be conducted, 
because the generated representations (vectors of real val- 
ues) are directly exploitable by a large part of the learning 
algorithms. To do this, recent work relies heavily on meth- 
ods developed to build word embeddings in texts, while try- 
ing to integrate the specificities of the code. 


These program embeddings are then used for prediction 
or analysis tasks related to the software development (de- 
bugging, API discovery, etc.) and teaching (learning pro- 
gramming). Two types of embeddings are more particularly 
studied: embeddings of the elements composing a program 
(words, tokens, instructions, or function calls) [12, 6, 7] and 
embeddings of the programs themselves [15, 17, 3, 4]. 


Nguyen et al. [12] study API call sequences to derive an 
API embedding that is independent of the programming 
language (thus allowing translation from one language to 
another). This embedding is learned by word2vec [11] from 
call sequences from several million methods. The word2vec
method builds each word embedding from the words appearing
nearby in each text (i.e. its context). Two datasets derived 
from the Java JDK and from more than 7000 recognized 
C# projects from GitHub (more than 10 stars) are used for 
learning. 


De Freez et al. [6] have a close objective, namely to build 
an embedding of functions used in a code in order to find 
synonymous functions. However, the function calls made in
each program are extracted in the form of a graph depending 
on the control structures. Random walks are then performed 
in this graph for each function, and the extracted paths are 
used as input to word2vec [11] to derive an embedding of 
functions. An extract of two million lines of the Linux kernel 
is used to learn the embedding. 


In [7], the authors propose a relatively similar approach but 
replace function calls with abstract instructions. Each in- 
struction sequence represents a path in the code depending 
on conditional instructions. In this way, it is similar to a 
trace even if the code is never executed. All possible paths 
are extracted but loop repetitions are ignored and only the 
most frequent instructions are considered (a threshold of 
1000 occurrences is used in the experiments). Moreover, 
only certain constant values are considered to limit the size 
of the vocabulary. An embedding of these program instruc- 
tions is then learned by the GloVe method [14] based on the 
co-occurrence matrix of the words. The authors also use a 
corpus of 311,670 procedures (in C language) from the Linux 
kernel, and evaluate it on a data set of 19,000 pre-identified 
analogies. 


In many applications, it is not just a matter of consider- 
ing the program’s components but the program as a whole. 
Recent work has therefore focused on the construction of 
program embeddings. 


With a teaching motivation, Piech et al. [15] construct
student program embeddings and use them to automatically
propagate teacher feedback (using the k-means algorithm).
The embedding space is built from a neural network trained 
to predict the output of a program using its input. It thus 
captures the functional aspect of the code. The authors also 
try to capture the style of programs, using a recursive neural
network based on the program’s AST. Contrary to other
approaches, the generated embeddings are matrices, not vectors,
thus limiting their use as inputs for subsequent data
analyses. They also do not consider learner-defined variables.
The obtained representations capture the code function
quite well, but fail to capture the style of the code. The
analyzed programs (from the Hour of Code site and a course
at Stanford) are written in a language similar to Scratch and
allow operations in a maze world.


In [17], the authors highlight the limitations of syntax-based 
approaches to capture the semantics of a program. Instead, 
they propose to consider the trace resulting from the code 
execution, and more precisely the values of the most fre- 
quent variables. Different representations are proposed and 
used to train a recurrent neural network whose objective is 
to predict errors made by students in a programming course. 
The embeddings of the programs correspond to one of the
neural network’s layers. The authors put forward one representation
in particular, which considers the trace of each
variable independently and integrates the dependencies between
variables in the structure of the neural network. However,
the obtained embeddings are specific to one task, and
this method requires redefining the neural network architecture,
with re-training, for each exercise.


Finally, Alon et al. [2] propose a neural network to predict
the name of a method (i.e. the functionality) from its
code. To do this, the program is first decomposed into a
collection of paths (from one leaf to another) in the AST.
Only the most frequent paths in the dataset are used as
features (size constraints are also integrated). Then, the
network learns which paths are important for predicting the
method name using the attention principle. The parameters
of the trained neural network correspond partly to the final 
embeddings and partly to the weights supposed to quantify 
the importance of each (feature) path for the prediction task 
(attention principle). Training is performed on a corpus of 
more than 13 million Java programs from GitHub’s 10,072 
most popular projects. As mentioned by the authors, this 
approach requires a large number of input programs. Fur- 
thermore, it is not possible to predict the function (and em- 
bedding) of a program whose paths do not appear in the 
training set. The embeddings produced capture informa- 
tion and semantic relationships about the function of the 
code, but ignore style variants. Thus, two programs with 
the same function will be similar, regardless of how they 
have been coded. The quality of the analyzed code also has 
an impact on learning. The names given to the variables are 
particularly important for prediction. 


3. THE code2aes2vec METHOD 


Two main strategies emerge for learning program embeddings:
by observing the results of program execution [15,
17] or by analyzing the script [3] and/or its AST [2]. Our
approach is at the intersection of these two strategies and 
thus aims to take advantage of the functional and syntactic 
descriptions of the programs to induce relevant embeddings. 
We thus propose the code2aes2vec method which proceeds 
in two steps: 


1. the code2aes step represents a program as an Abstract 
Execution Sequence (AES), corresponding to the AST 
paths used by the program during its execution on 
predefined test cases; 


2. the aes2vec step uses a neural network to construct 
the embedding of the programs based on their AES 
(using the doc2vec approach [8]). 


3.1 code2aes: construction of Abstract Execu- 
tion Sequences (AES) 


Translating a program into an AES requires providing, in 
addition to the program itself, a collection of test cases on 
which the program will be run in order to exploit its traces. 
In our educational context, the preparation of such a collec- 
tion of test cases is not an additional effort since test cases 
are generally integrated into training platforms to evaluate
submitted contributions. Moreover, this approach offers
teachers the possibility of introducing verification choices
and thus of driving the interpretation of their learners’ programs
according to their own pedagogical choices. For
example, consider an exercise whose objective is to find
a value in a table/list. A teacher wishing to emphasize algorithmic
efficiency may choose to integrate a few unit tests
for which the desired value appears early in the table. In
such a case, an efficient program stops the loop as soon as the
desired value appears. These test cases will thus make it
possible to distinguish two (valid) programs based on their
execution trace.


In practice the number of test cases provided by the teacher 
is quite small. We generally observe that fewer than ten test
cases are enough to evaluate whether a program is correct 
or not. 


Figure 2 illustrates the process of translating a program into 
an AES. This example considers as input the code submitted 
by a learner in response to the exercise ’write a Python func- 
tion that returns the minimum value in an input list’. First, 
the AST is constructed. It describes the syntactic struc- 
ture of the program in terms of control structures (if-else, 
for, while), function calls (call), assignments (assign), 
etc. Second, the code is executed on an example (here the 
input [12, 1, 25]) and its execution trace is kept, indi- 
cating the program lines successively executed. Finally, the 
AES is constructed by mapping these two levels of informa- 
tion: syntactic and functional. The sequence resulting from 
the trace is translated into a sequence of “words” extracted
from the nodes of the AST. 
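
The mapping between the AST and the execution trace can be sketched with the Python standard library. This is a minimal sketch and not the released code2aes2vec implementation: it keeps a single word per executed line (the type of the statement starting on that line), uses Python's own AST node names (`Assign`, `For`, `If`, `Return`) as vocabulary, and assumes the submission shown below; the helper names `head_symbols`, `executed_lines` and `partial_aes` are hypothetical.

```python
import ast
import sys

# Hypothetical learner submission for the "minimum of a list" exercise.
SOURCE = """\
def minimum(values):
    res = values[0]
    for i in values:
        if i < res:
            res = i
    return res
"""


def head_symbols(source):
    """Map each line number to the AST node type of the statement starting there."""
    symbols = {}
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, ast.stmt):
            symbols[node.lineno] = type(node).__name__
    return symbols


def executed_lines(func, *args):
    """Run func(*args) and return the line numbers executed inside func."""
    lines = []

    def tracer(frame, event, arg):
        if frame.f_code is not func.__code__:
            return None  # ignore frames of other functions
        if event == "line":
            lines.append(frame.f_lineno)
        return tracer

    sys.settrace(tracer)
    try:
        func(*args)
    finally:
        sys.settrace(None)
    return lines


def partial_aes(source, func_name, test_input):
    """Translate the execution trace on one test case into a sequence of words."""
    namespace = {}
    exec(compile(source, "<submission>", "exec"), namespace)
    symbols = head_symbols(source)
    trace = executed_lines(namespace[func_name], test_input)
    return [symbols[line] for line in trace if line in symbols]


aes = partial_aes(SOURCE, "minimum", [12, 1, 25])
print(" ".join(aes))
```

Running a learner's code with `exec` is only acceptable here because the example is a sketch; a real platform would execute submissions in a sandbox.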


Three levels of translation (or abstraction) are proposed ac- 
cording to the depth considered in the AST: 


• AES level 0: each program line is represented by a
single word corresponding to the head symbol of the
associated sub-tree in the AST (in red in Figure 2),


• AES level 1: each program line is represented by one
or more words corresponding to the head symbols of
the associated sub-tree and its main sub-trees (in red
and blue in Figure 2),


• AES level 2: each program line is translated into a sequence
of words corresponding to all the nodes appearing
in the associated sub-tree (in red, blue and black
in Figure 2).


For the last two levels, the names of the variables and parameters,
as well as the values of the constants, have been
normalized so as not to artificially extend the considered
“vocabulary”. Thus, the variable res is renamed var1 and
the variable i is renamed var2.
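
The renaming step can be sketched with an `ast.NodeTransformer`. This is a sketch under assumptions: variables are numbered in order of first assignment, parameters receive a separate `param` prefix (an assumption made so that res and i become var1 and var2 as in the example above), and constants are left untouched because the normalization scheme for constants is not detailed here. The class name `NameNormalizer` is hypothetical, and `ast.unparse` requires Python 3.9 or later.

```python
import ast


class NameNormalizer(ast.NodeTransformer):
    """Rename parameters to param1, param2, ... and variables to var1, var2, ..."""

    def __init__(self):
        self.mapping = {}
        self.n_params = 0
        self.n_vars = 0

    def visit_arg(self, node):
        # Function parameters get their own prefix (assumption, see lead-in).
        self.n_params += 1
        self.mapping[node.arg] = f"param{self.n_params}"
        node.arg = self.mapping[node.arg]
        return node

    def visit_Name(self, node):
        # Only names that are assigned somewhere are renamed, so that
        # built-ins such as len or print keep their original name.
        if node.id not in self.mapping and isinstance(node.ctx, ast.Store):
            self.n_vars += 1
            self.mapping[node.id] = f"var{self.n_vars}"
        node.id = self.mapping.get(node.id, node.id)
        return node


def normalize(source):
    """Return the normalized source code and the renaming table."""
    normalizer = NameNormalizer()
    tree = normalizer.visit(ast.parse(source))
    return ast.unparse(tree), normalizer.mapping


source = """\
def minimum(values):
    res = values[0]
    for i in values:
        if i < res:
            res = i
    return res
"""
normalized, mapping = normalize(source)
print(mapping)
print(normalized)
```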


Note that each execution of a program on one test case gen- 
erates a partial AES. A program will finally be represented 
by the concatenation of the partial AES obtained on each 
test case. An AES can thus be considered as a representa- 
tive text of the program. Each partial AES corresponds to 
a sentence of this text. 


3.2 aes2vec: learning program embeddings
from AES

The word2vec method [11] is based on the distributional hypothesis
of words in natural language [16]: a word can be
inferred from its context. For example, the CBOW (Continuous
Bag Of Words) version of word2vec trains
a feed-forward neural network to predict a (central) word
from its context. In word2vec, a context is defined by the
preceding and following words in the text. The structure of
the neural network is reduced to a single hidden layer (encoding);
the matrix W of the weights connecting the input
layer to the hidden layer contains the word embeddings.


word2vec has already been used to learn token embeddings 
from a computer program [6, 3]. However, the distributional
hypothesis seems less well satisfied by the tokens of a program
than by natural language, in particular because of the
very limited size of the vocabulary and, above all, a weakly
constrained compositionality (almost all combinations are
observed).


The authors of [8] have proposed a variant of word2vec that aims
to learn simultaneously the embeddings of the words and of the
documents from which they are extracted. The doc2vec method
is still based on the distributional hypothesis, which makes it
possible to predict a word knowing its context, but this time the
context integrates (in addition to the preceding and following
words) the identifier of the document from which the word
sequence comes. In doing so, the authors introduce
the idea that there are document-specific variations in the
natural/universal distribution of words in the language.


We exploit precisely this hypothesis of document-based dis- 
tributional variations for the processing of AES built from 
the programs. We consider that each program, during its 
execution, generates different sequences of tokens (AES). 
We then use the DM (Distributed Memory) version of the
doc2vec algorithm to train a feed-forward neural network
(with one hidden layer) to maximize the following log probability:


$$L = \sum_{s=1}^{S} \sum_{i=k}^{T_s-k} \log p(w_i^s \mid w_{i-k}^s, \dots, w_{i+k}^s, d_s) \qquad (1)$$


with $S$ the total number of documents (or AESs), $T_s$ the
total number of words (or tokens) in document $s$, $d_s$ the $s^{th}$
document, $w_i^s$ the $i^{th}$ word in document $d_s$ and $k$ the size of
the context on either side of the target word.


Figure 3 presents the architecture of the neural network
used for doc2vec as we use it for learning program embeddings
via their AES. The forward pass consists in first
calculating the values of the hidden layer by aggregating
the encodings of each word of the context and of the document:
$h(w_{i-k}^s, \dots, w_{i+k}^s; W, D)$, where $h(\cdot)$ denotes an
aggregation function to be defined (typically a sum, average
or concatenation), and $W$ and $D$ denote the word embedding
matrix (weight matrix between word inputs and
hidden layer) and the document embedding matrix (weight
matrix between document input and hidden layer) respectively.
The output of the neural network can be interpreted
as a probability distribution on the words of the vocabulary
by applying an activation function (softmax) on the output
$y_{w_i^s} = b + U h(w_{i-k}^s, \dots, w_{i+k}^s; W, D)$, where $b$ is a bias term
and $U$ the weight matrix between hidden and output layer:

$$p(w_i^s \mid w_{i-k}^s, \dots, w_{i+k}^s, d_s) = \frac{\exp(y_{w_i^s})}{\sum_{w} \exp(y_w)}$$

where the sum runs over the words $w$ of the vocabulary.


Figure 2: Construction of an AES from a Python program and a given test case.


Finally, the network weights are updated by stochastic gra- 
dient descent on the error, defined by the difference between 
the obtained output and the one-hot vector encoding the 
target word. 
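
The forward pass and the error signal described above can be sketched in plain Python. This is an illustrative sketch, not the network used in the experiments: the aggregation h is taken to be an average, the weights are small random values, and the vocabulary, the example sentence and all names (`forward`, `rand_vec`, etc.) are hypothetical.

```python
import math
import random


def softmax(scores):
    shift = max(scores)  # subtract the maximum for numerical stability
    exps = [math.exp(s - shift) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def forward(context, doc_id, W, D, U, b):
    """Return p(w | context, document) for every word w of the vocabulary."""
    vectors = [W[token] for token in context] + [D[doc_id]]
    dim = len(D[doc_id])
    # h(): here the average of the context word vectors and the document vector
    h = [sum(v[j] for v in vectors) / len(vectors) for j in range(dim)]
    # y = b + U h, one score per vocabulary word
    y = [b[r] + sum(U[r][j] * h[j] for j in range(dim)) for r in range(len(U))]
    return softmax(y)


rng = random.Random(0)
vocab = ["Assign", "For", "If", "Return"]
dim = 3


def rand_vec():
    return [rng.uniform(-0.5, 0.5) for _ in range(dim)]


W = {token: rand_vec() for token in vocab}  # word embeddings
D = {0: rand_vec()}                          # one document (AES) embedding
U = [rand_vec() for _ in vocab]              # hidden-to-output weights
b = [0.0] * len(vocab)                       # bias

# Hypothetical partial AES; the target is the word at position i, context size k.
sentence = ["Assign", "For", "If", "For", "If", "Assign", "For", "If", "For", "Return"]
i, k = 2, 1
context = sentence[i - k:i] + sentence[i + 1:i + k + 1]

probs = forward(context, 0, W, D, U, b)
target = vocab.index(sentence[i])
one_hot = [1.0 if r == target else 0.0 for r in range(len(vocab))]
error = [p - t for p, t in zip(probs, one_hot)]  # output minus one-hot target
loss = -math.log(probs[target])                   # negative log probability
print(context, [round(p, 3) for p in probs], round(loss, 3))
```

A gradient step would then propagate `error` back to U, b, the context rows of W and the document row of D.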


We consider programs as textual documents whose word se- 
quence is given by an AES obtained by the previous step 
(code2aes). Indeed, as for text, the choice and the order
of the “words” in an AES capture the semantics of the program,
i.e. what the program does (its function) and how it
operates (its style).


In this learning model, each AES vector (in D) is used only 
for predictions of the tokens from this AES, while token vec- 
tors (in W) are common to all AESs. The size of the vectors 
(for AES and tokens) is fixed and actually corresponds to 
the size of the desired representation space (embeddings). 


Once the model is trained, the D matrix contains the embeddings
of the programs. The positioning of a new program
in this embedding space consists in inferring² a new column
vector in D using the tokens from the new AES, the other
parameters of the model remaining fixed (W as well as the
softmax parameters).


Finally, let us mention that the choice of the aggregation
strategy used in the hidden layer can be decisive. Indeed,
a sum or average will consider each context as a bag-of-
words (without taking into account the order), whereas a
concatenation strategy offers the opportunity to exploit the
order of words within the context. While sequentiality (inside
the context) is not a determining factor in the construction
of embeddings for natural language, the experiments reported
in Section 4.2 confirm that the order of tokens is of high
importance for learning program embeddings using AESs.

² Inference is made by pursuing the learning on the neural
network.
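
The difference between the two aggregation strategies can be sketched in a few lines. The token names and the 2-dimensional vectors below are purely illustrative assumptions (they are not actual AES tokens or learned embeddings); the point is only that a sum maps two orderings of the same context to the same hidden input, whereas a concatenation keeps them apart.

```python
def aggregate_sum(vectors):
    """Component-wise sum: the context is treated as a bag of words."""
    return [sum(components) for components in zip(*vectors)]


def aggregate_concat(vectors):
    """Concatenation: the position of each word in the context is kept."""
    return [x for vec in vectors for x in vec]


# Hypothetical token embeddings (rows of W); values are illustrative only.
W = {"Assign": [1.0, 0.0], "For": [0.0, 1.0], "Return": [1.0, 1.0]}

context_a = ["Assign", "For"]
context_b = ["For", "Assign"]  # same tokens, opposite order

sum_a = aggregate_sum([W[t] for t in context_a])
sum_b = aggregate_sum([W[t] for t in context_b])
concat_a = aggregate_concat([W[t] for t in context_a])
concat_b = aggregate_concat([W[t] for t in context_b])

print(sum_a == sum_b)        # the sum cannot tell the two contexts apart
print(concat_a == concat_b)  # the concatenation can
```

The price of concatenation is a hidden input whose size grows with the context size, which is one reason a sum or average is often preferred for natural language.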


4. EVALUATION OF THE APPROACH 


4.1 Dataset presentation 

Educational data are complex since programs may contain 
errors, be small in size, may not fully meet the intended 
functions and may be relatively redundant. These data have
very different characteristics from the datasets used in soft-
ware development. For our experiments, we thus built and
used several real educational datasets (see Table 1). They
consist of Python programs submitted by students on two
training platforms in introductory programming courses. In
addition to our (documented) code2aes2vec code for learn-
ing program embeddings, we also make available³ these three
'corpora' of Python programs, the associated test cases, and
the AESs built on each program. All of the results presented
in the rest of this section can thus be fully and easily repro-
duced.


The NewCaledonia-5690 dataset (or NC-5690) includes the
programs created in 2020 by a group of 60 students from the
University of New Caledonia, on a programming training
platform⁴. The NewCaledonia-1014 dataset (or NC-1014)
includes a sub-part of NewCaledonia-5690 composed of
contributions associated with 8 exercises selected for their
algorithmic diversity (see Table 2) and their balanced
volume (100 to 150 programs per exercise). We will use
it as a 'toy' dataset facilitating qualitative analyses. The
Dublin-42487 dataset includes student programs from the
University of Dublin, written between 2016 and 2019.
Although the original corpus [3] contains nearly 600,000
programs (Python and Bash), we propose here a subset enriched
semi-automatically with test cases (not provided initially).


³ https://github.com/GCleuziou/code2aes2vec.git
⁴ Platform developed and made available by the CS depart-
ment of the Orléans University Institute of Technology.


Figure 3: The neural network aes2vec used to predict a word w_i from its original AES identifier and its context (here, two
previous and two following words).


Table 1: Characteristics of the three Python datasets col-
lected on programming training platforms. The 'Nb. of
words' reported is the total size of the AES corpus; the num-
ber in parentheses indicates the size of the 'vocabulary'.


Datasets                          NC-1014         NC-5690         Dublin-42487
Nb. programs                      1,014           5,690           42,487
Avg nb. test cases per program    13.1            10.4            3.7
Nb. correct programs              189             1,304           19,961
Nb. exercises                     8               66              65
Nb. 'words' AES-0                 113,223 (20)    761,726 (44)    7.4 M (38)
Nb. 'words' AES-1                 226,682 (42)    1.7 M (57)      15.2 M (83)
Nb. 'words' AES-2                 690,019 (71)    3.9 M (113)     40.4 M (209)


4.2 Embedding analysis 

In the following experiments, each dataset has been divided
into three sub-parts: training (90%), validation (5%) and
test (5%); the validation set is used to select the best
model among those learned during the different iterations
(aes2vec). Unless otherwise stated, the aes2vec algorithm
has been set up to learn embeddings of dimension 100, the
size of the context is set to 2, concatenation is used as the
aggregation stage and the training is performed over 500
iterations.


Table 2: Exercises in the NewCaledonia-1014 dataset.


Exercise           Statement                                                          #test cases
swapping           swap items in a list                                               4
minimum            look for the minimum in a list                                     8
compareStrings     compare two strings                                                11
fourMore100        return the first four values greater than 100 from an input list   7
indexOccurrence    return the index of the first occurrence of an item in a list      7
compareDates       compare two dates from their day, month and year                   30
polynomial         return the roots of a polynomial of degree 2                       6
dayNight           display information about the period of a day given a time         32
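
A 90/5/5 split of this kind can be sketched as follows. The shuffling, the fixed seed and the truncation of the split sizes are assumptions made for the example, since the text does not specify how the partition was drawn; with truncation, a dataset of 1,014 programs yields 912 training programs.

```python
import random


def split_dataset(items, train=0.90, valid=0.05, seed=0):
    """Shuffle the items and cut them into training, validation and test parts.

    Sizes are truncated to integers (an assumption); whatever remains
    after the training and validation parts goes to the test part.
    """
    shuffled = list(items)
    random.Random(seed).shuffle(shuffled)
    n_train = int(len(shuffled) * train)
    n_valid = int(len(shuffled) * valid)
    return (shuffled[:n_train],
            shuffled[n_train:n_train + n_valid],
            shuffled[n_train + n_valid:])


# Placeholder identifiers standing in for the 1,014 programs of NC-1014.
programs = list(range(1014))
train_set, valid_set, test_set = split_dataset(programs)
print(len(train_set), len(valid_set), len(test_set))
```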


In a first step, we evaluate our approach in a qualitative 
way on the dataset NewCaledonia-1014 constituted for this 
purpose. Figure 4 (left) shows a visualization of the 912 
programs of the training set, obtained by a non-linear pro- 
jection using the t-SNE dimension reduction algorithm [9]. 
It can be seen that, although embeddings are learned in an 
unsupervised manner, the code2aes2vec method learns, from 
a relatively limited number of training data, a representa- 
tion space in which the areas identify distinct program func- 
tionalities. Thus, the program vectors are organized quite 
naturally into 8 clusters that are highly correlated with the 
original 8 exercises. Moreover, the topological organization
of the clusters matches well the algorithmic structure of
the programs. Exercises 'swapping' and 'minimum' are close
in the embedding space and correspond to the only two ex-
ercises that iterate over all values of an input list. Exercises
'compareStrings', 'fourMore100' and 'indexOccurrence', in
the upper part of the space, require iterating over only part of
a list. Finally, the last three exercises do not require loops
and rely only on the use of conditional instructions. The
exercise 'dayNight' is distinguished by the expected use of a
display function (print) while all the other exercises return
a result (return). This feature may explain the 'isolation'
of the programs from this exercise from other programs.


Stylistic differentiation of programs is difficult to assess
quantitatively. This task would require either objective cri-
teria that can be extracted automatically or expensive ex-
pert labeling. Given the absence of such stylistic knowl-
edge in datasets, we choose to illustrate with an example how
the code2aes2vec method distinguishes different styles in the
writing of the same function. Figure 4 (right) presents in de-
tail the embedding space learned for the exercise 'minimum'.
It is interesting to note that program styles are clearly dis-
tinguished, notably the two ways to program a Python for
loop (using indexes vs. elements directly). At a finer level
of detail, programs are also grouped according to whether their
loop starts with the first element of the list (range(0,...))
or with the second one (range(1,...)), after initialization
of the minimum to the first element in either case.
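
To make these stylistic variants concrete, the three loop styles can be written as follows. These functions are illustrative reconstructions based on the styles named in the text and in Figure 4; they are not programs taken from the dataset, and the function names are hypothetical.

```python
def minimum_by_elements(liste):
    """Iterate directly over the elements of the list."""
    res = liste[0]
    for elem in liste:
        if elem < res:
            res = elem
    return res


def minimum_by_index_from_0(liste):
    """Iterate over indexes, starting with the first element."""
    res = liste[0]
    for i in range(0, len(liste)):
        if liste[i] < res:
            res = liste[i]
    return res


def minimum_by_index_from_1(liste):
    """Iterate over indexes, starting with the second element."""
    res = liste[0]
    for i in range(1, len(liste)):
        if liste[i] < res:
            res = liste[i]
    return res
```

All three return the same value on any non-empty list, which is why a purely functional criterion cannot separate them, while their execution traces differ.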


Word order is important in our aes2vec method. For ex-
ample, our approach distinguishes programs having a simi-
lar boolean condition but expressed in a different order (if
liste[i]<res: vs. if res>liste[i]:). This distinction
may seem artificial since these two expressions are strictly
equivalent from the evaluation point of view. However, the
first syntax appears more 'natural' than the second. It would
be easy to get rid of this phenomenon by normalizing the
expressions at the code2aes step; this option can be left to
the teacher's discretion.
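
One possible way to perform such a normalization is sketched below with Python's standard ast module. This is an assumption about how the step could be implemented, not a description of the code2aes code: the class and function names are hypothetical, only simple (non-chained) `>` and `>=` comparisons are rewritten, and ast.unparse requires Python 3.9 or later.

```python
import ast


class NormalizeComparisons(ast.NodeTransformer):
    """Rewrite `a > b` as `b < a` and `a >= b` as `b <= a`."""

    _swap = {ast.Gt: ast.Lt, ast.GtE: ast.LtE}

    def visit_Compare(self, node):
        self.generic_visit(node)  # normalize nested expressions first
        # Chained comparisons (a < b < c) are left untouched.
        if len(node.ops) == 1 and type(node.ops[0]) in self._swap:
            new_op = self._swap[type(node.ops[0])]()
            swapped = ast.Compare(left=node.comparators[0],
                                  ops=[new_op],
                                  comparators=[node.left])
            return ast.copy_location(swapped, node)
        return node


def normalize(source):
    """Return the source code with comparisons put in a canonical order."""
    tree = NormalizeComparisons().visit(ast.parse(source))
    ast.fix_missing_locations(tree)
    return ast.unparse(tree)


first = normalize("if liste[i]<res:\n    res=liste[i]")
second = normalize("if res>liste[i]:\n    res=liste[i]")
print(first == second)
```

Swapping the operands also swaps their evaluation order, which only matters when the operands have side effects; this is rarely the case in introductory exercises, but it is a reason to keep the normalization optional.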


Finally, we draw the reader's attention to some valid but
atypical programs, in particular a program using the native
Python function min and one using the sort function.
Their separation from the rest of the programs is crucial
since it offers a way to detect programs (a priori valid) that a
teacher would like to reject or at least moderate, considering
that they deviate from his/her pedagogical objective. More
generally, this analysis seems to confirm that the embedding
spaces learned by the code2aes2vec method correctly cap-
ture not only the function of the programs but also their
style. Their intrinsic quality paves the way for many prac-
tical uses that could significantly improve the efficiency of
learning platforms (detection of atypical solutions, automa-
tion/propagation of feedback, student 'trajectory' analysis,
study of error typologies, etc.).


In a second step, we evaluate the code2aes2vec approach
from a quantitative point of view on the three datasets (Ta-
ble 3). We consider a usual task for program embedding
evaluation, namely the prediction of a program's function
(i.e. exercise identification). For each considered configuration (AES
level), the training data are first used to learn (without su-
pervision) a representation space. Then, these embeddings
are used to learn (with supervision) an SVM classifier (with
polynomial kernel) [5]. Finally, the embeddings are (in-
directly) assessed according to their ability to predict the
function of the code (i.e. the exercise) for test data.


As baselines, the random classifier indicates the a priori
difficulty of this task, while doc2vec corresponds to the (naive)
use of the doc2vec algorithm [8] to learn embeddings from
the codes directly (without any intermediate representa-
tion). We also report the results obtained by the (super-
vised) code2vec approach⁵ [2] executed with default param-
eters.


                      NC-1014    NC-5690    Dublin-42487
random classifier     0.125      0.015      0.009
code2vec [2]          0.230      0.098      0.037
doc2vec [8]+SVM       0.412      0.495      0.380
code2aes2vec+SVM
  (AES-0)             0.882      0.460      0.391
  (AES-1)             1.0        0.698      0.544
  (AES-2)             1.0        0.832      0.651


Table 3: Quantitative and comparative evaluation of the pro-
duced embeddings, on the task of retrieving the function of a
program (accuracy).


It can be seen that the code2vec model recently proposed
by [2] cannot be trained satisfactorily on any of the three
datasets. This is due to the numerous parameters to
learn and the large number of examples this method re-
quires. In the largest dataset (Dublin-42487), code2vec
has 'only' 42,487 programs as inputs. In contrast, our
code2aes2vec method has several million entries thanks to our
AES intermediate representation.


The comparative results obtained with three different levels 
of AES (denoted by AES-0, AES-1 and AES-2 in Table 3) 
confirm that the quality of the embeddings is improved when 
the level of detail of the AES increases. Level 2 AESs (AES-2)
yield the best accuracy on NC-5690 and Dublin-42487 and
match AES-1 on NC-1014.


In order to take into account the word/token order, 
code2aes2vec and doc2vec have been set up so far with con- 
catenation as aggregation step. In order to confirm the im- 
portance of the order, we compare in Figure 5 embeddings 
obtained by the code2aes2vec algorithm with both types of 
aggregation (sum vs. concatenation). Unlike for concatena- 
tion, we observe a very rapid degradation in the quality of 
the embeddings obtained with a sum type aggregation when 
the size of the context increases. Indeed, the vocabulary on 
which AESs are based is very limited (only a few dozen or 
even hundreds of tokens) and the distribution of these words 
in AESs is not uniform. Thus it quickly becomes difficult to 
differentiate contexts as their size increases without taking 
into account the word order. 


4.3 Application to feedback propagation 

In order to confirm that the learned embeddings are fine-grained
enough to be usefully exploited in an educational context,
we have implemented a first proof of concept on the task of
propagating feedback.


We have considered the exercise 'mean' from the dataset
NewCaledonia-5690 whose instruction was to write a Python


⁵ The other methods presented in the state of the art could
not be compared because of the lack of available operational
implementations.


258 Proceedings of The 14th International Conference on Educational Data Mining (EDM 2021) 


[Figure 4, plots. Left panel: t-SNE projection of all program embeddings, with clusters labeled by exercise (compareStrings, indexOccurrence, fourMore100, polynomial, compareDates, ...). Right panel: detail of the 'minimum' area, annotated with the program styles found there: path by the elements (for ... in liste:), path by indexes (for i in range(0,len(liste)): and for i in range(1,len(liste)):), the two condition orders (if liste[i]<res: and if res>liste[i]:), and programs using the native functions min() and sort().]


Figure 4: Visualization of program embeddings obtained by the method code2aes2vec for the 8 exercises of the dataset
NewCaledonia-1014. The colors identify the exercises, with incorrect programs in light and correct ones in dark. The fig-
ure on the left represents all the embeddings and the one on the right details the area associated with the 'minimum' exercise.

Figure 5: Evaluation of the embeddings on each of the three datasets according to the type of AES, the aggregation method and
the context size (nb. words before/after).


function returning the mean of the values contained in a list
passed as a parameter (and returning None if this list is
empty). For this exercise, 157 programs were submitted to
the platform by 24 different students; among these sub-
missions, 122 programs were evaluated as incorrect by the
platform, and it is on these that we sought to propagate
teacher feedback based on their embeddings.


For this purpose, we performed a clustering (k-means [10])
on the 122 incorrect programs. We then presented to a
teacher the most representative program (cluster medoid) of
each cluster obtained, asking him to provide one feedback to
help the student correct his proposal⁶. Once the k feedbacks
were compiled (one per cluster/medoid), we went through
each cluster and asked the teacher, for each program (other
than the medoid), to indicate whether the feedback defined
for the medoid could be applied to that other program. The
objective is thus to assess the extent to which feedback from
one medoid can propagate to all other programs in the same
cluster.


⁶ If more than one error is found, the teacher must choose the
one he/she feels needs to be corrected first. Each feedback
is thus limited to the resolution of a single error.


Operationally, clustering is performed on the 122 incorrect
programs, defined by their embeddings in $\mathbb{R}^{100}$. For a fixed
number k of clusters, the partition selected to be analyzed
is the one minimizing the MSE (Mean Square Error) among
100 runs (random initializations) of k-means.
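
The selection procedure described above can be sketched in plain Python as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the function names are hypothetical, initial centers are drawn among the data points, and the medoid is taken to be the cluster member with the smallest summed Euclidean distance to the other members (one common definition).

```python
import math
import random


def sq_dist(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))


def assign(points, centers):
    """Index of the nearest center for every point."""
    return [min(range(len(centers)), key=lambda c: sq_dist(p, centers[c]))
            for p in points]


def kmeans_once(points, k, rng, n_iter=100):
    """One k-means run from a random initialization; returns (labels, mse)."""
    centers = [list(p) for p in rng.sample(points, k)]
    for _ in range(n_iter):
        labels = assign(points, centers)
        new_centers = []
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            # An empty cluster keeps its previous center.
            new_centers.append(
                [sum(col) / len(members) for col in zip(*members)]
                if members else centers[c])
        if new_centers == centers:
            break
        centers = new_centers
    labels = assign(points, centers)
    mse = sum(sq_dist(p, centers[lab])
              for p, lab in zip(points, labels)) / len(points)
    return labels, mse


def best_partition(points, k, n_runs=100, seed=0):
    """Keep the run with the smallest MSE among n_runs random restarts."""
    rng = random.Random(seed)
    return min((kmeans_once(points, k, rng) for _ in range(n_runs)),
               key=lambda run: run[1])


def medoid_indices(points, labels):
    """For each cluster, the member closest (in total) to the other members."""
    result = {}
    for c in set(labels):
        members = [i for i, lab in enumerate(labels) if lab == c]
        result[c] = min(members, key=lambda i: sum(
            math.sqrt(sq_dist(points[i], points[j])) for j in members))
    return result
```

In practice, `points` would hold the embeddings of the incorrect programs, and the programs indexed by `medoid_indices` would be the ones shown to the teacher.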


Table 4 presents the partition obtained for 5 clusters (k = 5). 
For each cluster we indicate its size, the program associated 
with its medoid as well as the feedback provided by the 
teacher for this medoid. 


Let us first observe that the feedback provided by the
teacher may relate to errors that differ in nature: either
an error in the design of the algorithm (clusters 1 and
3) or an error in the writing of the Python program (clusters
2, 4 and 5).

Figure 6: Propagation of teacher feedback to neighboring pro-
grams in each cluster. Illustration on the exercise mean.


We repeated this work with a partition into 10 clusters, this
time asking the teacher to provide 10 feedback messages. Finally,
we measured the rate of correct feedback when propagating
the feedback from each medoid to its neighboring programs
in the same cluster. Figure 6 shows the evolution of the
correct feedback rate as a function of the neighborhood size
considered for propagation. It can be seen that the further
a program is from its medoid, the more errors occur in the
automatically determined feedback.


Of course, a small number of feedback messages (5 or 10) is not
enough to cover all the errors present in the 122 incorrect
programs. In practice, feedback would therefore not be
propagated to all the programs in a cluster but only to the
neighboring programs of the medoid. The dashed curves in
Figure 6 indicate the proportion of programs covered as a
function of the size of the neighborhood under considera-
tion. We can see that with a neighborhood radius corre-
sponding to a distance of 1.0, a propagation over 5 clusters
covers 33% of the programs with a precision of 90%;
similarly, such a propagation over 10 clusters covers
more programs (43%) with a higher precision (93%).
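
The two quantities plotted in Figure 6 can be computed as in the following sketch. The data layout (one distance to the medoid and one teacher judgment per program) and the function name are assumptions made for the example; whether the medoids themselves are counted as covered is not stated in the text and is left to the caller. The values in the usage lines are made up.

```python
def propagation_stats(distances, feedback_ok, radius):
    """Coverage and precision of feedback propagation for a given radius.

    distances[i]   : distance from program i to the medoid of its cluster
    feedback_ok[i] : True if the teacher judged the medoid's feedback
                     applicable to program i
    """
    covered = [ok for d, ok in zip(distances, feedback_ok) if d <= radius]
    coverage = len(covered) / len(distances)
    # Precision is undefined when no program falls inside the radius.
    precision = sum(covered) / len(covered) if covered else None
    return coverage, precision


# Made-up example: four programs at increasing distance from their medoid.
example_distances = [0.2, 0.8, 1.5, 2.0]
example_judgments = [True, True, False, True]
print(propagation_stats(example_distances, example_judgments, radius=1.0))
print(propagation_stats(example_distances, example_judgments, radius=1.5))
```

Enlarging the radius can only increase coverage, while precision typically drops as more distant programs are included, which is the trade-off the teacher has to set.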


In this example, the use of embeddings assists the
teacher in the task of accompanying the students. For each
feedback requested, 4 to 5 additional neighboring programs
are automatically processed. Moreover, it is reasonable to
think that, over time, the teacher will define a sufficiently
large collection of feedback messages to cover almost the
entire embedding space, so that a relevant feedback can be
identified systematically each time a new incorrect program
is submitted and its embedding falls in a pre-identified
neighborhood.


5. CONCLUSION AND PERSPECTIVES 


This paper studies the problem of learning vector representa- 
tions, or embeddings, of programs in an educational context 
where the function is just as important as the style. Faced 
with this problem, we propose the method code2aes2vec 
transforming the code into abstract execution sequences 
(AES), and then into embeddings. This approach adapts
the doc2vec method to programs and is based
on the document-based distributional variations hypothesis.


The publication of the source code of the approach is ac-
companied by the release to the community of a new
enriched 'corpus' composed of more than 5,000 student pro-
grams (Python). Experiments conducted on these new data
and on a public data set validate the quality of the learned
embeddings, which capture the function and the style of
the programs in fine detail. In addition, a promising proof of con-
cept was carried out on a classical task in the field, namely
the propagation of teacher feedback.


The perspectives of this work are numerous. First, our
experiments focus on programs written in introductory
courses, i.e. fairly simple code. It would be interesting to
analyze more elaborate programs (from more advanced courses)
and to evaluate the impact of code complexity on perfor-
mance.


Then, it seems necessary to complete the results observed 
on the stylistic differentiation of programs, by formalizing 
the notion of style of a program, in order to quantitatively 
evaluate our program embeddings. In the same way, a more 
precise analysis of the test cases used will have to be carried 
out in order to determine to what extent the constructed 
embeddings are sensitive to them. 


Finally, to have more exploitable corpora, we plan to extend 
our implementation to handle any type of language (the cur- 
rent implementation only processes the Python language). 


From a more methodological perspective, all the words in 
the program have the same weight during the embedding 
construction in our approach. Thus, a correct program and 
one returning a wrong value (or throwing an error) may have 
very similar embeddings, although functionally very differ- 
ent. This aspect could be integrated in the construction of 
our AES or in the architecture of the neural network used to 
generate embeddings. For that, it could also be interesting 
to add to our AES the values taken by the variables, in the
same way as [17] but in a generic multimodal approach.
Another perspective would be to allow the expert to inte-
grate part of his knowledge of the language. As discussed
previously, some instruction sequences can be equivalent
(e.g., if liste[i]<res: vs. if res>liste[i]:). Se-
mantic relations between words can also be known (e.g., the
relation between for and while statements). This knowl-
edge could be used to constrain the neural network and
guide embedding construction. Finally, these program em-
beddings open up a large number of perspectives for teaching
aid, in addition to the task of feedback propagation. For ex-
ample, they could be used to identify error typologies, alter-
native solutions, or even predict dropout students through
the analysis of their 'trajectories'.


Table 4: Description of the 5 cluster partition generated by k-means on the embeddings of the incorrect programs from the
mean exercise. For each cluster: (1) the number of programs, (2) the program associated with the medoid of
the cluster and (3) the feedback defined by the teacher for this program. Instructions in red are the ones that are questioned
in the feedback.


Cluster 1 (#31)

    def mean(l):
        if len(l)==0:
            res=None
        else:
            res=0
            cpt=0
            for elem in l:
                res=res+elem
                cpt=cpt+1
                res=elem/cpt
        return res

    Feedback: The division step must be performed once the sum
    calculation is completed (put this instruction out of the for loop).


Cluster 2 (#22)

    def mean(l):
        if len(l)==0:
            res=None
        else:
            s=0
            for elem in l:
                s+=elem
            res=s//len(l)
        return res

    Feedback: The // operator corresponds in Python to the integer
    division. For the computation of a mean a simple division is
    required (operator /).


Cluster 3 (#28)

    def mean(l):
        if len(l)==0:
            res=None
        else:
            res=0
            cpt=0
            avg=0
            for elem in l:
                res=res+elem
                cpt=cpt+1
            avg=res/cpt
        return avg

    Feedback: In the case of an empty list, your function does not
    return None (as requested).


Cluster 4 (#9)

    def mean(l):
        if l==():
            res=none
        else:
            res=0
            for i in range(len(l)):
                x=res+l[i]
            res=x/len(l)
        return res

    Feedback: The null value in Python is written 'None' (instead of
    'none').


Cluster 5 (#32)

    def mean(l):
        if len(l)==0:
            res=None
        else:
            res=0
            cpt=0
            for elem in l:
                res=res+elem
                cpt=cpt+2
            res=res%cpt
        return res

    Feedback: The % operator corresponds in Python to the modulus.
    For the computation of a mean a simple division is required
    (operator /).


REFERENCES

[1] M. Allamanis, E. T. Barr, P. Devanbu, and C. Sutton.
A survey of machine learning for big code and
naturalness. ACM Computing Surveys, 51(4):1-37,
2018.

[2] U. Alon, M. Zilberstein, O. Levy, and E. Yahav.
code2vec: Learning distributed representations of
code. Proceedings of the ACM on Programming
Languages, 3(POPL):1-29, 2019.

[3] D. Azcona, P. Arora, I.-H. Hsiao, and A. Smeaton.
user2code2vec: Embeddings for profiling students
based on distributional representations of source code.
In Proceedings of the International Conference on
Learning Analytics & Knowledge, pages 86-95, 2019.

[4] R. Bazzocchi, M. Flemming, and L. Zhang. Analyzing
CS1 student code using code embeddings. In
Proceedings of the 51st ACM Technical Symposium on
Computer Science Education, pages 1293-1293, 2020.

[5] B. E. Boser, I. M. Guyon, and V. N. Vapnik. A
training algorithm for optimal margin classifiers. In
Proceedings of the fifth annual workshop on
Computational learning theory, pages 144-152, 1992.

[6] D. DeFreez, A. V. Thakur, and C. Rubio-Gonzalez.
Path-based function embedding and its application to
error-handling specification mining. In Proceedings of
the ACM Joint Meeting on European Software
Engineering Conference and Symposium on the
Foundations of Software Engineering, pages 423-433,
2018.

[7] J. Henkel, S. K. Lahiri, B. Liblit, and T. Reps. Code
vectors: understanding programs through embedded
abstracted symbolic traces. In Proceedings of the ACM
Joint Meeting on European Software Engineering
Conference and Symposium on the Foundations of
Software Engineering, pages 163-174, 2018.

[8] Q. Le and T. Mikolov. Distributed representations of
sentences and documents. In International conference
on machine learning, pages 1188-1196, 2014.

[9] L. van der Maaten and G. Hinton. Visualizing
high-dimensional data using t-SNE. J Mach Learn Res,
9:2579-2605, 2008.

[10] J. MacQueen et al. Some methods for classification
and analysis of multivariate observations. In
Proceedings of the fifth Berkeley symposium on
mathematical statistics and probability, volume 1,
pages 281-297. Oakland, CA, USA, 1967.

[11] T. Mikolov, K. Chen, G. Corrado, and J. Dean.
Efficient estimation of word representations in vector
space. arXiv preprint arXiv:1301.3781, 2013.

[12] T. D. Nguyen, A. T. Nguyen, H. D. Phan, and T. N.
Nguyen. Exploring API embedding for API usages and
applications. In IEEE/ACM International Conference
on Software Engineering, pages 438-449. IEEE, 2017.


[13] H. Peng, L. Mou, G. Li, Y. Liu, L. Zhang, and Z. Jin. 
Building program vector representations for deep 
learning. In International Conference on Knowledge 
Science, Engineering and Management, pages 
547-553. Springer, 2015. 

[14] J. Pennington, R. Socher, and C. D. Manning. Glove: 
Global vectors for word representation. In Proceedings 
of the 2014 conference on Empirical Methods in 
Natural Language Processing, pages 1532-1543, 2014. 

[15] C. Piech, J. Huang, A. Nguyen, M. Phulsuksombati, 
M. Sahami, and L. Guibas. Learning program 
embeddings to propagate feedback on student code. In 
Proceedings of the 32nd International Conference on 
Machine Learning, ICML’15, page 1093-1102, 2015. 

[16] G. Salton, A. Wong, and C.-S. Yang. A vector space 
model for automatic indexing. Communications of the 
ACM, 18(11):613-620, 1975. 

[17] K. Wang, R. Singh, and Z. Su. Dynamic neural 
program embeddings for program repair. In 
International Conference on Learning Representations, 
2018. 




Student-centric Model of Login Patterns: A Case Study 
with Learning Management Systems 


Varun Mandalapu¹, Lujie Karen Chen¹, Zhiyuan Chen¹, Jiaqi Gong²
¹University of Maryland Baltimore County, Baltimore, Maryland, USA, 21250
²The University of Alabama, Tuscaloosa, Alabama, USA, 35487


{varunm1, lujiec, znchen}@umbc.edu, jiaqi.gong@ua.edu


ABSTRACT 


With the increasing adoption of Learning Management Systems
(LMS) in colleges and universities, research exploring the
interaction data captured by these systems holds promise for
developing a better learning environment and improving teaching
practice. Most of these research efforts have focused on course-level
variables to predict student performance in specific courses.
However, findings for individual courses are of limited use for
developing beneficial pedagogical interventions at the
student level because students often take multiple courses
simultaneously. This paper argues that student-centric models will
provide systematic insights into students' learning behavior to
develop effective teaching practice. This study analyzed data from
1,651 undergraduate students, collected in Fall 2019 from computer
science and information systems departments at a US university
that actively uses Blackboard as an LMS. The experimental
results demonstrate the prediction performance of student-centric
models and explain the influence of various predictors related
to login volumes, login regularity, login chronotypes, and
demographics on predictive models. Our findings show that
student prior performance and normalized student login volume
across courses significantly impact student performance models.
We also observe that regularity in student logins has a significant
influence on low-performing students and students from minority
races. Based on these findings, we discuss implications for
developing potential teaching practices for these students.


Keywords 


Student-centric Modeling, Learning Management Systems, Login 
Variables, Student Performance Prediction. 


1. INTRODUCTION 


Teaching and learning have changed considerably in recent years with the
increasing adoption of new computer-based teaching and learning
technologies in educational institutions worldwide. As education
and learning technology evolves, leveraging
technical advances to improve teaching practice and student
learning will be a prominent research area. The most common
technologies used by instructors to deliver course content include
Learning Management Systems (LMS), Course Management
Systems (CMS), and Learning Content Management Systems
(LCMS) [1]. Even though these systems seem to be synonymous,
each has its specific use in the education domain. LMS tools
focus on communication, collaboration, content delivery, and
assessment, whereas an LCMS is similar to an LMS with fewer
administrative functions. A CMS, on the other hand, focuses on
the enrollment and performance of students. Of these three
systems, the LMS is the one best suited for delivering
learning strategy to students and is the primary focus of this study.


LMS systems provide a unique opportunity for administrators and
researchers to evaluate student data related to time spent on an
activity, access times and day, grades, interactions, and many 
other useful student learning variables. The data logs collected by 
LMS systems are analyzed with scientific techniques published in 
the Educational Data Mining (EDM) domain. In their study, 
Romero and Ventura [2] described that current EDM methods rely 
on clustering and pattern recognition techniques to categorize 
students into various groups based on their interaction patterns. 
Categorization of students using clustering and pattern recognition 
supports instructors in making changes for a set of students. 
Teaching practices that impact the entire classroom can be 
evaluated using predictive analytics that tracks student learning 
and achievement from the vast amount of interaction data 
collected by LMS. 


Existing research in Learning Analytics (LA) and EDM focused 
on developing highly accurate predictive models that can estimate 
student learning outcomes related to assignment scores, course 
grades, and drop-out probability [3,4]. These course-based 
predictive models provide early warning to student counselors or 
instructors associated with a specific course [5,6]. Even with 
considerable success in this area, many of the student performance 
prediction models have several shortcomings. One significant
issue with course-based models is the bias introduced by teaching
style and the type of course (descriptive, programming,
mathematical, etc.). This bias limits these models' scalability
across different courses and makes it difficult to understand the
effect of student-level factors on achievement. For example, if a
student enrolls in five courses, developing models to study the
student's progress in these five courses independently is not
realistic and gives different insights based on varying features and
performances. Therefore, these modeling efforts are limited in
their ability to reduce the biases introduced by the instructor and
the diverse amount of content made available in the LMS.


Course-level predictions are suitable for supporting instructor-level
decision making; however, if the intervention targets student-level
behaviors such as study habits or self-regulation skills, it is
beneficial to look at student-centered indicators so that
interventions may be more targeted and cost-effective [7,8].
Developing student-centric models that analyze student LMS
interactions across courses in a college/university setting will help
address the issues with course-specific models. This study is the
first step in developing models that support the identification of
student-level indicators.


For colleges that have a high penetration of LMS, LMS activity 
may give a holistic indicator of students' engagement level 
(behavioral engagement specifically). We ask to what extent 
those holistic indicators predict a student's future term Grade 
Point Average (GPA). To explore this, we 
specifically focus on student login related features as they can be 
generalized across courses and act as proxy variables for time 
management [51, 52]. It is also challenging to aggregate other 
features like discussions, readings, and assessments across courses 
compared to access related LMS variables. This study's data is 
drawn from Blackboard Learn, a commercial LMS software 
available for colleges and universities to deliver course content 
and assessments through internet-enabled computer systems. Most 
importantly, the data is drawn from all students in computer 
science and information systems at a large public university in the 
US during the Fall 2019 semester. In addition to student 
interactions from LMS, we also access demographic and prior 
student performance data from the university's student 
administration system to build and interpret downstream 
predictive models. The university's Institutional Review Board 
(IRB) approved this study, and all the student specific 
demographic and personal information are anonymized by 
following General Data Protection Regulation (GDPR) standards. 


In this work, we focus on model predictions and explanations to 
understand student learning behaviors. First, we apply new 
methods to process student interaction data collected across 
different courses enrolled in a semester to build student-centric 
performance models based on machine learning principles. 
Secondly, we combine local model explanations, correlation, and 
regression to understand the impact of various features captured 
by LMS on student performance. One primary reason for using 
Local Interpretable Model-agnostic Explanations (LIME) is its 
ability to explain the relationship between predictor variables 
and predictions, especially each input variable's impact on the 
outcome. Statistical correlation analysis, on the other hand, 
provides the relation between input predictors and the observed 
target variable. As correlation analysis does not consider the 
interaction effect between input variables, we also use a linear 
regression model to study feature importance for the output 
variable based on the model coefficients. To address the research 
gap discussed earlier, we explore the three research questions 
below. 


RQ1 How do different student-centric machine learning models 
perform in predicting student end-of-term GPA? 


RQ2 How do student login and time interval patterns across 
courses influence student learning outcomes? 


RQ3 Is there significant variability in feature importance 
for students coming from diverse demographics? 


2. RELATED WORK 


Universities and colleges around the world adopted LMS systems, 
such as Moodle and Blackboard, to provide onsite, hybrid, and 
online courses based on their capabilities to support 
communication, content creation, administration, and assessment 
[9, 10]. Besides the automation and centralization of various 
administrative tasks like creating and managing student accounts, 
creating syllabus, assignments, assessments, grading, etc., LMS 
systems assemble and deliver personalized learning materials and 
content quickly [11]. These systems also support the reusability of 
materials created by instructors. The systems also enable 
instructors to create content structures, deliver them in a sequence, 
maintain control access, organize group activities, track student 
activities, load and replace learning materials and provide 
feedback on assessments. With advanced database software 
developed by Oracle, IBM, and Microsoft that emphasize 
interconnectedness, data independence, and security, LMS 
systems employ various login roles based on user classification. 
These roles will permit instructors to create new content or 
privately address student issues and create discussion boards to 
capture student knowledge on specific topics. 


LMS platforms enable students to access learning material in 
various formats, such as pdf, PowerPoint presentations, video 
lectures, and audio files. The systems also track student activity, 
such as content downloads and access timestamps, to display 
student learning progress to instructors [12]. LMS also provides 
both asynchronous and synchronous communication for students 
to interact with instructors and encourages group activities. 
Combining the tools provided by LMS with innovative learning 
strategies like self-directed learning, small group instructions, and 
collaborative learning with instructor interventions, a wide variety 
of activities can be developed for individual, small groups, or 
larger classes [12,13]. Given the simplicity and convenience of 
accessing online materials through LMS systems, it is not 
surprising to see high student satisfaction scores for courses 
delivered through LMS. 


In recent years, the data sets related to student learning activities 
have drawn significant attention from researchers in academic 
communities to develop possible solutions to address student 
retention and academic success issues. This type of work has been 
called learning analytics and focuses on student activities such as 
navigating lecture materials, what information is accessed, how 
long it takes to complete an activity, and how students transform 
the information in learning materials into measurable learning [14, 
15]. Multiple commercial tools such as SPSS, Google Analytics, 
Stata, and NVivo can build predictive models on data captured by 
LMS to assess student drop-out probabilities to develop targeted 
learning courses or model collective learning behaviors. Since 
most instructors deliver course assessments and material through 
LMS, they can track student activity by processing a digital 
footprint during every online interaction captured by system log 
files. 


2.1 Learning Analytics Research 

LMS systems have the ability to capture large data streams related 
to user interactions through which administrators and instructors 
can develop methods to improve the learning experience. The 
collection, analysis, and reporting of data about learning activities 
on web-enabled learning platforms to assess student academic 
progress, predict performance, and identify potential issues that 
need attention is the central proposition of emerging fields like 
learning analytics and educational data mining [16, 17]. Outcomes 
derived from learning analytics aim to gain insights about student 
learning behaviors, real-time information about institutional 
practices and support the designing of personalized courses in 
CMS. Although there are huge data stores in universities and 
colleges that can be used to make data-driven decisions to support 
optimal use of both pedagogical and economic resources, to date 
there has been minimal application of this data in higher education 
[18]. 


2.1.1 Student Engagement and Frequency of LMS Use 


An LMS system records student interaction details related to 
logins, number of posts written on discussion threads, time spent 
on lecture materials, total downloads, etc., in their log files. These 
logs can be analyzed to generate reports that help teachers to 
observe student progress at a granular level. Once there are 
enough student records collected in LMS, they can be used to 
develop computational models to predict future student 
performances. Multiple works in EDM and LA studied the 
relation between the usage of LMS and student academic 
achievements. Vengroff and Bourbeau's [19] study showed 
evidence that providing additional material in LMS benefited 
students at the undergraduate level. They also conclude that 
students who used LMS regularly did better in exams than their 
peers who have minimal interactions. In their research, Dutt and 
Ismail [20] observed that tracking resources students interact with 
on LMS supports developing new strategies that make learning 
easier and enhance learner progress. Their work also focused on 
analyzing thresholds related to student interaction features like 
self-assessment tests, time spent on exercises, discussion forums, 
and performance outcomes. Another study by Lust et al. [21] 
explored the usage variations in different tools used by students 
on LMS, such as time on web-link, time on web-lectures, time on 
a quiz, time on feedback, postings on discussion board, and 
messages read. The results from this study heavily contributed to 
the development of adaptive and innovative recommendation 
systems. In their work, Hung and Zhang [22] also found patterns 
based on six indices that represent student effort: Frequency of 
accessing the course material, number of LMS logins, total 
interactions in discussion threads, number of synchronous 
discussions, number of posts read, and final grades in a course. 


While exploring a link between student online activity on LMS 
and their grades, Dawson et al. [23] observed a significant 
difference in the number of online sessions accessed, total time 
spent, and the number of posts in discussion forums between high 
and low performing students. Another study by Damianov et al. 
[24] developed a multinomial logistic regression model based on 
time spent in LMS. This study found a significant relationship 
between student time spent and grades, especially in students who 
attained lower grades between D and B. Instead of using time 
spent online, other works focused on the frequency of course 
material access within LMS. A study by Baugher et al. [25] found 
that regularity in student hits is a reliable predictor of student 
performance compared to the total number of hits. In their study, 
Chancery and Haque analyzed student interaction logs of 112 
undergraduate students and found students with low LMS access 
rates obtained lower grades than their peers with higher access 
rates. This study was complemented by Biktimirov and Klassen 
[26], who reported a strong relationship between student hit 
consistency and success. Their study counted access to various 
LMS activities and found that homework solution access is the 
only strong predictor of student performance. However, these 
studies are primarily descriptive rather than predictive. 


2.1.2 Instructional Design and Student Participation 
Online teaching strategies depend primarily on instructional 
design, as each mode of interaction (student/instructor, 
student/student, and student/content) has its own positive 
impact on student progress. A study by Coldwell et al. [27] 
focused on the relationship between student participation in a 
fully online course and their final grades. They found a positive 
relationship between student participation and final grade. 


Dawson et al. [23] examined the impact of various LMS tools and 
found a highly positive correlation between discussion forum 
activity and student success. They observed more than 80% of 
interactions occurred in the discussion forum, which is the 
primary interaction tool in LMS. Another study by Greenland [28] 
found that asynchronous communication is the primary form of all 
online course interactions. Nandi et al. [29] found an increasing 
number of posts in discussion forums close to assignment and 
exam deadlines. They also found a high correlation between exam 
scores and online class participation throughout the semester, 
especially in high-achieving students. 


All the studies discussed above adopted log files from LMS 
systems to extract unbiased details from activity and performance 
to identify a relationship between independent interaction 
variables and student grades. Most of the discussed studies are 
based on univariate analysis focusing on a single variable or a set 
of highly impactful variables of a single course or similar courses 
on student outcomes. However, student performance is a highly 
complex area in education to measure or understand, especially 
across various courses offered on-campus in a university setting. 
Most of the authors discussed above noted the need for more in- 
depth works to investigate student performance across courses and 
based on multiple variables. These studies also lack an 
explanation about variables used in their studies to track student 
performance, and it is evident that the authors selected LMS 
variables based on their belief that these variables are highly 
correlated with student scores. 


2.1.3 Social Factors in Analytics 

Factors that influence student academic performance have been 
the focus of researchers in LA and EDM domains for many years. 
It still remains an active area of education research, indicating the 
complex problem in measuring and modeling learner processes, 
especially in tertiary education. Positive learning characteristics 
have a significant positive impact on learner engagement 
improvement in multiple ways. The dispositional language 
specifies learning as a combination of self-regulation, learning 
inclinations, motivation, behavioral patterns, interactions, and 
cognitive ability. In their study, Buckingham et al. [30] proposed 
a combination of self-reported data gathered in surveys with 
student interaction data generated by LMS to study individual 
student performance, learning processes, and group interactions. 
These social analytics depend primarily on student self-reported 
data to develop toolkits that support a specific learning type, 
especially in courses with high diversity [31]. However, our study 
focuses on objective identification of student success based on 
data that LMS captures. We will also identify the crucial variables 
from predictive model output for various student groups based on 
their diverse backgrounds (race, gender, and student status). 


2.1.4 Multivariate Analysis to Predict Student Success 

Even though there is a common agreement about the purpose of 
learning analytics, there are still several varying opinions on what 
data needs to be collected and analyzed to improve teaching and 
learning processes. A study by Agudo-Peregrina et al. [32] argued 
that it is highly complex to identify the net contribution of various 
interactions to the learning processes. Their findings show that 
peer interaction between students has a lower influence than 
student-teacher interaction, which contradicts earlier studies that 
showed high importance for student peer interactions. A study by 
Dominquez et al. [33] utilized multiple variables like LMS logins, 
time stamps, and content access flags captured in a biology course 
to predict student grades at the end of the course. The 
results show that the algorithm's predictive accuracy is at 50% in 
subsequent semesters. Lerche and Keil's [34] recent study utilized 
Moodle log data from 369 students enrolled in three online 
courses across three semesters to predict their scores at the end of 
the term for each course. Their regression results related to 
predicting student scores in a course at the end of the semester 
varied from 0.17 to 0.6 for all three courses. This broad range of 
performance across courses is due to varying variables utilized in 
each course based on the course structures. Studying the 
difference in instructional design, variables in extracted data, 
statistical inferences, predictive modeling used, interpreting model 
outcomes and pattern observations, etc., might explain the 
inconsistencies in results shown in earlier studies. 


Data captured by LMS systems became prominent in LA and 
EDM circles as they capture student interactions in non-intrusive 
and ready-to-use settings. Several studies were discussed earlier in 
this research that utilized the LMS data to develop models that 
track student progress. However, it is still challenging to build 
highly accurate models that predict student learning outcomes 
across courses and understand the impact of different variables 
captured by LMS. Another significant gap in earlier research is 
their inability to predict student performance across courses in a 
given semester. One primary issue in predicting student 
performance across a semester is to find methods that aggregate 
student LMS variables across courses. This research shows 
methods to address the research gap found in earlier studies. 


In this study, we approach the problem of tracking student 
achievement by developing student-centric models that build on 
aggregated LMS interaction variables collected across a semester 
irrespective of student year and course. One unique aspect of our 
work is related to the study of model performance on longitudinal 
student data. We develop models that predict student end-of-term 
GPA based on four cumulative periods in a semester. This work 
also focuses on explaining the impact of different aggregated 
LMS variables on various student groups categorized based on 
performance, race, gender, and student type. The importance of 
features is explained by adopting correlation statistics for 
univariate importance, a regression model for interaction effect, 
and LIME for model-based yet model agnostic explanations. 


3. DATA & FEATURE SET 


3.1 Dataset 

For this study, we chose undergraduate student data captured by 
LMS in Fall 2019 from a large public university in the United 
States. These students were part of either Information Systems 
(IS) or Computer Science (CS) departments. The students from 
these departments were chosen as the instruction format and 
courses are closely aligned in both of them. The Blackboard 
system is predominantly used as the LMS to deliver course 
material, assessment, and grading. The student demographic data captured 
by a standalone Student Information System (SIS) is used to 
categorize students based on different demographic variables. A 
total of 1651 students were enrolled in these two departments in 
the Fall 2019 semester. Based on student distribution, we 
categorized students into three ethnicities: White, Asian, and 
Minority. This study also researches student performance based 
on their admit types, such as four-year regular student or transfer 
student. The demographics of the student data are provided in 
Table 1 below. This study was approved by the IRB, and sensitive 
student data was de-identified based on GDPR standards. 


Table 1. Student demographics 


Demographic                                   Student Count 
Total Students (N)                            1651 
No. of unique courses                         440 
No. of unique course-instructor combinations  638 
Male : Female                                 1302 (79%) : 369 (21%) 
White : Asian : Minority                      630 (38%) : 495 (30%) : 526 (32%) 
4-Year : Transfer                             976 (59%) : 675 (41%) 
Full Time : Part Time                         1446 (88%) : 205 (12%) 
IS : CS                                       934 (57%) : 717 (43%) 
1st Yr : 2nd Yr : 3rd Yr : 4th Yr             115 (7%) : 329 (20%) : 515 (31%) : 692 (42%) 
<= 3 : 4-5 : > 5 (Courses enrolled)           298 (18%) : 1035 (63%) : 318 (19%) 


3.2 Feature Extraction 

We explored various LMS features related to student logins, 
content accesses, time spent, discussion posts, assignment 
submissions, and time intervals based on earlier literature. While 
exploring these features, we identified that only three features 
could be commonly extracted from different courses: Student 
Login Counts, Time intervals & prior knowledge. 


One of the significant challenges while building a student-centric 
model on LMS data is to extract aggregated features that are least 
biased. As Blackboard's content is dependent on instructor and 
course, it is crucial to mitigate the variations caused by these 
factors on aggregate student variables. This work employs 
multiple statistical measures to mitigate these issues. The details 
are explained in the below sub-sections. 


3.2.1 Normalized Login Volume 

Earlier studies identified that student performance prediction is 
strongly dependent on the volume of student logins. One 
challenge with counting the student logins in Blackboard is its 
inability to find which course they accessed during each login. 
Also, calculating the total login count introduces a hidden bias as 
courses with more content on Blackboard prompt students to login 
more often than other courses with less content and flexible 
deadlines. To mitigate this issue, our work followed the below 
steps to extract student login features. 


1. Extract all courses enrolled by all students in IS and CS. 


2. Count the total number of logins for all students 
irrespective of their department in these extracted 
courses. 


3. Calculate the Z-scores of student logins in each course. 
The reason for doing this is to mitigate the bias 
introduced by variations in the absolute count of logins 
as course logins vary a lot between students. Z-scores 
provide a value that helps understand if student logins 
are higher or less than average logins in a specific 
course. 


4. Once the z-scores are calculated for all courses, we 
extract a vector of login z-scores for each student based 
on their enrolled courses. 


5. As predictive models do not take vectors of variable 
length as input, this work extracts seven significant 
statistics from the login vector: mean, median, 
minimum, maximum, standard deviation, skewness, and 
kurtosis. 
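

The five steps above can be sketched in a few lines of Python. This is only an illustrative sketch, not the authors' implementation: the function names, the input layout (a dictionary keyed by student and course), the use of population rather than sample standard deviation, and the use of excess kurtosis are all assumptions made here for the example. 

```python
from collections import defaultdict
from statistics import mean, median, pstdev


def login_zscore_vectors(login_counts):
    """Map {(student, course): logins} to {student: [z-score per course]}."""
    per_course = defaultdict(list)
    for (_student, course), count in login_counts.items():
        per_course[course].append(count)
    # Population mean and standard deviation of logins within each course
    # (assumption: the source does not say which estimator was used).
    stats = {c: (mean(v), pstdev(v)) for c, v in per_course.items()}

    vectors = defaultdict(list)
    for (student, course), count in login_counts.items():
        mu, sd = stats[course]
        # If every student in a course has the same count, use z = 0.
        vectors[student].append(0.0 if sd == 0 else (count - mu) / sd)
    return dict(vectors)


def seven_statistics(values):
    """Summarize a variable-length vector with seven fixed statistics."""
    n = len(values)
    mu = mean(values)
    sd = pstdev(values)
    if sd == 0:
        skew, kurt = 0.0, 0.0
    else:
        skew = sum(((v - mu) / sd) ** 3 for v in values) / n
        # Excess kurtosis (a normal distribution gives 0); an assumption.
        kurt = sum(((v - mu) / sd) ** 4 for v in values) / n - 3.0
    return {
        "mean": mu,
        "median": median(values),
        "min": min(values),
        "max": max(values),
        "std": sd,
        "skewness": skew,
        "kurtosis": kurt,
    }
```

Calling `seven_statistics` on each student's z-score vector yields a fixed-length feature row regardless of how many courses the student is enrolled in. 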


3.2.2 Login Regularity 

Apart from student login volumes, the regularity between logins 
also provides valuable insights into student achievement as 
regularity is related to self-regulation capabilities. In this work, 
we utilize an entropy-based method to extract features that define 
student login regularity in each course. In information theory, 
entropy is used to define uncertainty or randomness [48]. The 
entropy measure indicates whether a student's logins are regular 
(less random) or irregular (more random): a high entropy value 
corresponds to an irregular login pattern, and a low entropy 
value corresponds to a regular login pattern. 
The steps to calculate student regularity features are given below. 


1. Extract all course accesses with timestamps for every 
student in IS and CS. 


2. Calculate the difference between timestamps. This 
difference will give a vector of time intervals for each 
course enrolled by a student. 


3. Calculate entropy using the KL estimator with the k- 
nearest neighbor method proposed by Kozachenko and 
Leonenko [45]. KL estimator uses k-nearest neighbor 
distances to compute the entropy of distributions. The 
reason for adopting this method instead of Shannon 
entropy is based on the time interval vector's continuous 
characteristic [46]. 


4. Once the entropies are calculated, we get a vector of 
entropies for each student based on the number of 
enrolled courses. We then calculate the seven statistics 
similar to student logins: mean, median, minimum, 
maximum, standard deviation, skewness, and kurtosis. 
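

As a rough illustration of steps 2 and 3, the sketch below computes login intervals and a one-dimensional Kozachenko-Leonenko entropy estimate in plain Python. It is a sketch under stated assumptions rather than the authors' code: the function names, the choice k = 3, and the small floor applied to zero distances (which arise when two intervals are identical) are assumptions introduced here, since the source does not report them. 

```python
import math

EULER_GAMMA = 0.5772156649015329


def digamma_int(n):
    """Digamma at a positive integer: psi(n) = -gamma + sum_{j=1}^{n-1} 1/j."""
    return -EULER_GAMMA + sum(1.0 / j for j in range(1, n))


def login_intervals(timestamps):
    """Differences between consecutive access times (step 2)."""
    ts = sorted(timestamps)
    return [b - a for a, b in zip(ts, ts[1:])]


def kl_entropy_1d(samples, k=3, floor=1e-12):
    """Kozachenko-Leonenko k-nearest-neighbour entropy estimate (in nats)."""
    n = len(samples)
    if n <= k:
        raise ValueError("need more samples than k")
    log_sum = 0.0
    for i, x in enumerate(samples):
        dists = sorted(abs(x - y) for j, y in enumerate(samples) if j != i)
        # Distance to the k-th nearest neighbour; the floor avoids log(0)
        # when several intervals are identical (an assumption of this sketch).
        rho = max(dists[k - 1], floor)
        log_sum += math.log(rho)
    # In one dimension a ball of radius rho has length 2 * rho.
    return digamma_int(n) - digamma_int(k) + math.log(2.0) + log_sum / n
```

Applying `kl_entropy_1d` to the interval vector of each enrolled course gives one entropy value per course; the seven summary statistics are then taken over that per-student vector as in step 4. Tightly clustered intervals give a lower estimate than widely spread ones. 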


3.2.3 Login Chronotypes 

Studies in chronobiology and chronopsychology showed variation 
in different individual active periods at different times of the day 
[41, 42]. These studies classify an individual into either morning 
type or evening type based on their high activity time. For 
example, if an individual is highly active in the morning 
compared to the evening, they are considered morning type and 
vice versa. Inspired by this work in human psychology, this work 
divides a day into four time bands, T1 (12 AM to 6 AM), T2 (6 
AM to 12 PM), T3 (12 PM to 6 PM), and T4 (6 PM to 12 AM), 
and extracts student logins based on these four time bands. In 
addition, this work also extracts the logins on weekdays 
and weekends to study their influence on student performance. 


1. Count the number of logins during each time band and 
on weekdays and weekends for each course. 


2. Calculate the mean of login count vector for each of 
these time bands and weekday/weekend. 


3. Normalize the login count with the number of courses 
enrolled by an individual student. This normalization 
will mitigate the bias introduced by the number of 
courses enrolled across the student cohort. 
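

The time-band counting can be sketched as follows. This is a minimal illustration, not the authors' pipeline: the function names and the input layout (a dictionary from course to a list of login datetimes) are hypothetical, and the sketch interprets steps 2 and 3 together as taking the per-course mean, that is, dividing the summed counts by the number of enrolled courses. 

```python
from datetime import datetime

BANDS = ("T1", "T2", "T3", "T4")
KEYS = BANDS + ("weekday", "weekend")


def time_band(ts):
    """T1 = 00:00-05:59, T2 = 06:00-11:59, T3 = 12:00-17:59, T4 = 18:00-23:59."""
    return BANDS[ts.hour // 6]


def course_login_profile(timestamps):
    """Count the logins of one course per time band and per day type (step 1)."""
    profile = {key: 0 for key in KEYS}
    for ts in timestamps:
        profile[time_band(ts)] += 1
        # datetime.weekday(): Monday is 0, Saturday is 5, Sunday is 6.
        profile["weekend" if ts.weekday() >= 5 else "weekday"] += 1
    return profile


def chronotype_features(logins_by_course):
    """Average the per-course counts over the student's enrolled courses."""
    n_courses = len(logins_by_course)
    if n_courses == 0:
        return {key: 0.0 for key in KEYS}
    profiles = [course_login_profile(ts) for ts in logins_by_course.values()]
    return {key: sum(p[key] for p in profiles) / n_courses for key in KEYS}
```

For a student with two courses, one accessed twice on a Monday and one accessed twice over a weekend, `chronotype_features` returns an average of one weekday login and one weekend login per course. 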


This work also utilizes the demographic features and the prior 
performance (measured by GPA) captured by the SIS system. These 
features are listed in Table 2 below. 


Table 2. Student demographic features 


Demographic                     Values 
Start GPA (Prior Performance)   Cumulative GPA available till the prior semester 
Gender                          Male & Female 
Ethnicity                       White, Asian & Minority 
Student Year                    Freshman, Sophomore, Junior & Senior 
Admit Type                      Regular & Transfer 
Enrollment Type                 Full time & Part time 
Student Age                     Continuous variable 


4. METHODOLOGY 

This section details the predictive modeling approach used to 
predict student end-of-term GPA in Fall 2019. We also describe 
the correlation-based LIME method used to explain the features 
that contribute to model predictions. The workflow for 
developing student-centric models is depicted in Figure 1. 


4.1 Predictive Modeling 


This work studied five of the most common regression models for 
comparison purposes. The selected models include Generalized 
Linear Model (GLM), Decision Tree (DT), Support Vector 
Regressor (SVR), Random Forest (RF), and Gradient Boosted 
Tree (GBT). As model hyperparameters influence predictive 
performance, we utilized a grid search mechanism to select the 
parameter values that predict with high accuracy. In addition to 
the hyperparameter search, we adopted a feature selection method 
based on a multi-objective evolutionary algorithm. This feature 
selection algorithm evaluates each feature set based on Pareto 
optimality, balancing model complexity and accuracy. The 
details of the models and hyperparameter search criteria are 
discussed below. 


Generalized Linear Model: GLM is an extension of traditional 
linear models that fits input data by maximizing the log- 
likelihood. The regularization parameter is set so that the 
hyperparameter search space looks for an alpha value that fits 
between ridge and lasso regression. An alpha value of 1 represents 
lasso regression, and an alpha value of 0 represents ridge 
regression. This study searched for the best alpha value using a 
grid search between 0 and 1 in increments of 0.1. 
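

To make the role of alpha concrete, the sketch below writes out a commonly used elastic-net penalty and the eleven-point alpha grid described above. The exact penalty parameterization used by the authors' tool is not stated in the text, so the formula, the function name, and the `lam` (overall regularization strength) parameter are assumptions for illustration only. 

```python
def elastic_net_penalty(coefficients, alpha, lam=1.0):
    """Penalty lam * (alpha * L1 + (1 - alpha) / 2 * L2^2) on the coefficients.

    alpha = 1 gives the lasso penalty, alpha = 0 gives the ridge penalty.
    This parameterization is a common convention, assumed here.
    """
    l1 = sum(abs(b) for b in coefficients)
    l2_squared = sum(b * b for b in coefficients)
    return lam * (alpha * l1 + 0.5 * (1.0 - alpha) * l2_squared)


# Grid from 0 to 1 in increments of 0.1, as described in the text.
ALPHA_GRID = [round(0.1 * step, 1) for step in range(11)]
```

Intermediate alpha values blend the two penalties, which is why the grid search can settle anywhere between pure ridge and pure lasso behavior. 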


Decision Tree: The decision tree algorithm is a collection of 
linked nodes intended to estimate the numerical target variable. 
Each node in the tree represents a rule used to split on an attribute 
value. Each split uses a least-squares criterion that minimizes the 
squared distance between the average value in a node and the 
actual values. The hyperparameter search space 
for this algorithm evaluates both maximal depth and pruning. The 
maximal depth value varies between 1 and 100 in increments of 
10. Pruning makes the DT algorithm use multiple criteria such as 
minimal gain, minimal leaf size, and pruning alternatives to 
decide the stopping criterion. 


Figure 1: Student-centric Model Workflow 


Support Vector Regressor: The SVR used in this study is built 
on Stefan Rüping's mySVM [47]. This algorithm constructs 
a set of hyperplanes in a high dimensional space for 
regression tasks. A good hyperplane is decided based on the 
functional margin. The hyperparameter search space covered 
both dot and radial kernel functions with a C (SVM complexity) 
value ranging between 10 and 200. The kernel gamma parameter 
is set for the radial kernel with a range between 0.005 and 5 in 
three logarithmic increments. 


Random Forest: An RF model builds an ensemble of decision trees 
on bootstrapped datasets. The splitting criteria are similar to a 
decision tree. For each tree, the regression outcome is the average 
GPA of the training samples present in the reached end node. We 
only tuned the number of trees to reduce the time complexity 
of the execution. The number of trees searched varied between 10 
and 1000 in 10 linear steps. 


Gradient Boosted Tree: The GBT model builds multiple 
regression trees in a sequence by employing a boosting method. By 
sequentially applying weak learners on incrementally changed 
data, the algorithm builds a series of decision trees that produce 
an ensemble of weak regression models. As GBT is a non- 
linear model, we search hyperparameters related to the number of 
trees, learning rate, and maximal depth. The number of trees 
varies between 1 and 1000 in five quadratic increments, the 
learning rate varies between 0.001 and 0.01 in five logarithmic 
increments, and the maximal depth parameter varies between 3 
and 15 in three logarithmic increments. 


4.2 LIME Explanation 


The concept of Local Interpretable Model-agnostic Explanations 
(LIME) was introduced to explain the predictions made by 
black-box models that deal with classification problems. LIME 
explains each prediction made by a complex model by training a 
surrogate model locally [35]. However, this earlier methodology 
does not scale well to categorical variables, tabular data, and 
regression problems. In this work, we adopt the correlation-based 
LIME method available in RapidMiner to explain machine 
learning models' predictions [36, 37, 38]. 


1. Perturb data in the neighborhood of each sample in the 
dataset. The number of simulated samples can be user- 
defined. A higher number of simulated samples will 
provide higher accuracy of explanations but at the cost 
of more run times. 


2. Make predictions using the ML model for all the 
simulated samples around each original sample in the 
dataset. 


3. Calculate the correlation between each feature of the 
simulated samples and the corresponding model predictions. 


4. The features that have a positive correlation are 
considered supporting features, and features with 
negative correlation with predicted outputs are referred 
to as contradicting features. 
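

The four steps can be sketched for a single sample as follows. This is a simplified illustration and not RapidMiner's implementation: the function names, the Gaussian perturbation with user-supplied per-feature scales, and the default of 200 simulated samples are assumptions chosen for the example. 

```python
import math
import random


def pearson(xs, ys):
    """Pearson correlation; returns 0.0 if either input is constant."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    if sxx == 0 or syy == 0:
        return 0.0
    return sxy / math.sqrt(sxx * syy)


def local_explanation(predict, sample, scales, n_sim=200, rng=None):
    """Correlation-based local weights for one sample.

    predict: callable mapping a feature list to a prediction.
    scales:  per-feature perturbation widths (assumed Gaussian noise).
    """
    rng = rng or random.Random(0)
    # Step 1: simulate samples in the neighbourhood of the original one.
    neighbours = [
        [x + rng.gauss(0.0, s) for x, s in zip(sample, scales)]
        for _ in range(n_sim)
    ]
    # Step 2: predict every simulated sample with the trained model.
    predictions = [predict(row) for row in neighbours]
    # Step 3: correlate each feature with the predictions.
    # Step 4: positive weights support the prediction, negative ones
    # contradict it.
    return [
        pearson([row[j] for row in neighbours], predictions)
        for j in range(len(sample))
    ]
```

With a toy model such as `lambda row: 2.0 * row[0] - 3.0 * row[1]`, the first feature receives a positive (supporting) weight and the second a negative (contradicting) weight. 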


As LIME provides a feature importance value for each feature at 
each sample, we aggregate the importance values over all samples 
to build a global importance for each variable. The significant 
advantage of this method compared to traditional global 
importance methods is its flexibility. Whereas traditional global 
importances are calculated across all samples in the data, the 
LIME-based feature importances can also be calculated for 
subsets of the data. This flexibility provides users with a deeper 
understanding of each feature's role for the different populations 
present in a dataset. 
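

A minimal sketch of this aggregation is shown below. The source does not state which aggregate is used, so taking the mean of the local weights is an assumption, as are the function name and the input layout (one dictionary of feature weights per student, with an optional aligned list of group labels). 

```python
from collections import defaultdict


def aggregate_importance(local_weights, groups=None):
    """Average local feature weights overall or within each group.

    local_weights: list of {feature: weight}, one dict per sample.
    groups: optional list of labels aligned with local_weights; when
            omitted, every sample falls into the single group "all".
    """
    labels = groups if groups is not None else ["all"] * len(local_weights)
    sums = defaultdict(lambda: defaultdict(float))
    counts = defaultdict(int)
    for weights, label in zip(local_weights, labels):
        counts[label] += 1
        for feature, weight in weights.items():
            sums[label][feature] += weight
    return {
        label: {f: total / counts[label] for f, total in feats.items()}
        for label, feats in sums.items()
    }
```

Passing a demographic label per student (for example admit type) as `groups` gives one importance profile per subgroup, while omitting it gives the global profile. 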


In addition to applying the LIME methodology, this work also 
studies univariate and multivariate feature importance on student 
performances by applying correlation and linear regression 
methods. The student dataset used in this study is divided into 
multiple subsets containing different student groups based on 
various demographics. A correlation value is calculated between 
input features and student end-of-term GPA. This value provides 
us with an intuition about the impact of various features on 
student performances related to different demographics. As 
correlation only provides independent variable importance on 
student performance, we also adopt a linear regression model to 
explore the variation of feature importance based on coefficient 
values. Applying a linear regression model will also consider the 
interaction effect between input features to fit the outcome 
variable. 


5. RESULTS 


This results section is divided into three subsections based on the 
three research questions we are focusing on in this study. The first 
subsection will detail various predictive models' performance on 
longitudinal student interaction data collected during the fall 2019 
semester. The second subsection will detail the importance of 
student logins and regularity on performance predictions based on 
LIME methodology. The final subsection will discuss the 
importance of input features based on correlation and regression 
methods. 


5.1 How do different student-centric machine 
learning models perform in predicting student 
end-of-term GPA? 


The five machine learning models adopted in this study were 
evaluated using five-fold cross-validation. In this method, the 
student data is divided into five equal folds at the student 
level. In every iteration, four of the five folds are used for 
model training, and one fold is used for model testing. The 
machine learning models are evaluated based on two performance 
metrics: R squared (R^2) and Root Mean Squared Error (RMSE). 
The reported performance metrics are the averages over the five 
test folds. 
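
As a rough sketch of this evaluation protocol (not the authors' code), the snippet below splits students into five folds so that all rows of one student stay in the same fold, and averages R^2 and RMSE over the test folds. The `fit` and `predict` callables and the row layout `(student_id, features, gpa)` are hypothetical placeholders for whichever model is being evaluated.

```python
import math
import random


def student_folds(student_ids, k=5, seed=0):
    """Split unique student ids into k disjoint folds of near-equal size."""
    unique = sorted(set(student_ids))
    random.Random(seed).shuffle(unique)
    return [set(unique[i::k]) for i in range(k)]


def r_squared(y_true, y_pred):
    """Coefficient of determination; assumes y_true is not constant."""
    mean_y = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_y) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot


def rmse(y_true, y_pred):
    """Root mean squared error."""
    sq = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    return math.sqrt(sq / len(y_true))


def cross_validate(rows, fit, predict, k=5, seed=0):
    """rows: list of (student_id, features, gpa). Returns (mean R^2, mean RMSE)."""
    folds = student_folds([r[0] for r in rows], k, seed)
    scores = []
    for held_out in folds:
        train = [r for r in rows if r[0] not in held_out]
        test = [r for r in rows if r[0] in held_out]
        model = fit([r[1] for r in train], [r[2] for r in train])
        preds = [predict(model, r[1]) for r in test]
        truth = [r[2] for r in test]
        scores.append((r_squared(truth, preds), rmse(truth, preds)))
    return (sum(s[0] for s in scores) / k, sum(s[1] for s in scores) / k)


# Usage with a trivial mean-GPA baseline on made-up data.
rows = [(s, None, 2.0 + 0.1 * s) for s in range(10)]
avg_r2, avg_rmse = cross_validate(
    rows,
    fit=lambda X, y: sum(y) / len(y),   # "model" is just the training mean
    predict=lambda model, x: model,
)
```

Splitting on student ids rather than on rows keeps a student's data from leaking between training and test folds.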


Figure 2. Comparison of GBT model performance (R squared) on 
different longitudinal datasets 


In this study, we divided a semester into four parts to understand 
the impact of longitudinal interaction data across the semester on 
predictive model performance. Evaluating the models on these four 
cumulative datasets helps determine the amount of student data 
needed to balance predictive performance against early detection 
for interventions. Tables 3, 4, 5, and 6 present the machine 
learning models' results evaluated on the four cumulative 
datasets. While differentiating student performance based on 
multiple longitudinal datasets, we also study the algorithms' 
performance without freshman student data. This differentiation is 
to study the impact of missing start GPA feature values for 
first-year students, as most full-time regular students in US 
universities start in the Fall semester. 


Table 3. Student features from start to end of first month 


Model   R^2                    RMSE 
        All        Except      All        Except 
        Students   Freshman    Students   Freshman 
GLM     0.213      0.249       0.657      0.633 
DT      0.266      0.270       0.638      0.633 
SVM     0.216      0.324       0.666      0.607 
RF      0.332      0.353       0.607      0.588 
GBT     0.338      0.362       0.602      0.581 


Table 4. Student features from start to middle of semester 


Model   R^2                    RMSE 
        All        Except      All        Except 
        Students   Freshman    Students   Freshman 
GLM     0.257      0.266       0.67       0.628 
DT      0.263      0.295       0.67       0.618 
SVM     0.195      0.315       0.705      0.609 
RF      0.360      0.352       0.621      0.591 
GBT     0.362      0.361       0.622      0.586 


Table 5. Student features from start to end of third month 


Model   R^2                    RMSE 
        All        Except      All        Except 
        Students   Freshman    Students   Freshman 
GLM     0.25       0.266       0.644      0.626 
DT      0.255      0.255       0.658      0.650 
SVM     0.335      0.344       0.612      0.597 
RF      0.371      0.386       0.589      0.575 
GBT     0.374      0.386       0.588      0.572 


Table 6. Student features from start to end of semester 


Model   R^2                    RMSE 
        All        Except      All        Except 
        Students   Freshman    Students   Freshman 
GLM     0.251      0.269       0.644      0.625 
DT      0.246      0.274       0.657      0.641 
SVM     0.320      0.289       0.616      0.627 
RF      0.387      0.410       0.585      0.564 
GBT     0.400      0.406       0.575      0.562 


From the above tables, we observe that the GBT model generally 
performed better than the other four models based on the tradeoff 
between R squared and RMSE values (RF is marginally ahead in a few 
cells). We also observe that there is no significant difference in 
student end-of-term GPA prediction with and without freshman 
details. This might be due to the small sample size (7%) of the 
freshman cohort. From figure 2, it is also evident that there is a 
gradual increase in the performance of the GBT model as we add 
data to the predictive models as the semester progresses. Even 
though performance increases if we add all data captured during 
the semester, this does not help much for real-world 
interventions, as the activities that affect student performance 
will already be completed by the end of the semester. Based on 
this understanding, we focus on data captured until the middle of 
the semester for the feature importance study. 


Figure 3. Comparison of GBT model performance (R squared score, 
all students) on different input feature sets: Start GPA; Start 
GPA + Norm. Login Volume; Start GPA + Login Regularity; Start GPA 
+ Login Chronotypes; Start GPA + Norm. Login Volume + Login 
Chronotypes; Start GPA + Norm. Login Volume + Login Chronotypes + 
Login Regularity. 


5.2 How do student login and time interval 
patterns across courses influence student 
learning outcomes? 

To answer research question 2, we adopted a stepwise feature 
addition study that adds features to the model one by one and 
evaluates the performance based on R squared and RMSE values. 
This study is performed on student data collected until the middle 
of the semester, as models developed at this stage help identify 
student-level indicators and leave enough time to deploy 
interventions that improve student performance. We first start 
with student Start GPA (cumulative GPA until the start of the Fall 
2019 semester), as start GPA showed a high correlation with 
end-of-term GPA in our preliminary analysis. We then add 
normalized login volumes, login regularity, and login chronotypes 
step by step. Figure 3 shows the R squared performance metric of 
student-centric models with different input variables. 


From figure 3, we observe that combining student start GPA with 
normalized student login volumes across courses adds more 
predictive power to machine learning models. This observation is 
also supported by earlier studies [39, 40] that showed the 
importance of student login counts for student course grade and 
score predictions. Another observation is related to the importance 
of adding student self-regulation capability based on login 
regularity measured using the entropy statistic. Based on figure 3, 
we observe that adding login regularity features to student login 
features and start GPA adds slightly more predictive power 
compared to a model with only login regularity and start GPA 
features. In addition to these observations, we also observed that 
login counts based on login chronotypes with start GPA did not 
add much predictive power to machine learning models. From 
these results, we also infer that student aggregated login volumes 
might be adding the same information as login chronotypes. 
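
To make the regularity measure concrete, the sketch below computes a Shannon entropy over the time bins in which a student logs in. It is only an illustration: the binning (weekday labels here) and the function name are assumptions, and the exact feature definition used in this study is given in the methodology rather than in this section.

```python
import math
from collections import Counter


def login_entropy(login_bins):
    """Shannon entropy (in bits) of a student's logins over time bins.

    login_bins: one label per login, e.g. the weekday or hour slot
    (the choice of bins is an assumption of this sketch).
    Lower entropy means logins concentrate in a few bins (more regular).
    """
    counts = Counter(login_bins)
    total = sum(counts.values())
    return -sum((c / total) * math.log2(c / total) for c in counts.values())


regular = login_entropy(["mon"] * 8)                        # always the same day
scattered = login_entropy(["mon", "tue", "wed", "thu"] * 2)  # spread evenly
```

Under this reading, a student who always logs in at the same time has entropy 0, while logins spread evenly over four bins give 2 bits.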


5.3 Is there significant variability in feature 
importance for students coming from diverse 
demographics? 

One limitation of using the earlier mentioned model-based feature 
importance study is its inability to explain each feature's 
importance on different student cohorts. To address this issue and 
understand the importance of login volumes and regularity 
features on different student groups, we adopt three approaches: 
one based on LIME, the second based on correlation analysis, and 
the third based on linear regression. 


5.3.1 LIME-based importances 

The LIME-based approach extracts feature importance at the local 
level, a property also called local fidelity. By applying the LIME 
method explained in the methodology section, we extract feature 
importances for different student groups categorized based on 
their demographics. 


From figure 4, we can observe that cumulative student GPA at the 
start of the semester is an important feature for predicting 
student end-of-term GPA. Student login volumes are the second most 
important feature set for model predictions across the different 
student demographics. This study's focus is also on student 
self-regulation capability measured by the regularity of logins 
(entropy). We observe that for students with GPA values less than 
2, the regularity of logins played a key role compared to students 
with a higher GPA. This observation also holds for students from 
minority ethnic groups. These observations suggest that 
introducing teaching practices that guide LMS use and time 
management could have a significant impact on students with low 
GPA and students from minority ethnic groups. Start GPA played a 
slightly less significant role for transfer students than for 
regular students, as transfer students join in different years and 
their cumulative GPA might not be available at the start of the 
semester, similar to freshmen. 
Figure 4. LIME importances for different student groups divided based on GPA, ethnicity, admit type and gender. 


Even though there is a huge imbalance in the number of male and 
female students present in the dataset, we do not observe any 
significant difference in feature importances between these two 
genders. One limitation of the LIME method is related to global 
importances. The importances shown by LIME at the local level do 
not necessarily correspond to global importances. Because of this 
limitation, we can infer which features are essential for 
different student groups but cannot quantify them, as the 
importances calculated in this study are aggregates of the 
importances provided by LIME for each individual student. 


5.3.2 Correlation-based Feature Importances 

As the earlier feature importance methods showed a significant 
impact of login volumes and login regularity (measured by the 
entropy statistic) on predicting student performance, we adopt the 
Pearson correlation statistic to examine this relationship for 
different student groups. To do this, we create subsets of student 
data based on different groups: student GPA, gender, ethnicity, 
and admit type. 
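
The per-group correlation analysis can be sketched as below. This is an illustrative outline under assumptions: the record layout, the field names (`entropy`, `gpa`, `gpa_band`), and the numbers are hypothetical and not taken from the study's dataset.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient; assumes neither input is constant."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)


def correlation_by_group(records, feature, target, group_key):
    """Correlate one feature with the target separately within each group."""
    groups = {}
    for rec in records:
        groups.setdefault(rec[group_key], []).append(rec)
    return {
        g: pearson([r[feature] for r in rs], [r[target] for r in rs])
        for g, rs in groups.items()
    }


# Made-up records for illustration only.
records = [
    {"gpa_band": "low", "entropy": 1.0, "gpa": 1.8},
    {"gpa_band": "low", "entropy": 2.0, "gpa": 1.5},
    {"gpa_band": "low", "entropy": 3.0, "gpa": 1.2},
    {"gpa_band": "high", "entropy": 1.0, "gpa": 3.0},
    {"gpa_band": "high", "entropy": 2.0, "gpa": 3.5},
    {"gpa_band": "high", "entropy": 3.0, "gpa": 3.2},
]
by_band = correlation_by_group(records, "entropy", "gpa", "gpa_band")
```

Comparing the values across groups, rather than reading any single coefficient in isolation, mirrors how the group-level results are interpreted in this section.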


From figure 5, we observe that student login counts and 
regularity in logins are highly significant for students with a 
GPA lower than 2. We can also observe that as the entropy 
increases, the GPA decreases. This is plausible, as regularity in 
student logins reflects their self-regulation capabilities, and 
earlier research showed that students with good self-regulation 
capabilities perform better in class [49, 50]. For other student 
groups divided based on gender and admit type, there is no 
significant variation in the importance of logins and entropy for 
student performance. 


Even though the absolute values of correlation observed in figure 
5 are not very strong, the comparison between different groups 
helps understand which features are significant for students from 
different demographics. In addition to this, we also observe a 
similar pattern in the LIME-based importances discussed in earlier 
sections. This agreement suggests that the LIME-based method also 
scales reasonably well to global feature importance in this study. 


Figure 5. Correlation values for different student groups 
divided based on GPA, ethnicity, admit type and gender. 


5.3.3 Regression Modeling for Feature Importance 

One significant limitation of the earlier methods is their 
inability to capture interaction effects, as feature importance 
might change in the presence of other features. To study the 
interaction effects, we apply a linear regression model to 
different categories of student login data collected until the 
middle of the semester. These student categories were divided 
based on GPA, gender, admit type and ethnicity of students. Even 
though the linear regression models are applied to all features 
discussed in earlier sections, we only report the coefficients of 
median login volume and mean login regularity in table 7, as these 
variables are the focus of this study. From table 7, we observe 
that the login volume and login regularity features follow a 
similar direction for students with lower GPA and students from 
minority ethnic backgrounds as observed in the LIME and 
correlation-based analyses. There are some discrepancies in other 
observations, as there is no statistical significance (high p 
values) for the coefficients in these cases. Another reason for 
focusing on students from these two groups is their higher 
attrition rates found in earlier studies [43, 44]. Studying these 
groups closely will help develop targeted interventions in the 
future. 


Table 7. Regression coefficients (Significance marked with *) 


Demographic   Student Groups   Median Student        Mean Login 
                               Logins Coefficient    Regularity Coefficient 
GPA           GPA <= 2          0.171*               -0.398* 
              2 < GPA <= 3      0.013                 0.200 
              GPA > 3           0.065                -0.002 
Gender        Male              0.135                 0.130 
              Female           -0.021                -0.157 
Admit Type    Regular           0.399                -0.004 
              Transfer          0.611                 0.191 
Ethnicity     White             0.204                -0.029 
              Asian            -0.085                 0.201 
              Minority Race     0.201*               -0.153* 

6. DISCUSSION & CONCLUSION 


There is a growing interest in building models that capture student 
behavioral patterns while using LMS systems to predict their 
performance. Earlier research showed that building efficient 
models based on LMS data to predict student performances is not 
a simple task as multiple learning and demographic factors impact 
student learning processes. Although earlier research in EDM and 
LA tried to address different issues related to student performance 
tracking, there is still a gap in developing models that accurately 
predict overall student performance and explain underlying 
factors that improve their academic performance. As a step in this 
direction, this study presents a student-centric modeling approach 
based on aggregated LMS features to predict and explain the 
reasons behind varying student performances. This context is both 
relevant and timely given the increase of LMS adoption and a 
need for efficient and interpretable model development. 


6.1 Key Contributions 


One primary contribution of this study is the development of 
student-centric models on aggregated student LMS login data that 
are minimally biased towards the diverse course contents and 
instructor teaching styles. Using the feature extraction methods 
developed in this study, we were able to build an efficient GBT 
model that predicts student end-of-term GPA with an average R 
squared of 0.37 across the semester. Furthermore, models built at 
different durations of a semester showed only slight improvement 
in predictive performance after crossing a specific duration 
(middle of the semester). This observation supports developing 
models in the middle of the semester to estimate student 
performance with little loss in accuracy. 
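
The reported average can be checked against the GBT "All Students" R squared values in Tables 3 to 6. The short calculation below is only a worked check of that arithmetic; the variable names are ours.

```python
# GBT R squared for all students, read from Tables 3, 4, 5, and 6.
gbt_r2_all_students = {
    "end of first month": 0.338,
    "middle of semester": 0.362,
    "end of third month": 0.374,
    "end of semester": 0.400,
}

values = list(gbt_r2_all_students.values())
average_r2 = sum(values) / len(values)

# Gain from the mid-semester model to the end-of-semester model.
gain_after_midpoint = values[-1] - values[1]
```

The mean of the four values is about 0.37, and moving from the mid-semester to the end-of-semester dataset adds roughly 0.04 in R squared.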


In addition to developing student-centric models, this study also 
focused on understanding the impact of various LMS features on 
student performance. Earlier studies in this domain primarily 
focused on the volume of logins. In this work, we also studied the 
impact of login regularity, measured by entropy statistics, on 
student performance by implementing LIME explanation, 
correlation, and linear regression methods. From our 
interpretation studies, we observed that logging in regularly to 
the LMS is positively related to student performance. This 
observation is highly significant for underperforming students 
(GPA < 2) and students from minority ethnic groups. 


We also found no significant difference in the impact of LMS 
features on Male and Female students. This observation is valid as 
LMS features used in this study are captured objectively rather 
than subjectively. This observation also holds for regular and 
transfer students. 


Our study also extracted student interaction features based on 
concepts in chronobiology and chronopsychology to understand if 
there is a student performance variation based on different 
chronotypes. From the results, we observed no significant 
difference in performance. The impact of these features is 
negligible in the presence of aggregated student login volume. 


6.2 Applications & Limitations 

Student performance tracking is a complex process as it depends 
on multiple dimensions and facets. Developing student-centric 
models to predict student performance helps student counselors 
and educational administrators design student-level 
interventions that attract students' attention. Also, developing 
predictive models that estimate students' overall performance in 
the middle of the semester can make students aware of their 
predicted end-of-term performance. These predictions might act 
as an external intervention to improve their performance in the 
remaining part of the semester. By understanding the difference in 
the impact of LMS features on students from different 
demographics, researchers and administrators can build more 
personalized instructional methods that are suitable for diverse 
student cohorts. 


There were also some limitations in this study. The predictive 
performance achieved by using aggregate features across the 
different courses in which students are enrolled is moderate at 
best, and ways to improve the performance of these models should 
be explored. One possibility is to add other features that target 
independent content access durations, mid-semester assessments, 
and other external factors. One major challenge that needs to be 
addressed in our future studies is to find an effective method to 
aggregate content-level features across the different courses in 
which a student is enrolled. The dataset used in this study was 
extracted from a single semester and covers students from two 
closely related departments. To understand whether the findings 
in this study generalize to other undergraduate students, we will 
extend these models to students from various departments in the 
university. 


To conclude, we built student-centric models to predict student 
performance that support the development of student-level 
interventions. We then used the LIME explanations to study the 
importance of LMS features for student performance prediction. 
Finally, we studied the univariate and multivariate feature 
importances using correlation and regression methods and compared 
them with the feature importances extracted by the LIME method. 
ABSTRACT 


Access to high-quality education at scale is limited by the 
difficulty of providing student feedback on open-ended 
assignments in structured domains like programming, graphics, 
and short response questions. This problem has proven to be 
exceptionally difficult: for humans, it requires large amounts 
of manual work, and for computers, until recently, achieving 
anything near human-level accuracy has been unattainable. In 
this paper, we present generative grading: a novel 
computational approach for providing feedback at scale that 
is capable of accurately grading student work and providing 
nuanced, interpretable feedback. Our approach uses generative 
descriptions of student cognition, written as probabilistic 
programs, to synthesise millions of labelled example 
solutions to a problem; we then learn to infer feedback for 
real student solutions based on this cognitive model. 


We apply our methods to three settings. In block-based cod- 
ing, we achieve a 50% improvement upon the previous best 
results for feedback, exceeding human-level accuracy. In two 
other widely different domains—graphical tasks and short 
text answers—we achieve improvements over the previous 
state of the art by about 4x and 1.5x respectively, approach- 
ing human accuracy. In a real classroom, we ran an exper- 
iment with our system to augment human graders, yielding 
doubled grading accuracy while halving grading time. 


Keywords 
Generative models, automated feedback, Idea2Text, proba- 
bilistic programs, grammars, Zipf distribution, zero shot. 


1. INTRODUCTION 


Enabling global access to high-quality education is a 
long-standing challenge. The combined effect of increasing costs 
per student [3] and rising demand for higher education makes 
this issue particularly pressing. A major barrier to providing 
quality education has been the difficulty of automatically 
providing meaningful and accurate feedback on student work. 


Figure 1: Students solve problems by making decisions that 
result in their final solution (generative model). Providing feedback 
requires the reverse task of seeing a solution and inferring the 
student decisions that led to this solution (inference model). 


Learning to provide feedback on richly structured problems 
beyond simple multiple-choice has proven to be a hard machine 
learning problem. Five issues have emerged, many of 
which are typical of human-centred AI problems: (1) student 
work is extremely diverse, exhibiting a heavy-tailed 
distribution where most solutions are rare, (2) student work 
is difficult and expensive to label with fine-grained feedback, 
(3) we want to provide feedback (without historical data) for 
even the very first student, (4) grading is a precision-critical 
domain with a high cost to misgrading, and (5) predictions 
must be explainable and justifiable to instructors and 
students. Despite extensive research using massive education 
data [14], these issues make traditional supervised learning 
inadequate for automatic feedback. 


Human instructors are experts at providing feedback. When 
grading assignments, they have an understanding of the de- 
cisions and missteps students might make when solving a 
problem, and what corresponding solutions these choices 
would result in. For example, an instructor understands 
that a student wanting to repeat something in a program- 
ming assignment might use a for loop or manually write 
out repeated statements. And given that the student uses a 
loop, their loop could be correct or off-by-one. 
In essence, instructors mentally possess generative models 
of student decision-making and how these decisions manifest 
in a final solution (Fig. 1 forwards). When providing 
feedback, the instructor does inference: given a student 
solution, they use their mental model to try and determine 
the underlying student decisions that could have resulted in 
this solution (Fig. 1 backwards). 
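

The forward and backward directions can be illustrated with a toy simulator built on the loop example above. This is a minimal sketch, not the paper's simulator format or inference method: the decision names, probabilities, and solution strings are invented, and the "inference" here is a plain lookup over sampled solutions rather than a learned model.

```python
import random


def simulate_student(rng):
    """Forward direction: sample decisions, then emit the resulting solution."""
    decisions = {}
    # Hypothetical choice probabilities, chosen only for illustration.
    decisions["strategy"] = rng.choices(["loop", "manual"], weights=[0.7, 0.3])[0]
    if decisions["strategy"] == "loop":
        decisions["bounds"] = rng.choices(
            ["correct", "off_by_one"], weights=[0.8, 0.2]
        )[0]
        n = 3 if decisions["bounds"] == "correct" else 4
        solution = f"for i in range({n}): move()"
    else:
        solution = "move(); move(); move()"
    return decisions, solution


def build_inverse(num_samples=1000, seed=0):
    """Backward direction (toy): map each sampled solution to its decisions."""
    rng = random.Random(seed)
    table = {}
    for _ in range(num_samples):
        decisions, solution = simulate_student(rng)
        table.setdefault(solution, decisions)
    return table


inverse = build_inverse()
feedback = inverse["for i in range(4): move()"]
```

A lookup table only covers solutions the simulator happened to produce, which is one reason a learned inference model is needed for realistic, heavy-tailed solution spaces.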


In this paper, we propose an automated feedback system 
that mimics the instructor process as closely as possible. 
Firstly, the system elicits a literal generative model from 
an instructor in the form of a concrete student simulator. 
Secondly, it uses deep neural networks and a novel inference 
method with this simulator to learn how to do inference on 
student solutions, without using any labelled data. Finally, 
the inference model is used to provide automated feedback 
to real student solutions. We call this end-to-end approach 
generative grading. 


When used across a spectrum of public education data sets, 
our automated feedback system is able to grade student work 
with close to expert human-level fidelity. In block-based cod- 
ing, we exceed human-level accuracy, achieving a 50% im- 
provement over the previous best results for feedback. In two 
other widely different domains—graphical tasks and short 
text answers—we achieve improvements over the previous 
state of the art by about 4x and 1.5x respectively, approach- 
ing human accuracy. We used our system in a real classroom 
to augment human graders in a CS1 class, yielding doubled 
grading accuracy while halving grading time. 


1.1 Main contributions 

In Sec. 4 we present an easy-to-use and highly expressive 
class of generative models called Idea2Text simulators that 
allow an instructor to encode their mental models of student 
decision-making. These simulators can succinctly express 
student decisions and how these decisions manifest in a 
final solution for a broad set of problem domains like graphics 
programming, short-answer questions, and introductory 
programming in Java. We provide a Python implementation 
that allows any instructor to easily write these simulators. 

In Sec. we show how to use Idea2Text simulators with 
deep neural networks to infer students’ decision processes 
from their solutions. This extracted decision process is a 
general representation of a student’s solution and can be 
used for several downstream tasks such as providing auto- 
mated feedback, assisting human grading, auditing and in- 
terpreting the model decisions, and improving the quality of 
the simulator itself. 


In order to do inference successfully on our expressive class 
of simulators, we must overcome several interesting technical 
challenges (Sec. 5). Learning to map solutions to sequences 
of decisions specified by the simulator is a nonstandard machine 
learning task, with non-fixed labels, varied sequence 
lengths, and unexpected trajectories. Moreover, generating 
simulated training data from the simulators requires an 
intelligent sampling method to work effectively. 


All code is publicly available at: 
https://github.com/malik-ali/generative-grading 


In Sec. 6 we show the efficacy of our approach in practice on 
a diverse set of richly structured problems. We attain close 
to human-level accuracy on providing feedback and surpass 
many previous state of the art results. We also discuss several 
interesting extensions in Sec. 7 that use our system to 
go beyond just providing automated feedback. 


The generative grading system is powerful because it ad- 
dresses many of the issues of traditional supervised learning 
mentioned above. We find that the cost of writing simula- 
tors for a new assignment is orders of magnitude cheaper 
for instructors than manually annotating individual student 
work. The simulators allow us to sample infinite data, and 
our adaptive sampling strategy lets us explore diverse stu- 
dent solutions in our training data. It is “zero-shot”, requir- 
ing no historical data nor annotation, and thus works for 
the very first student. Moreover, our novel inference system 
allows for interpretable and explainable decisions. 


2. RELATED WORK 

“Rubric sampling” first introduced the concept of encoding 
expert priors in grammars of student decisions, and was 
the inspiration for our work. The authors design Probabilistic 
Context Free Grammars (PCFGs) to curate synthetically 
labelled datasets to train supervised classifiers for feedback. 
Our approach builds on this, but presents a more expressive 
family of generative models of student decision-making that 
are context sensitive and come with new innovations that 
enable effective inference. From our results on Code.org, we 
see that this expressivity is responsible for significant 
improvements in our model’s performance. Furthermore, this 
prior work only used PCFGs to create simulated datasets 
of feedback labels for supervised learning. In contrast, we 
learn to infer the entire decision trajectory of a student 
solution, allowing us to do things like dense feedback and 
human-in-the-loop grading. 


We draw theoretical inspiration for our generative grading 
system from Brown’s “Repair Theory” which argues that the 
best way to help students is to understand the generative 
origins of their mistakes [4]. Building systems of student 
cognition has been used in K-12 arithmetic problems 
and subtraction mistakes [8]. 


Automated feedback for open-ended richly structured problems 
has been studied through a few lenses. In many approaches, 
traditional supervised learning is employed to map 
solutions to feedback [21] [30] [1] [33]. These methods require 
large hand-labelled datasets of diverse student solutions, 
which are difficult to collect due to heavy-tailed distributions. 
Feedback specific to computer programming problems has been 
explored based on executing student solutions and comparing 
them to a reference solution [13] [19]. An interesting parallel 
to our work is found in [13], where the instructor is asked 
to specify the kinds of mistakes students can make. These 
approaches are limited to code and don’t provide feedback 
on the problem-solving process of a student. 


Extracting expert-written generative models for inference 
has seen enormous use in fields where domain expertise is 
critical. Some key examples include medical diagnosis, 
engineering, ecology, and finance, where a generative model like 
a Bayesian network is elicited from experts [20] [22]. In 
education, instructors have domain expertise about students, 
and Idea2Text serves as an easy-to-use generative model for 
instructors to encode this expertise. 


Inference over decision trajectories of Idea2Text simulators 
is similar to “compiled inference” for execution traces in 
probabilistic programs. As such, our inference engine shares 
similarities to this literature [25] [17]. With Idea2Text 
simulators, we get a nice interpretation of compiled inference as 
a parsing algorithm. 


3. BACKGROUND 


In this section, we introduce the feedback challenge and 
what makes it a difficult machine learning problem. 


3.1 Feedback as a Prediction Problem 


The feedback prediction task is to automatically provide feed- 
back to a given student solution. While this is easy to do for 
simple multiple-choice problems, we focus on the challeng- 
ing task of providing feedback on richly-structured problems 
like computer programming or short-answer responses. 


Both the type of student solutions and the type of feedback 
required for the task can take many forms. A student solu- 
tion can be a piece of text, which could represent problems 
like an essay, a maths proof, or a code snippet. It could 
also be graphical output in the form of an image. Similarly, 
feedback can take the form of something simple like classify- 
ing solutions to a fixed set of misconceptions, or something 
complex such as highlighting and annotating specific parts 
of a student solution. 


3.2 Difficulty of Automated Feedback 


Feedback prediction on richly structured problems has been 
an extremely difficult challenge in education research. Even 
limited to simple problems in computer science like beginner 
block-based programming, automated solutions to providing 
feedback have been restricted by scarce data and lack of ro- 
bustness. We discuss a few of the properties of student work 
that make predicting feedback such a difficult challenge. 


(1) Heavy-tailed Distributions: Student work in the form 
of natural language, mathematical symbols, or code follows 
heavy-tailed Zipf distributions. This means that a few 
solutions are extremely common whereas almost all other 
examples are unique and show up rarely. Fig. 2 plots the 
log-frequency of unique examples against the log of the rank 
across four datasets of student work in block-based programming 
code, Java code, and free response. For all datasets, 
we observe a linear relationship in log-log space, which is a 
characteristic property of Zipf distributions. 
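

The rank-frequency check described above can be sketched as follows. This is an illustrative outline on synthetic data, not the analysis behind Fig. 2: the function names are ours, and the slope is a plain least-squares fit in log-log space.

```python
import math
from collections import Counter


def rank_frequency(solutions):
    """Return (rank, count) pairs, with rank 1 for the most common solution."""
    counts = sorted(Counter(solutions).values(), reverse=True)
    return list(enumerate(counts, start=1))


def loglog_slope(pairs):
    """Least-squares slope of log(count) against log(rank)."""
    xs = [math.log(rank) for rank, _ in pairs]
    ys = [math.log(count) for _, count in pairs]
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    num = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    den = sum((x - mx) ** 2 for x in xs)
    return num / den


# Synthetic submissions whose counts fall off as 60 / rank.
submissions = []
for rank in range(1, 6):
    submissions.extend([f"solution_{rank}"] * (60 // rank))

pairs = rank_frequency(submissions)
slope = loglog_slope(pairs)
```

For an idealised Zipf distribution with exponent 1 the fitted slope is -1; real datasets only approximate a straight line.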


These heavy-tailed Zipf distributions pose a hard generali- 
sation problem for traditional supervised machine learning: 
a handful of similar examples appear very frequently in the 
training data whereas almost all other examples are unique. 
This means at test time, examples are likely to introduce un- 
seen tokens, new misconceptions, and novel student strate- 
gies. In a Zipf distribution, even if we observe a million 
student solutions, there is roughly a 15% chance that the 
next student generates a unique solution. 
Figure 2: Student solutions (across many domains) exhibit 
heavy-tailed Zipf distributions, meaning a few solutions are 
extremely common but all other solutions are highly varied and 
show up rarely. This suggests that the probability of a student 
submission not being present in a dataset is high, making 
supervised learning on a small data set difficult. 


(2) Difficulty of Annotation: Annotating student work with 
feedback requires instructor-level expertise in the domain.
Providing fine-grained feedback also takes effort to read and 
understand student solutions before inferring possible mis- 
conceptions. In [32], the authors found that it took 26 hours 
to label 800 student solutions to block-based programming. 


This difficulty, combined with Zipf properties, makes super- 
vised learning challenging. As an example, in 2014, Code.org, 
a widely used resource for beginners in computer science, 
ran an initiative to crowdsource thousands of instructors 
to label 55,000 student solutions in a block-based program- 
ming language. Yet, despite having access to an unprece- 
dented amount of labelled data, traditional supervised meth- 
ods failed to perform well on even these “simple” questions. 


(3) Limitations on Data Size: Even if the Code.org approach 
succeeded, most classrooms do not share the same scale as 
Code.org. A method that relies heavily on historical data is 
not widely applicable in the average classroom setting. In 
our experiments, our data sets contain fewer than a few
hundred examples, again disqualifying the application of
supervised algorithms. The ideal feedback model will be zero-shot
so that it works even for the very first student. 


4. MODELLING STUDENT COGNITION 


Having presented the feedback challenge and motivated why 
supervised learning cannot solve this problem alone, we dis- 
cuss the idea of modelling the cognitive process of student 
decision-making when producing a solution. 


When an instructor provides feedback on student work, they 
often have a latent mental model of student decision making; 
this captures the kinds of steps and mistakes they think 
students will make and what solutions are indicative of those 
steps. The instructor (or TAs) then grade solutions, one at 
a time, by essentially inferring the steps in the decision- 
making model that lead to the produced solution. 


As a concrete example from introductory programming,
suppose a student is trying to print a countdown of numbers
from 10 to 1. An instructor understands that as a first step, 
a student might use a for loop or manually write ten print 
statements. Given that the student uses a loop, they could 
increment up or down. And given that they increment down, 
their loop comparison could be correct or off-by-one. At each 
of these decision points, the instructor can conceive how a 
specific choice would manifest in the solution. 

We want to allow instructors to capture their mental model
of student decision-making for a given problem and distil 
it into a concrete executable simulator that generates solu- 
tions. Such a generative model could be used to simulate 
unlimited student data, including their decision process and 
resulting solutions. This simulated dataset can then be used 
for learning how to provide feedback. 


In this section, we formalise the idea of a student’s decision 
process for generating the solution to a problem and discuss 
how we can represent the instructor’s latent model of this 
generative process as a concrete simulator. 


4.1 Student Decision Process (SDP) 


A student’s decision process (SDP) can be seen as the se- 
quence of choices the student makes while solving a prob- 
lem, and how those choices correspond to different outputs. 
The sequence is intended to reflect all of the critical deci- 
sions made by the student, and in particular those a teacher 
would like to recover from the solution. Importantly, not all 
students will encounter the same sequence of decisions since 
an early decision choice may determine which decisions are 
faced later in problem solving. 


We can formally think of a decision point as a categorical
random variable, $X$, and the specific choices that can be
made as the different values, $x \in \mathrm{Val}(X)$, that the random
variable can take. The decision process of a student can be
seen as a trajectory $(X_t, x_t)_{t=1}^{T}$ of decisions
encountered and made by the student in solving the problem.


Under this interpretation, specifying a model for an SDP
amounts to defining the space of possible decision trajectories
and a probability distribution over this space. By the
chain rule for probabilities, we can decompose this overall
distribution as a sequence of conditional probabilities
$P(X_t = x_t \mid X_{<t} = x_{<t})$. Here $x_t$ is the choice made at step
$t$ from domain $X_t$, and $X_{<t}$ is shorthand for the sequence of
decisions before time $t$.
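
As a minimal sketch of this factorisation, the snippet below scores one decision trajectory for the countdown example by multiplying conditional probabilities. The decision names (`use_loop`, `direction`, `comparison`) and every probability value are invented for illustration and are not taken from any instructor-written model.

```python
def conditional(decision, history):
    """Hypothetical P(X_t = . | X_<t = history) for the countdown example."""
    past = dict(history)
    if decision == "use_loop":
        return {"yes": 0.9, "no": 0.1}
    if decision == "direction":
        return {"down": 0.7, "up": 0.3}
    if decision == "comparison":
        # Context dependence: the error rate depends on the earlier direction choice.
        if past["direction"] == "down":
            return {"correct": 0.8, "off_by_one": 0.2}
        return {"correct": 0.6, "off_by_one": 0.4}
    raise KeyError(decision)


def trajectory_probability(trajectory):
    """Chain rule: product over t of P(X_t = x_t | X_<t = x_<t)."""
    prob = 1.0
    for t, (decision, choice) in enumerate(trajectory):
        prob *= conditional(decision, trajectory[:t])[choice]
    return prob


if __name__ == "__main__":
    trajectory = [("use_loop", "yes"), ("direction", "up"), ("comparison", "off_by_one")]
    print(trajectory_probability(trajectory))
```

The `comparison` distribution is the only place where the history is consulted; a model whose conditionals ignore the history entirely would assign the same off-by-one probability to both loop directions.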


We want to make this general formulation more tractable 
to allow us to specify useful SDPs as generative models we 
can sample from. Prior work has attempted to express 
SDPs by restricting the class of generative models to prob- 
abilistic context free grammars (PCFGs). They found that 
instructor-written PCFGs could often be used to emulate 
student problem-solving and generate student solutions for 
small problems. In this setting, the non-terminal nodes of 
the PCFG represent decisions to be made (e.g. syntax con- 
fusion) and the production rules represent how decisions are 
made and manifested into output (e.g. the code is missing 
a semicolon). Instructors create student simulators by spec- 
ifying decision points, rules, and probabilities for each rule 
(e.g. missing a semicolon is much more likely than missing 
a function statement). 


A PCFG is compact and useful, but makes the independence
assumption that the choice made at time $t$ is
independent of past choices made while solving the problem,
i.e. $P(X_t = x_t \mid X_{<t} = x_{<t}) = P(X_t = x_t)$. This
context-independence is a strong restriction that severely limits what
instructors can express about the student decision-making
process and fails to faithfully model student reasoning. As
can be seen in even the simple countdown example above,
the off-by-one error would manifest differently in student
output depending on whether the student chose to increment
up or down. Thus, context dependence of decision
making is an important property to model.


Algorithm 1 Idea2Text Simulation

Input: Idea2Text simulator $(D, \Sigma, S)$
Output: Tuple $(\tau, y)$ of decision trajectory and output solution.
procedure SIMULATE($D$, $\Sigma$, $S$)
    $\tau \leftarrow [\,]$
    $y \leftarrow$ GENERATE($S$, $\tau$)            ▷ Begin from start node
    return $(\tau, y)$
procedure GENERATE($N$, $\tau$)
    $a, X_a, \Pi_a \leftarrow N$                    ▷ Unpack current decision node
    $x_a, y \leftarrow \Pi_a(X_a, \tau)$            ▷ Get decision choice and output
    $\tau$.append($(a, x_a)$)
    for decision node $N'$ in $y$ do                ▷ In order left to right
        $y' \leftarrow$ GENERATE($N'$, $\tau$)
        $y \leftarrow$ REPLACE($y$, $N'$, $y'$)     ▷ Replace $N'$ with $y'$
    return $y$


4.2 Idea2Text 


In this section, we define a broader class of generative mod- 
els that is powerful enough to capture more complexities of 
expert models of student cognition. Similar to PCFGs we 
structure our models around a set of non-terminal symbols 
that correspond to student decisions and contribute to the 
final output. However, drawing from work on probabilistic 
programs [10, 11], we allow these choices to be made depending
on previous choices. While dependence on context leads
to extremely expressive models, we will show that requiring
some text to be generated at each step is enough for inference
to remain tractable. We call this class of generative
models Idea2Text.²


Concretely, an Idea2Text simulator consists of a tuple
$(D, \Sigma, S)$ denoting a set of nonterminal decision nodes, a set
of terminal nodes, and a starting root node, respectively.
Intuitively, decision nodes correspond to decisions a student
might make and the terminal nodes correspond to literal text
tokens in the final output. Each run of the simulator also
keeps a global state $\tau$, which stores the history of all decisions
made during the execution (often called an execution trace
for probabilistic programs [2] [29]).


Each decision node in $D$ is a tuple $(a, X_a, \Pi_a)$ consisting of
a unique name, a random variable representing the decision
choices, and a production program, which (1) specifies how
this decision should be made based on the decisions made so
far, and (2) produces an output solution for a given decision
choice.


More concretely, the production program is a probabilistic
function that takes the current decision history $\tau$ and does
the following:


(1) Samples a value $x_a \sim X_a$ of the random variable from a
distribution, $P(X_a \mid \tau)$, that can depend on the decision
history.


(2) Based on the sampled choice, $x_a$, produces an output
string, $y$, consisting of literal text (terminal nodes) and
incomplete segments (decision nodes). These incomplete
segments correspond to future decisions that will
be expanded later (see Fig. 3).


(3) Returns the sampled choice and output string: $(x_a, y)$.


Figure 3: An example decision expansion step in an Idea2Text
simulation. The “loop style” decision node chooses a type of for
loop (e.g. increment vs decrement) and outputs a string containing
the header of the for loop (terminals) plus another decision
node.


² A very similar class of models, used in the very different
domain of customer service, was independently named
Idea2Text by scientists at Gamalon, Inc.


An output from Idea2Text is a generation from the root node
to a sequence of literal text by recursively expanding the
decision nodes, as shown in Algorithm 1. Each output is
associated with the final decision trajectory, $\tau = [(a_t, x_{a_t})]_{t=1}^{T}$,
of random variables encountered during generation. Here, $a_t$
denotes the unique name for the random variable encountered
at timestep $t$, and $x_{a_t}$ is the sampled value for that
variable.
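
The recursive expansion described above can be sketched in a few lines of Python. This is a toy two-node simulator for the countdown example, not the authors' implementation: the `[[name]]` placeholder notation for decision nodes, the node names, and all probabilities are assumptions made for this illustration.

```python
import random
import re

# Decision nodes are embedded in output strings as [[name]] placeholders
# (a notation invented for this sketch).
PLACEHOLDER = re.compile(r"\[\[(\w+)\]\]")


def loop_style(trace, rng):
    """Production program: sample a loop direction, emit header text plus a decision node."""
    choice = rng.choices(["down", "up"], weights=[0.7, 0.3])[0]
    if choice == "down":
        return choice, "for (int i = 10; [[condition]]; i--)"
    return choice, "for (int i = 1; [[condition]]; i++)"


def condition(trace, rng):
    """Production program whose output depends on the earlier loop_style decision."""
    direction = dict(trace)["loop_style"]
    choice = rng.choices(["correct", "off_by_one"], weights=[0.8, 0.2])[0]
    text = {
        ("down", "correct"): "i >= 1",
        ("down", "off_by_one"): "i > 1",
        ("up", "correct"): "i <= 10",
        ("up", "off_by_one"): "i < 10",
    }[(direction, choice)]
    return choice, text


PRODUCTIONS = {"loop_style": loop_style, "condition": condition}


def generate(name, trace, rng):
    """Expand one decision node, recursing into nested nodes left to right."""
    choice, output = PRODUCTIONS[name](trace, rng)
    trace.append((name, choice))
    match = PLACEHOLDER.search(output)
    while match is not None:
        expansion = generate(match.group(1), trace, rng)
        output = output[:match.start()] + expansion + output[match.end():]
        match = PLACEHOLDER.search(output)
    return output


def simulate(seed, start="loop_style"):
    """Run the simulator once and return (trajectory, solution text)."""
    rng = random.Random(seed)
    trace = []
    solution = generate(start, trace, rng)
    return trace, solution


if __name__ == "__main__":
    print(simulate(seed=0))
```

Because `condition` reads the trace before choosing its text, the same "off by one" choice yields `i > 1` in a decrementing loop and `i < 10` in an incrementing one, which is the kind of context dependence a PCFG rule cannot express.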


We point out some important properties of Idea2Text simulators.
First, each decision node’s choice can depend on the
decision history $\tau$, allowing past decisions to influence
current behaviour. This is strictly more expressive than PCFGs
[32] and allows instructors to write highly contextual models
of student problem-solving. Second, production programs
have the full power of programming languages at their
disposal and can use arbitrarily complicated transformations to
produce their output sequence. As an example, a production
program can transform terminal nodes into images or use a
machine-learning conjugator to conjugate its produced text
into a proper sentence.


5. INFERENCE 


In this section we describe how we can use an instructor- 
written Idea2Text simulator to learn how to infer the deci- 
sions underlying a student solution. 


At a high-level, the Idea2Text simulator contains the in- 
structor’s mental model for the sequence of decisions stu- 
dents make that result in different solutions. For inference 
we want to do the reverse: given a student solution, we want 
to find a trajectory of decisions in the simulator that would 
produce that solution. 


A model that could successfully do this inference could be 
used to map real student solutions to decision steps in the 
simulator. These extracted decisions are a rich and general 
representation of a student’s solution and can be used for 
downstream tasks such as automated feedback, assisting hu- 
man grading, auditing and interpreting the model decisions, 
and improving the quality of the simulator. 


More formally, let $G$ be a given Idea2Text simulator. Each
execution of $G$ produces a decision trajectory $\tau$ and
corresponding production $y$. Since the execution is probabilistic,
the simulator induces a probability distribution $p_G(\tau, y)$ over
trajectories and productions.


Given a student solution, $y$, we are interested in the task of
parsing: this is the task of mapping $y$ to the most likely
trajectory in the Idea2Text simulator, $\arg\max_{\tau} p_G(\tau \mid y)$, that
could have produced $y$. This is a difficult search problem:
the number of trajectories grows exponentially even for sim- 
ple grammars, and common methods for parsing by dynamic 
programming (Viterbi, CYK) are not applicable in the pres- 
ence of context-sensitivity and functional transformations. 
What’s more, in order to transfer robustly to real student 
solutions, we would like to be able to approximately parse 
solutions that are not possible to generate from the simula- 
tor, but are sufficiently “nearby”. 


At a high level, our approach is to construct a large data 
set from the simulator and then learn an inverse “inference” 
neural net that can reconstruct the decision trajectory from 
the solution. 


5.1 Adaptive Grammar Sampling 

To train our models, we generate a large dataset of $N$
trajectories and their associated productions, $D = \{(\tau^{(m)}, y^{(m)})\}_{m=1}^{N}$,
by repeatedly executing $G$.


However, due to the Zipf-like nature of student work (see 
Sec. 3.2), standard i.i.d. sampling from the simulator will
tend to over-represent the most probable productions. For 
our models, the more diverse student cognition we can sim- 
ulate in the training data, the more we expect to generalise 
to the long tail of real students. Thus, we need sampling 
strategies that prioritise diversity. 


A simple but flawed idea for generating diverse solutions 
would be to make choices at decision nodes uniformly ran- 
domly instead of using the expert-written distributions. This 
approach will generate more unique productions, but disre- 
garding the expert-written distributions will result in un- 
likely and less realistic productions. 


Ideally, we want to sample in a manner that covers all the 
most likely productions first, and then smoothly transition 
into sampling increasingly unlikely productions. This would 
generate unique productions efficiently while also retaining 
the expert-written distributions specified in the production 
programs. With these desiderata in mind, we propose a 
method called Adaptive Grammar Sampling. For each decision
node in the simulator, we down-weight the probability
of sampling each choice in proportion to how many times it
has been sampled in the past. To avoid overly punishing
decision nodes early in the execution trace, we discount this
down-weighting by a decay factor $d$ that depends on the
depth of the decision in the trajectory.³ This method is
inspired by Monte-Carlo Tree Search [5] and shares similarities
with Wang-Landau sampling from statistical physics [27].
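
The paper defers the exact weighting rule to the released code, so the sketch below is only one plausible reading of the description above. The functional form `p / (1 + r * scale * count)`, the depth discount `1 - d ** (depth + 1)`, and the class name `AdaptiveNode` are all assumptions made for illustration; only the idea of count-based down-weighting with a depth-dependent discount comes from the text.

```python
import random


def adapted_probabilities(base_probs, counts, depth, r=1.0, d=0.5):
    """Down-weight each choice according to how often it has been sampled.

    Assumed form (not taken from the paper):
        weight_i = p_i / (1 + r * scale * count_i),  scale = 1 - d ** (depth + 1)
    so that shallow nodes (small depth) are penalised less than deep ones.
    """
    scale = 1.0 - d ** (depth + 1)
    weights = [p / (1.0 + r * scale * c) for p, c in zip(base_probs, counts)]
    total = sum(weights)
    return [w / total for w in weights]


class AdaptiveNode:
    """One decision node that remembers how often each choice was sampled."""

    def __init__(self, choices, base_probs):
        self.choices = list(choices)
        self.base_probs = list(base_probs)
        self.counts = [0] * len(self.choices)

    def sample(self, depth, rng, r=1.0, d=0.5):
        probs = adapted_probabilities(self.base_probs, self.counts, depth, r, d)
        index = rng.choices(range(len(self.choices)), weights=probs)[0]
        self.counts[index] += 1
        return self.choices[index]


if __name__ == "__main__":
    node = AdaptiveNode(["correct", "off_by_one"], [0.9, 0.1])
    rng = random.Random(0)
    print([node.sample(depth=1, rng=rng) for _ in range(10)])
```

With no history the expert-written distribution is used unchanged; as a choice accumulates samples its weight shrinks, so rarer choices are explored progressively more often.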


Fig. 4 shows a comparison of the effectiveness of adaptive
sampling to uniform and i.i.d. sampling. Adaptive sampling
interpolates nicely between sampling likely examples early
on, as i.i.d. sampling does, and sampling unlikely examples
later, as uniform-choice sampling does. Note that adaptive
sampling is customisable: as shown in Fig. 4, the algorithm has
parameters ($r$ and $d$) that can be adjusted to control how
fast we explore increasingly unlikely productions.


Figure 4: Efficiency of sampling strategies for the Liftoff
simulator. (a) Number of unique samples vs total samples so far. (b)
Probability of sampling a unique next program given samples so
far. (c) Likelihood of generated samples over time for different
sampling strategies.


³ The details can be found in the code.


5.2 Neural Approximate Parsing 

With good diverse samples available, we now aim to learn 
an approximation to the posterior $p_G(\tau \mid y)$. We will do so
by training a deep neural network to reconstruct the trajec- 
tory step-by-step. We call this approach neural approximate 
parsing with generative grading, or GG-NAP. 


The challenge of inference over trajectories is a difficult one. 
Trajectories can vary in length and contain decision nodes 
with different support. To approach this, we decompose 
the inference task into a set of easier sub-tasks, similar to 
[17]. The posterior distribution over a trajectory
$\tau = (a_t, x_{a_t})_{t=1}^{T}$ given a production $y$ can be written as the
product of individual posteriors over each decision node $x_{a_t}$ using
the chain rule:


$$p_G(x_{a_1}, \ldots, x_{a_T} \mid y) = \prod_{t=1}^{T} p_G(x_{a_t} \mid y, x_{<a_t}) \qquad (1)$$


where $x_{<a_t}$ denotes previous (possibly non-contiguous)
nonterminals $(x_{a_1}, \ldots, x_{a_{t-1}})$. Eqn. 1 shows that we can learn
each posterior $p(x_{a_t} \mid x_{<a_t}, y)$ separately. With an
autoregressive model $M$, we can efficiently represent the influence
of previous nonterminals $x_{<a_t}$ using a shared hidden
representation over $T$ timesteps. Since most standard choices for
$M$ (e.g. an RNN) require fixed-dimension inputs, we need to
encode the solution and the history of choices into consistent
vectors.


Firstly, to encode the solution $y$, we use standard machinery
(e.g. CNNs for images, RNNs for text) with a fixed output
dimension. To represent the nonterminal choices with different
support, we define three layers for each random variable
$x_{a_t}$: (1) a one-hot embedding layer that uses the unique
name $a_t$ to lexically identify the random variable, (2) a value
embedding layer that maps the value of $x_{a_t}$ to a fixed-dimension
vector, and (3) a value decoding layer that transforms
the hidden output state of $M$ into parameters of the
posterior for the next nonterminal $x_{a_{t+1}}$. Thus, the input to
$M$ has a fixed size, being the concatenation of the value
embedding, name embedding, and production encoding.⁴


To train the GG-NAP, we optimise the objective


$$\mathcal{L}(\theta) = \mathbb{E}_{p_G(\tau, y)}\left[\log p_\theta(\tau \mid y)\right] \approx \frac{1}{N} \sum_{m=1}^{N} \log p_\theta(\tau^{(m)} \mid y^{(m)}) \qquad (2)$$


where $\theta$ are all trainable parameters and $p_\theta(\tau \mid y)$ represents
the posterior distribution defined by the inference engine. At
test time, given only a production $y$, GG-NAP recursively
samples $x_{a_t} \sim p_\theta(x_{a_t} \mid y, x_{<a_t})$ for $t = 1, \ldots, T$ and uses each
sample as the input to the next step in $M$, as is standard
for sequence generation models [12].
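
The training objective and the test-time decoding loop can be sketched without any deep-learning library by treating the network as a black-box callable. In the snippet below `toy_posterior` is a hypothetical hand-written stand-in for the trained model, and its decision names and probabilities are invented; only the structure (summing per-decision log-probabilities as in Eqn. 1, averaging over simulated pairs as in Eqn. 2, and feeding each sampled choice back as history) follows the text.

```python
import math
import random


def toy_posterior(solution, history):
    """Hypothetical stand-in for the trained network: returns the name of the
    next decision and a distribution over its choices, or None when finished."""
    if len(history) == 0:
        down = 0.95 if "i--" in solution else 0.05
        return "loop_style", {"down": down, "up": 1.0 - down}
    if len(history) == 1:
        strict = (">" in solution and ">=" not in solution) or (
            "<" in solution and "<=" not in solution)
        off = 0.9 if strict else 0.1
        return "condition", {"correct": 1.0 - off, "off_by_one": off}
    return None


def log_posterior(posterior, solution, trajectory):
    """log p(tau | y) as a sum of per-decision conditional log-probabilities."""
    total = 0.0
    for t, (name, choice) in enumerate(trajectory):
        _, probs = posterior(solution, trajectory[:t])
        total += math.log(probs[choice])
    return total


def objective(posterior, dataset):
    """Monte Carlo estimate of the objective: mean log-posterior over
    simulated (trajectory, production) pairs."""
    return sum(log_posterior(posterior, y, tau) for tau, y in dataset) / len(dataset)


def parse(posterior, solution, rng):
    """Test-time decoding: sample each decision and feed it back as history."""
    history = []
    step = posterior(solution, history)
    while step is not None:
        name, probs = step
        choice = rng.choices(list(probs), weights=list(probs.values()))[0]
        history.append((name, choice))
        step = posterior(solution, history)
    return history


if __name__ == "__main__":
    y = "for (int i = 10; i > 1; i--)"
    print(parse(toy_posterior, y, random.Random(0)))
```

In a real system the callable would wrap the recurrent model and training would maximise `objective` by gradient ascent on its parameters; the toy version has no parameters and is only meant to show the data flow.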


5.3 kNN Baseline


As a strong baseline for the parsing task, we consider a nearest
neighbour classifier. We store our large dataset of samples
$D = \{(\tau^{(m)}, y^{(m)})\}_{m=1}^{N}$. At test time, given an input
solution to parse, we can find its nearest neighbour in the 
samples with a linear search of D, and return its associated 
trajectory. Depending on the problem, the solutions y will 
be in a different output space (image, text) and thus the dis- 
tance metric used for the nearest-neighbour search will be 
domain dependent. We refer to this baseline as GG-kNN. 
Note that GG-kNN is quite costly in memory and runtime 
as it needs to store and iterate through all samples in the 
dataset. 
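
For text productions, the baseline amounts to a linear scan with a string distance. The sketch below assumes Levenshtein edit distance as the metric and uses invented sample data; the function names are illustrative and the distance would need to be swapped for an image metric in the graphics setting.

```python
def levenshtein(a, b):
    """Edit distance between two strings via the standard dynamic programme."""
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            substitution = previous[j - 1] + (char_a != char_b)
            current.append(min(previous[j] + 1, current[j - 1] + 1, substitution))
        previous = current
    return previous[-1]


def nearest_trajectory(solution, samples, distance=levenshtein):
    """Linear search over (trajectory, production) pairs; returns the
    trajectory attached to the closest production."""
    best_trajectory, _ = min(samples, key=lambda pair: distance(solution, pair[1]))
    return best_trajectory


if __name__ == "__main__":
    samples = [
        ([("loop_style", "down")], "for (int i = 10; i >= 1; i--)"),
        ([("loop_style", "up")], "for (int i = 1; i <= 10; i++)"),
    ]
    print(nearest_trajectory("for (int j = 10; j >= 1; j--)", samples))
```

Each query costs one distance computation per stored sample, which illustrates why the baseline is expensive in both memory and runtime.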


6. EXPERIMENTS 


We test generative grading on a suite of education data 
sets focusing on introductory courses from online platforms 
and large universities. For each dataset, we compare our 
approach to supervised learning, PCFGs, k-nearest neighbours,
and human performance. In Sec. 6.1 we introduce
the data sets, then present results in Sec. 6.3.


6.1 Datasets 


We consider four educational contexts. Fig. 5 shows example
student solutions for each problem. 


Block-based Programming Code.org released a data set of
student responses to eight Blockly exercises from one of their
online curricula, which focuses on drawing shapes with
nested loops. We take the last problem in the curriculum
(the most difficult one), drawing polygons with an increasing
number of sides, which has 302 human-graded responses
with 26 misconceptions regarding looping and geometry (e.g.
“missing for loop” or “incorrect angle”) from [32].


Free Response Language Powergrading contains 700 responses
to a United States citizenship exam, each graded for
correctness by 3 humans. Responses are in natural language,
but are typically short (average of 4.2 words). We focus on
the most difficult question, as measured by [24]: “name one
reason the original colonists came to America”. Correct
responses span economic, political, and religious reasons.


Figure 5: We show the prompt and example solutions for our
four datasets.


⁴ Specific details can be found in the code.


Graphics Programming PyramidSnapshot is a university CS1 
course assignment intended to be a student’s first exposure 
to variables, objects, and loops. The task is to build a pyra- 
mid using Java’s ACM graphics library by placing individ- 
ual blocks. The dataset is composed of images of rendered 
pyramids from intermediary “snapshots” of student work. 
[33] annotated 12k unique snapshots with 5 categories rep- 
resenting “knowledge stages” of understanding. 


University Programming Assignment Liftoff is a second as- 
signment from a university CS1 course that tests looping. 
Students are tasked to write a program that prints a countdown
from 10 to 1 followed by the phrase “Liftoff”. In Sec. 7
we will use Liftoff for a human-in-the-loop study where ex- 
perts generatively grade 176 solutions from a semester of 
students and measure accuracy and grading time. 


6.2 Simulator Descriptions 
We provide a brief overview of the Idea2Text simulators con- 
structed for each domain. 


Block-based Programming The primary innovation is to use
the first decision node random variable to represent student
ability. This ability variable will affect the distributions for
random variables later in the trajectory such as deciding the
loop structure and body. The intuition this captures is that
high ability students make few or no mistakes whereas
low ability students tend to make many correlated
misunderstandings. This simulator contains 52 decision nodes.


Free Response Language Idea2Text simulators over natural 
language need to explain variance in both semantic mean- 
ing and prose. We inspected the first 100 responses to gauge 
student thinking. Procedurally, the first random variable is 
choosing whether the production will be correct or incor- 
rect. It then chooses a subject, verb, and noun dependent 
on the correctness. Correct answers lead to topics like re- 
ligion, politics, and economics while incorrect answers are 


about taxation, exploration, or physical goods. Finally, we 
add a random variable to decide a writing style to craft a 
sentence. To capture variations in tense, we use a conjuga- 
tor for the final production. This simulator contains 53 
decision nodes. 


Graphics Programming The primary decision in this simula- 
tor decides between 13 “strategies” (e.g. making a parallel- 
ogram, right triangle, a brick wall, etc.) that the instructor 
believed students would use. Each of the 13 options leads to 
its own set of nodes that are responsible for deciding shape, 
location, and colour. The production uses Java to render an 
image output. This simulator contains 121 decision nodes, 
and required looking at 200 unlabelled student solutions in 
its design. 


University Programming Assignment To model student think- 
ing on Liftoff, this simulator first determines whether to use 
a loop, and, if so, chooses between “for” and “while” loop 
structures. It then formulates the loop syntax, choosing a 
condition statement and whether to count up or count down. 
Finally, it chooses the syntax of the print statements. No- 
tably, each choice is dependent on previous ones. For exam- 
ple, choosing an end value in a for loop is sensibly condi- 
tioned on a chosen start value. This simulator contains 26 
decision nodes. 


6.3 Results for Feedback Prediction 


We show the results of generative grading for each of the 
datasets above. 


For each dataset, we have access to a set of real student so- 
lutions and corresponding human-provided feedback labels, 
which we use for evaluation. We measure human accuracy 
relative to the majority label. 


We ask instructors to create an Idea2Text simulator for each 
dataset, and train the deep inference network GG-NAP us- 
ing simulated student solutions. At test time, we pass a real 
student solution into the inference model, and get back a 
trajectory of the simulator. This trajectory contains deci- 
sion node choices that correspond to the human-provided 
feedback labels, and we use these as the predicted feedback. 


Our performance metric for evaluating the model’s predicted 
feedback labels is accuracy or F1 score, depending on the 
convention of prior work. Computing an average of the 
metric across the evaluation dataset would over-prioritise 
examples that appear frequently; this is particularly impor- 
tant to avoid for the Zipf distributed solutions. Since we 
care about providing feedback to struggling students in the 
tail of the distribution, we separately calculate performance 
for different “regions” of the Zipf. Specifically, we define the 
head as the k most popular solutions, the tail as solutions 
that appear only once or twice, and the body as the rest. As 
solutions in the head can be trivially memorised, we focus 
on performance on the body and tail. 
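
The region-wise evaluation can be sketched as follows. The head, body, and tail definitions follow the text; the choice to count every submission once within its region, and the function names, are assumptions for this illustration.

```python
from collections import Counter


def zipf_regions(solutions, k):
    """Map each distinct solution to 'head', 'body', or 'tail'.

    head: the k most frequent solutions; tail: solutions seen once or twice;
    body: everything else.
    """
    counts = Counter(solutions)
    head = {solution for solution, _ in counts.most_common(k)}
    regions = {}
    for solution, count in counts.items():
        if solution in head:
            regions[solution] = "head"
        elif count <= 2:
            regions[solution] = "tail"
        else:
            regions[solution] = "body"
    return regions


def accuracy_by_region(solutions, predictions, labels, k):
    """Accuracy computed separately per region (each submission counted once)."""
    regions = zipf_regions(solutions, k)
    correct, total = Counter(), Counter()
    for solution, predicted, gold in zip(solutions, predictions, labels):
        region = regions[solution]
        total[region] += 1
        correct[region] += int(predicted == gold)
    return {region: correct[region] / total[region] for region in total}


if __name__ == "__main__":
    solutions = ["a"] * 5 + ["b"] * 3 + ["c"] * 2 + ["d"]
    print(zipf_regions(solutions, k=1))
```

Reporting the three numbers separately prevents a model that has only memorised the head from appearing accurate overall.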


Training Details We report averages over three runs; error
bars are shown in Fig. 6. We use a batch size of 64 and train for
20 epochs on 100k unique samples adaptively sampled from
the simulator. We optimise using Adam with a learning
rate of 5e-4 and weight decay of 1e-7. For PyramidSnapshot,
we use VGG-11 with Xavier initialisation [9] as
the encoder network. For other data sets, we use a Recurrent
Neural Network (RNN) with 4 layers and a hidden size of
256. The deep inference network itself is an unrolled RNN:
we use a gated recurrent unit with a hidden dimension of
256 and no dropout. The value and index embedding layers
output a vector of dimension 32. These hyperparameters
were chosen using grid search.


Figure 6: Summary of results for providing feedback to student work in three educational contexts: block-based programming, graphics
programming, and free response language. Generative grading shows strong performance in all three settings, closely approximating
human-level performance in two data sets, and surpassing human-level performance in the other.


Code.org As feedback for Code.org exercises has been studied
in prior work [32], we compare generative grading to
a suite of baselines including supervised models trained to
classify misconceptions from the hand-labelled dataset (Output
CNN + Program RNN [32]), unsupervised models
that learn a latent vector representation of student work
(MVAE), and the k-nearest neighbours baseline GG-kNN from
Sec. 5.3. Most relevant to our approach is the “rubric sampling”
comparison, which uses a PCFG to simulate students
and generate a supervised data set to train an RNN
classifier. Human accuracy is measured by comparing the
feedback of multiple annotators to the majority label.


As shown in Fig. 6, generative grading is able to provide
accurate feedback (historically measured as F1) beyond the
level of individual human annotators, setting a new state
of the art. We observe a large improvement over prior work,
which performs significantly worse than human graders.
Compared to rubric sampling, we find an 18% (absolute)
improvement in the body and a 30% (absolute) improvement in the
tail. This clearly demonstrates the practical importance of
being context-sensitive. The global state of Idea2Text
simulators allows us to easily write richer generative models that
are capable of better simulating real students. The potential
impact of a human-level autonomous grader is large:
Code.org is used by 610 million students, and our approach
could save thousands of human hours for teachers by providing
the same quality of feedback at scale.


Powergrading We find similarly strong performance on the
Powergrading corpus of short answer responses to a citizenship
question. Fig. 6 shows that generative grading reaches an
F1 score of 0.93, an increase of 0.35 points above prior work
that used hand-crafted features to predict correctness [6],
and 0.38 points above supervised neural networks [24]. We
were unable to compare to rubric sampling [31] as it was too
difficult to write a faithful PCFG to describe free response
language. Generative grading takes a large step towards
closing the gap to human performance (F1 = 0.97). We are
especially optimistic about these results because Powergrading
responses contain natural language: this is a promising signal
that ideas from generative grading could generalise beyond
computer science education.


PyramidSnapshot Investigating a third modality of image 
output from a graphics assignment, we find similar results 
comparing generative grading to the k-nearest neighbour 
baseline and a VGG image classifier presented in [33], out- 
performing the latter by nearly 50% absolute. 


Unlike other datasets, the PyramidSnapshot dataset includes 
students’ intermediary work, showing stages of progression
through multiple attempts at solving the problem. With our 
near-human level performance, instructors could use GG- 
NAP to measure student cognitive understanding over time 
as students work. This builds in a real-time feedback loop 
between the student and teacher that enables a quick and 
accurate way of assessing teaching quality and characteris- 
ing both individual and classroom learning progress. From a 
technical perspective, since PyramidSnapshot only includes 
rendered images (and not student code), generative grad- 
ing was responsible for parsing student solutions from just 
images alone, a feat not possible without the flexibility of 
probabilistic programs used in Idea2Text. For this reason, 
we could not apply rubric sampling in this context either. 


7. EXTENSIONS 


Our results show that generative grading is a powerful tool 
for the feedback prediction task. However, our system is 
much more general than this and has many interesting ex- 
tensions that we discuss in this section. 

Figure 7: CDF of Levenshtein edit distance between student
programs and nearest-neighbours using various algorithms. 


7.1 Nearest In-Simulator Neighbour 

As described in Section 5, our inference model learns to map
a given solution, $y$, to a decision trajectory in the simulator.
So far, we have used this only to provide feedback with
a fixed set of labels. However, the decision trajectory has
a much more powerful interpretation; it represents the
sequence of decisions in the simulator that the inference model
thinks will produce $y$. Since we have the simulator in hand,
we can actually execute it with this predicted sequence of
decisions and inspect the simulated output, $\hat{y}$.


If the simulated output is exactly equal to the original input
solution, i.e. if $y = \hat{y}$, then we can make a strong claim: the
predicted trajectory from the inference model was provably
correct and the corresponding labels can be assigned with
100% confidence. This is a claim that is seldom possible
with traditional supervised learning methods and advances
efforts towards creating explainable AI.


What about when $y \neq \hat{y}$? In this case, the simulated output
is not an exact match to the student solution, but we can still
treat it as a “nearest in-simulator neighbour” to $y$. Fig. 7
shows the quality of these nearest neighbours to the student
solutions using a distance metric like edit distance.
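
The verification step can be sketched as a deterministic replay of the predicted decisions followed by a string comparison. The `render` function below is a hypothetical two-decision countdown simulator written for this illustration, with the choices forced rather than sampled; it is not the simulator used in the experiments.

```python
def render(trajectory):
    """Hypothetical deterministic replay of a two-decision countdown simulator:
    the choices are taken from the trajectory instead of being sampled."""
    choices = dict(trajectory)
    if choices["loop_style"] == "down":
        header = "for (int i = 10; {}; i--)"
        condition = {"correct": "i >= 1", "off_by_one": "i > 1"}
    else:
        header = "for (int i = 1; {}; i++)"
        condition = {"correct": "i <= 10", "off_by_one": "i < 10"}
    return header.format(condition[choices["condition"]])


def check_parse(student_solution, predicted_trajectory):
    """Re-execute the simulator on the predicted decisions and compare outputs."""
    simulated = render(predicted_trajectory)
    return {
        "simulated": simulated,
        "exact_match": simulated == student_solution,
        "labels": dict(predicted_trajectory),
    }


if __name__ == "__main__":
    predicted = [("loop_style", "down"), ("condition", "off_by_one")]
    print(check_parse("for (int i = 10; i > 1; i--)", predicted))
```

When `exact_match` is true the labels are verified by construction; otherwise `simulated` serves as the nearest in-simulator neighbour that can be compared against the student solution.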


As we show below, these nearest neighbours can be used for 
powerful forms of feedback mechanisms. 


7.2 Human-in-the-loop Grading

In a real-world setting, predicting feedback labels could be 
unreliable due to the high risk of giving students incorrect 
feedback. Beyond automated feedback, we explore how gen- 
erative grading can be used to make human graders more 
effective using a human-in-the-loop approach. 


To do this, we created a human-in-the-loop grading system
using GG-NAP. For each student solution, we use the inference
model to find the nearest in-simulator neighbour
(Sec. 7.1); this nearest neighbour already has associated
labels that are correct for the nearest neighbour. A human
grader is presented with the original student solution, as
well as a diff to the nearest neighbour; the grader then
adjusts the labels of the nearest neighbour based on the diff to
determine grades for the real solution. We show an image
of the user interface of this system in Fig. 8.


Figure 8: Human-in-the-loop Generative Grading UI


We investigated the impact of this human-in-the-loop system
on grading accuracy and speed in a real classroom setting.
We hired a cohort of expert graders (teaching assistants
for a large private university) who graded 30 real
student solutions to Liftoff. For control, half the graders
proceeded traditionally, assigning a set of feedback labels
by just inspecting the student solutions. The other half of
graders additionally had access to (1) the feedback assigned
to the nearest neighbour by GG-NAP and (2) a code differential
between the student program and the nearest neighbour.
Some example feedback labels included “off by one
increment”, “uses while loop”, or “confused > with <”. All
grading was done on a web application that kept track of
the time taken to grade a problem.


We found that the average time for graders using our system
was 507 seconds while the average time using traditional
grading was 1130 seconds, a reduction of more than half.
Moreover, with our system, only 3 grading errors
(out of 30) were made with respect to gold-standard feedback
given by the course professor, compared to the 8 errors
made with traditional grading. Fig. 9a shows these results
for each of the 30 solutions.
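
The reported figures can be checked with a little arithmetic. The numbers below are taken from the text; the variable names are ours and purely illustrative.

```python
time_with_system = 507     # average grading time in seconds, with GG-NAP support
time_traditional = 1130    # average grading time in seconds, traditional grading
errors_with_system = 3     # grading errors out of 30 solutions
errors_traditional = 8
num_solutions = 30

# How many times faster grading was with the system.
speedup = time_traditional / time_with_system
# Fraction of grading time saved relative to traditional grading.
time_saved_fraction = 1 - time_with_system / time_traditional
# Error rates with respect to the gold-standard feedback.
error_rate_with_system = errors_with_system / num_solutions
error_rate_traditional = errors_traditional / num_solutions
```

The speedup is a little over 2.2x, which corresponds to saving roughly 55% of the grading time, and the error rate falls from about 27% to 10%.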


The improved performance stems from the semantically meaningful
nearest neighbours provided by GG-NAP. Having access
to graded nearest neighbours helps increase grader efficiency
and reliability by allowing them to focus on only
“grading the diff” between the real solution and the nearest
neighbour. By more than halving both the number of errors and
the amount of time, GG-NAP can have a large impact in
classrooms today, saving instructors and teaching assistants
unnecessary hours and worry over grading assignments.


7.3 Highlighting feedback in student solutions. 
The inferred decision trajectory for a student solution can 
also be used to provide “dense” feedback that highlights the 
section of the code or text responsible for each misunder- 
standing. This would be much more effective for student 
learning than vague error messages currently found on most 
online education platforms. 


To achieve this, we leverage the fact that each decision node
in the simulator gets recursively expanded to produce the
final solution. This means it is easy to track the portions of
the output that each decision node is responsible for. For
decision nodes related to student confusions, we can highlight
the portion of the output in the student solution which
corresponds to this confusion. Fig. 9b shows a random program
with automated, segment-specific feedback given by
GG-NAP. This level of explainability is sorely needed in both
education and AI.


Figure 9: (a) Plot of average time taken to grade 30 student solutions to Liftoff. Generative grading reduces grading time for 26 out of
30 solutions. The amount of time saved correlates with the token edit distance (yellow) to the nearest neighbour in the simulator. (b)
Our approach allows for automatically highlighting which part of the student solution is responsible for a predicted misconception. (c)
Given a Liftoff simulator that is missing a key “decrement loop” decision, we can automatically find decision nodes where inference often
fails on real student solutions. The highest scoring decision nodes are all correctly related to looping.
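
The span tracking described above can be sketched as follows. This is a minimal illustration, not the Idea2Text API: the tree encoding (plain strings for literal text, `(decision_name, children)` pairs for decision nodes) and the decision names are assumptions made for the example.

```python
def expand(node, out, spans):
    """Recursively expand a node, recording the output span each decision produced.

    A node is either a plain string (literal text) or a
    (decision_name, children) pair.
    """
    if isinstance(node, str):
        out.append(node)
        return
    name, children = node
    start = sum(len(piece) for piece in out)
    for child in children:
        expand(child, out, spans)
    end = sum(len(piece) for piece in out)
    spans.append((name, start, end))


def render_with_spans(tree):
    """Return the generated text and a list of (decision_name, start, end) spans."""
    out, spans = [], []
    expand(tree, out, spans)
    return "".join(out), spans


# Hypothetical trajectory containing two confusion-related decisions.
tree = ("program", [
    "for (int i = ",
    ("off_by_one_start", ["9"]),
    "; i ",
    ("confuses_gt_lt", ["<"]),
    " 0; i--) println(i);",
])
text, spans = render_with_spans(tree)
```

Each span gives the character range to highlight for the corresponding decision, so a confusion label can be shown next to the exact fragment that triggered it.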


7.4 Automatically Improving Simulators 
Building Idea2Text simulators is an iterative process; a user 
wishing to improve their simulator would want a sense of 
where it is lacking. Fortunately, given a set of difficult exam- 
ples where GG-NAP does poorly, we can deduce the decision 
nodes in the simulator that consistently lead to mistakes and 
use these to suggest components to improve. 


To do this, for each nearest neighbour to a student solu- 
tion we can find decision nodes that cause substring mis- 
matches in the student solution, using regular expressions. 
This is possible because each decision node is responsible for 
a scoped substring in the nearest neighbour output solution 
(Sec. 7.3). By finding the decision nodes where the substring
often differs between the neighbour and the solution,
we can identify decisions that often cause mismatches.


To illustrate this, we took the Liftoff simulator, which contains
a crucial decision node that decides between incrementing
up or down in a “for” loop, and removed the option of
incrementing down. We trained GG-NAP on this smaller
simulator, and used a scoring mechanism to identify relevant
decision nodes responsible for failing to parse student
solutions that “increment down”. Fig. 9c shows the distribution
over which nodes GG-NAP believes to be responsible
for the failed parses. The top 6 decisions that GG-NAP
picked out all rightfully relate to looping and increments.
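
One way such a scoring mechanism could look is sketched below. The paper does not specify its scoring function, so the mismatch-rate score, the `mismatch_scores` helper, and the node names are assumptions for illustration; only the idea of matching each node's scoped substring against the student solution with regular expressions comes from the text.

```python
import re
from collections import Counter


def mismatch_scores(examples):
    """Score each decision node by how often its substring is missing.

    `examples` is a list of (student_solution, node_substrings) pairs, where
    node_substrings maps a decision-node name to the substring that node
    produced in the nearest neighbour.
    """
    mismatches, totals = Counter(), Counter()
    for student_solution, node_substrings in examples:
        for node, substring in node_substrings.items():
            totals[node] += 1
            # Allow flexible whitespace between the tokens of the substring.
            pattern = r"\s*".join(re.escape(tok) for tok in substring.split())
            if not re.search(pattern, student_solution):
                mismatches[node] += 1
    return {node: mismatches[node] / totals[node] for node in totals}
```

Nodes with the highest scores are the ones whose output most often disagrees with real student solutions, which makes them natural candidates for new options in the simulator.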


8. LIMITATIONS AND FUTURE WORK 


Cost of writing good simulators. One of the most critical 
steps in our approach is the ability to write good Idea2Text 
simulators. Writing a good simulator does not require spe- 
cial expertise and can be undertaken by a novice in a short 
time. For instance, the PyramidSnapshot simulator that 
sets the new state of the art was written by a first-year un- 
dergraduate within a day. Furthermore, many aspects of 
simulators are re-usable: similar problems will share non- 
terminals and some invariances (e.g. the nonterminals that 
capture different ways of writing for loops are the same ev- 
erywhere). This means every additional grammar is easier 
to write since it likely shares a lot in structure with existing
grammars. Moreover, compared to weeks spent hand-labelling
data, the cost of writing a grammar is orders of
magnitude cheaper and leads to much better performance.


That being said, we believe there is room for interesting 
future work that explores how to make grammars easy to 
write and improve, with the extension in Sec. already 
making some headway in this direction. There is also room 
for better formalising which types of problem domains can 
be faithfully modelled with Idea2Text simulators, and which 
domains are infeasible, like general essay writing. Lastly, 
more sophisticated inference approaches could be explored 
for handling semantic invariances in student output such as 
code reordering or variable renaming. 


Connections to IRT. We find an interesting parallel of our 
work to Item Response Theory (IRT). IRT is essentially an 
extremely simple generative model that relates a student
parameter $\theta$ to the probability of getting a question correct
or incorrect. Some of our Idea2Text simulators also incorporate
a student ability parameter $\theta$ to dictate likelihoods
of making mistakes at different decisions, and can thus be
seen as a more expressive and nuanced extension of the IRT 
generative model. Exploring this further is an interesting 
direction of research. 
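
For readers unfamiliar with IRT, its simplest form can be written in a few lines. The text does not say which IRT variant is meant, so the one-parameter logistic form and the `difficulty` parameter below are assumptions used only to make the parallel concrete.

```python
import math


def p_correct(theta, difficulty):
    """One-parameter logistic IRT: probability that a student with ability
    theta answers an item of the given difficulty correctly."""
    return 1.0 / (1.0 + math.exp(-(theta - difficulty)))
```

A simulator that uses an ability parameter in the same spirit would apply a probability of this kind at each decision node rather than once per question.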


Generating questions with Idea2Text. We use Idea2Text sim- 
ulators to model student decision-making and corresponding 
example solutions. This could be used to automatically gen- 
erate example solutions with known issues to show students 
for pedagogical purposes. The Idea2Text library can also
be used to generate questions corresponding to confusions
instead of solutions corresponding to confusions.


9. CONCLUSION 


We proposed a method for providing automated student 
feedback that showed promising results across multiple modal- 
ities and domains. Our proposed feedback system is capable 
of predicting student decisions corresponding to a given so- 
lution, allowing us to do nuanced forms of automated feed- 
back. With it, “generative grading” can be used to automate 
feedback, visualise student approaches for instructors, and 
make grading easier, faster, and more consistent. Although 
more work needs to be done on making powerful grammars 
easier to write, we believe this is an exciting direction for the 
future of education and a step towards combining machine
learning and human-centred artificial intelligence.


ABSTRACT


The need to precisely estimate a student’s academic performance
has been emphasized as increasing attention is paid
to Intelligent Tutoring Systems (ITS). However, since labels
for academic performance, such as test scores, are collected
from outside of the ITS, obtaining the labels is costly. This
leads to a label-scarcity problem, which makes it challenging
to apply machine learning approaches to academic
performance prediction. To this end, inspired by recent
advances in pre-training methods in the natural language
processing community, we propose DPA, a transfer learning
framework with Discriminative Pre-training tasks for
Academic performance prediction. DPA pre-trains two models,
a generator and a discriminator, and fine-tunes the
discriminator on academic performance prediction. In DPA’s
pre-training phase, a sequence of interactions where some
tokens are masked is provided to the generator, which is trained
to reconstruct the original sequence. Then, the discriminator
takes an interaction sequence where the masked tokens
are replaced by the generator’s outputs, and is trained
to predict whether each token in the sequence is original or
replaced. We conduct extensive experimental studies on a real-world
dataset obtained from a multi-platform ITS application and
show that DPA outperforms the previous state-of-the-art
generative pre-training method with a reduction of 4.05%
in mean absolute error and is more robust to increased
label-scarcity.¹


Keywords 
Academic Performance Prediction, Deep Learning, Transfer 
Learning, Discriminative Pre-training 


1. INTRODUCTION 


Predicting a student’s future academic performance is a fundamental
task for developing a modern Intelligent Tutoring
System (ITS), which aims to provide a personalized learning
experience by supporting the educational needs of each individual.
However, labels for academic performance, such as test
scores, are often scarce since they are external to the ITS. For
example, as shown in Figure 1, test scores are not automatically
collected inside the ITS. Obtaining a test score requires a
student to take the test in the designated test center, receive
the score, and report the score to the ITS. Transfer learning is
a commonly taken approach to address such label-scarcity
problems across different domains of machine learning. In
this framework, a model is first pre-trained to optimize auxiliary
objectives with abundant data, and then fine-tuned
on the task of interest. In the Artificial Intelligence in Education
(AIEd) community, Assessment Modeling (AM) was introduced
as a set of pre-training tasks for label-scarce educational
problems, including academic performance prediction.
AM proposed a pre-training method where, first, a masked
interaction sequence is generated by replacing a set of interactive
features that can serve as criteria for pedagogical
evaluation with artificial mask tokens. Then, given the
masked interaction sequence, a model is pre-trained to predict
the masked interactive features. The idea was borrowed
from the Masked Language Modeling (MLM) pre-training
method proposed in [7]. In the MLM pre-training method,
given a masked word sequence where some words in the sequence
are replaced with an artificial mask token, a model is
pre-trained to predict the masked words. However, recently,
[6] pointed out that the MLM pre-training method has poor
sample efficiency and suffers from pre-train/fine-tune discrepancy
due to the artificial mask token, and proposed a
new discriminative pre-training method. Since these
problems are also inherent in AM, applying the discriminative
pre-training method to academic performance prediction
is expected to yield gains.


Figure 1: Interactive features, such as student response and
elapsed time for the response, are automatically recorded to
the database whenever a student interacts with ITS. On the
other hand, more complicated steps are necessary to obtain a
test score: a student should take the test in the designated
test center, receive the test score, and report the score to
ITS.


¹ For more detailed descriptions of experimental settings and
results, please refer to the arXiv version of this paper.


To this end, we propose DPA, a transfer learning framework 
with Discriminative Pre-training tasks for Academic perfor- 
mance prediction. There are two models in DPA: a generator 
and a discriminator. In DPA’s pre-training phase, the gen- 
erator is trained to predict the masked interactive features 
in the same way as AM. Then, given a replaced interac- 
tion sequence which is generated by replacing the masked 
features with the generator’s outputs, the discriminator is 
trained to predict whether each token in the sequence is the 
same as the one in the original interaction sequence. After 
the pre-training, the generator is thrown away and only the 
discriminator is fine-tuned on academic performance predic- 
tion. Also, we investigate diverse pre-training tasks for the 
generator and show that pre-training the generator to pre- 
dict a student’s response is more effective than to predict 
the correctness and timeliness of their response which were 
considered as the most pedagogical interactive features in 
AM. Extensive experimental studies conducted on a real-world
dataset collected from a multi-platform ITS application
show that DPA outperforms AM with a reduction
of 4.05% in Mean Absolute Error (MAE) and is more robust
when the degree of label-scarcity increases.


2. SANTA: A SELF-STUDY SOLUTION 
EQUIPPED WITH AN AI TUTOR FOR 
ENGLISH EDUCATION 


In this paper, we conduct experiments on a real-world dataset
obtained from Santa², a multi-platform ITS with more than
a million users in South Korea, available through Android,
iOS, and Web, that exclusively focuses on the Test of English
for International Communication (TOEIC) standardized
examination. The publicly accessible version of the dataset
was released under the name EdNet [4]. The TOEIC consists
of two timed sections, Listening Comprehension (LC)
and Reading Comprehension (RC). There are a total of 100
multiple choice exercises in each section, and the total score
for each section is 495 in steps of 5 points. Santa provides
learning experiences of solving exercises, studying explanations,
and watching lectures. When a student consumes a
specific piece of learning content, Santa diagnoses their current
academic status based on their learning activity records and
recommends other learning content appropriate for their
current position. Santa records diverse types of interactive
features, such as student response, the duration of time the
student took to respond, and the time interval between the
current and previous learning activities. However, unlike
the interactive features automatically collected from Santa,
obtaining the official TOEIC score requires more steps: a
student should register and pay for the test, take the test
in the designated test center, receive the test score from the
Educational Testing Service, and report the score to Santa
(Figure 1). Santa collected students’ TOEIC score data by
offering small gifts to students when they report their scores.


² https://aitutorsanta.com


3. TRANSFER LEARNING FOR ACADEMIC 
TEST PERFORMANCE PREDICTION 


To overcome the label-scarcity problem in academic test performance
prediction, we consider the burgeoning machine learning
discipline of transfer learning. There is an open issue of
what information to transfer, or which pre-training task is
the most effective for academic test performance prediction.
Previous studies proposed two types of pre-training methods
for AIEd tasks: interaction-based methods, which model
students’ dynamic learning behaviors [3], and
content-based methods, which learn representations of learning
contents [14] [23] [19] [24] [29]. It has been shown that the
interaction-based pre-training method outperforms content-based
pre-training methods when the pre-trained model is fine-tuned
on several label-scarce educational tasks, including academic
test performance prediction. Following this line of research,
we propose a transfer learning framework where a model is
pre-trained using only student interaction data, and the
pre-trained model is then fine-tuned on academic test performance
prediction. In this paper, we consider the following interactive
features:


• eid: A unique ID assigned to an exercise solved by a
student. There are a total of 14419 exercises in the
dataset.


• part: Each exercise belongs to a specific part that represents
the type of the exercise. There are a total of 7
parts in the TOEIC.


• response: Since the TOEIC consists of multiple choice
exercises and there are four options for each exercise,
a student response for a given exercise is one of the
options, ‘a’, ‘b’, ‘c’, or ‘d’.


• correctness: Whether a student responded correctly
to a given exercise. Note that correctness is a coarse
version of response since correctness is processed by
comparing response with a correct answer for a given
exercise.


• elapsed_time: The amount of time a student spent on
solving a given exercise.


• timeliness: Whether a student responded to a given
exercise under the time limit. Note that timeliness is
a coarse version of elapsed_time since timeliness is processed
by comparing elapsed_time with the time limit
recommended by domain experts for a given exercise.


• exp_time: The amount of time a student spent on
studying an explanation for an exercise they had solved.


• inactive_time: The time interval between the current
and previous interactions.


Figure 2: The overall pre-training/fine-tuning process of DPA when each token in an interaction sequence is a set of eid, part,
and response, and response is a feature being masked. mask and cls are special tokens for mask and classification, respectively,
which are the same as the ones used in [7].


In our experiments, we normalize the values of elapsed_time, 
exp_time, and inactive_time so they are between 0 and 1 to 
stabilize the training process. 
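
The relationship between the raw and coarse features can be sketched as follows. This is an illustration under assumptions, not the authors' preprocessing code: the `Interaction` class, the `answer_key` and `time_limits` lookups, the inclusive comparison against the time limit, and the use of min-max scaling for the normalization are all hypothetical choices, since the text only states that values are scaled to lie between 0 and 1.

```python
from dataclasses import dataclass


@dataclass
class Interaction:
    eid: str              # exercise ID
    part: int             # TOEIC part, 1 to 7
    response: str         # one of 'a', 'b', 'c', 'd'
    elapsed_time: float   # seconds spent on the exercise


def correctness(interaction, answer_key):
    """Coarse version of response: compare it with the correct answer."""
    return interaction.response == answer_key[interaction.eid]


def timeliness(interaction, time_limits):
    """Coarse version of elapsed_time: compare it with the recommended limit."""
    return interaction.elapsed_time <= time_limits[interaction.eid]


def min_max_normalize(values):
    """Scale values to [0, 1]; assumed here, the paper does not name the method."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]
```

The sketch makes the information overlap explicit: given `eid` and `response`, `correctness` is a pure lookup, which is why the two features are not fed to the model together when one of them is masked.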


4. PROPOSED METHOD 


Figure 2 depicts our proposed method. There are two models
in DPA: a generator and a discriminator. In the pre-training
phase, given a sequence of interactions $I = [I_1, \dots, I_T]$,
where each interaction $I_t = \{f_t^1, \dots, f_t^k\}$ is a set of
interactive features $f_t^i$, such as eid, part, and response, a masked
interaction sequence $I^M = [I_1^M, \dots, I_T^M]$ is generated by
first randomly selecting a set of positions to mask $M =
\{M_1, \dots, M_m\}$ ($m < T$), and, for each masked position $M_i$,
masking out a fixed set of features $\{f_{M_i}^1, \dots, f_{M_i}^n\}$ ($n < k$).
For instance, in Figure 2, if the original interaction sequence
is [(e419, part4, b), (e23, part3, c), (e4324, part3, a), (e5233,
part1, a)] where each token in the sequence is a set of eid,
part, and response, a masked interaction sequence where
$M = \{2, 3\}$ and response as a masked feature is [(e419,
part4, b), (e23, part3, mask), (e4324, part3, mask), (e5233,
part1, a)]. Then, the generator takes the masked interaction
sequence $I^M$ as an input, and outputs predicted values
$O_{M_i}^{G,j}$ for the masked features $f_{M_i}^j$. After that, a replaced
interaction sequence $I^R = [I_1^R, \dots, I_T^R]$ is generated by
replacing the masked features $f_{M_i}^j$ with the generator’s
predictions $O_{M_i}^{G,j}$. In Figure 2, since the generator’s outputs
for the masked features are ‘b’ and ‘a’, a replaced interaction
sequence is [(e419, part4, b), (e23, part3, b), (e4324,
part3, a), (e5233, part1, a)]. Then, the discriminator takes
the replaced interaction sequence $I^R$ as an input, and predicts
whether each token in the sequence is the same as the
one in the original interaction sequence (original) or not (replaced).
After the pre-training, we throw away the generator
and fine-tune the pre-trained discriminator on academic test
performance prediction. We provide detailed explanations of
each component in the generator and the discriminator, and
training objective functions in the following subsections.
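
The masking and replacement steps can be traced on the running example from the text. The sketch below is illustrative only: the helper names are ours, and the generator is replaced by a fixed dictionary holding the outputs ‘b’ and ‘a’ quoted in the example.

```python
MASK = "mask"


def mask_sequence(sequence, positions):
    """Replace the response at the given 1-indexed positions with the mask token."""
    return [(eid, part, MASK if t in positions else resp)
            for t, (eid, part, resp) in enumerate(sequence, start=1)]


def replace_sequence(masked, generator_outputs):
    """Fill each masked response with the generator's output for that position."""
    return [(eid, part, generator_outputs[t] if resp == MASK else resp)
            for t, (eid, part, resp) in enumerate(masked, start=1)]


def discriminator_labels(original, replaced):
    """1 if the token equals the original token, 0 if it differs (replaced)."""
    return [int(o == r) for o, r in zip(original, replaced)]


original = [("e419", "part4", "b"), ("e23", "part3", "c"),
            ("e4324", "part3", "a"), ("e5233", "part1", "a")]
masked = mask_sequence(original, positions={2, 3})
replaced = replace_sequence(masked, generator_outputs={2: "b", 3: "a"})
labels = discriminator_labels(original, replaced)
```

Note that the third token was masked but the generator happened to reproduce the original response, so under the definition in the text it is labelled as original; only the second token is labelled as replaced.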


4.1 Interaction Embeddings 


The embedding layer produces a sequence of interaction embedding
vectors by mapping each interactive feature to an
appropriate embedding vector. We take two different approaches
to embed the interactive features depending on
whether they are categorical (eid, part, response, correctness,
and timeliness) or continuous (elapsed_time, exp_time, and
inactive_time) variables. If an interactive feature is a categorical
variable, we assign unique latent vectors to possible
values of the feature, including special values for mask (mask)
and classification (cls). Taking response as an example, there
is an embedding matrix $E_{\text{response}} \in \mathbb{R}^{6 \times d_{emb}}$ where each row
vector is assigned to one of ‘a’, ‘b’, ‘c’, ‘d’, mask, and cls.
If an interactive feature is a continuous variable, we assign
a single latent vector to the feature. Then, an embedding
vector for the feature is computed by multiplying the latent
vector and a value of the feature. For instance, we compute
an embedding vector for elapsed_time as $et \cdot E_{\text{elapsed\_time}}$,
where $et$ is a specific value and $E_{\text{elapsed\_time}} \in \mathbb{R}^{d_{emb}}$ is
a latent vector assigned to elapsed_time. Also, mask and
classification for the continuous interactive features are indicated
by setting their values to -1 and 0, respectively. In
addition to embeddings for interactive features, positional embeddings
are also incorporated into Transformer-based models
to consider the chronological order of each token. Rather
than using conventional positional embeddings, which store
an embedding vector for every possible position, we adopt
axial positional embeddings to further reduce memory
usage. The final interaction embedding vector of dimension
$d_{emb}$ for each time-step is the sum of all embedding vectors
in the time-step. The interaction embedding layer is shared
by both the generator and the discriminator.
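
The two embedding rules can be sketched in plain Python. This is a toy illustration under assumptions: the dimension, the random initialisation, and the restriction to one categorical feature (response) and one continuous feature (elapsed_time) are ours, and positional embeddings are omitted.

```python
import random

D_EMB = 4
rng = random.Random(0)


def random_vector():
    return [rng.uniform(-1.0, 1.0) for _ in range(D_EMB)]


# Categorical feature: one latent vector per value, including mask and cls.
RESPONSE_VALUES = ["a", "b", "c", "d", "mask", "cls"]
E_response = {value: random_vector() for value in RESPONSE_VALUES}  # 6 x d_emb

# Continuous feature: a single latent vector scaled by the feature value.
E_elapsed_time = random_vector()  # d_emb


def embed_interaction(response, elapsed_time):
    """Sum the categorical and continuous embeddings for one time-step.

    For the continuous feature, mask is encoded as -1 and cls as 0.
    """
    categorical = E_response[response]
    continuous = [elapsed_time * w for w in E_elapsed_time]
    return [c + t for c, t in zip(categorical, continuous)]
```

With this encoding a classification token contributes nothing through the continuous feature (its value is 0), while a masked continuous feature contributes the negated latent vector.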


4.2 Performer Encoder 

Since its successful debut in the Natural Language Processing
(NLP) community, Transformer’s attention mechanism has
become a common recipe adopted across different domains of
machine learning, including speech processing [18], computer
vision, and AIEd [22]. Compared to Recurrent
Neural Network (RNN) family models, Transformer’s
attention mechanism has the benefits of capturing longer-range
dependencies and allowing parallel training, which enables
the model to achieve better performance with less training
time. However, despite these advantages, the time and memory
complexities of computing the attention grow quadratically
with respect to input sequence length, requiring demanding
computing resources for training the model on long
sequences. For instance, if $L$ is the input sequence length and $d$
is the dimension of query, key, and value vectors, Transformer’s
attention is computed as follows:


Figure 3: The reversible layer in the Performer encoder is
composed of the FAVOR+-based multi-head attention layer
and the point-wise feed-forward layer.


$$\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\left(\frac{QK^\top}{\sqrt{d}}\right)V,$$


where $Q, K, V \in \mathbb{R}^{L \times d}$. The time and memory complexities
for computing $QK^\top$ in the above equation are $O(L^2 d)$ and
$O(L^2)$, respectively. Therefore, the cost for training Transformer
becomes prohibitive with large $L$, preventing training
the model even on a single GPU.
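
To make the quadratic cost concrete, standard softmax attention can be written in a few lines of plain Python. This is a didactic sketch of ordinary attention, not of Performer or FAVOR+; the function names are ours.

```python
import math


def softmax(row):
    """Numerically stable softmax over a list of scores."""
    m = max(row)
    exps = [math.exp(v - m) for v in row]
    total = sum(exps)
    return [e / total for e in exps]


def attention(Q, K, V):
    """Scaled dot-product attention for L x d matrices given as nested lists."""
    d = len(Q[0])
    # scores is an L x L matrix: building it costs O(L^2 d) time, O(L^2) memory.
    scores = [[sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d) for k in K]
              for q in Q]
    weights = [softmax(row) for row in scores]
    n_cols = len(V[0])
    return [[sum(w * v[c] for w, v in zip(row, V)) for c in range(n_cols)]
            for row in weights]
```

The intermediate `scores` matrix has one entry per pair of positions, which is the term that efficient attention variants aim to avoid materialising.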


The problem of improving the efficiency of Transformer’s attention
mechanism is a common concern of the machine learning
community. Recent studies have proposed several methods
to reduce the computing complexity below quadratic
with respect to input sequence length [5]. In this paper,
we adopt Performer since it uses reasonable memory and
makes a better trade-off between speed and performance [26].
Performer approximates attention kernels through the Fast
Attention Via positive Orthogonal Random features (FAVOR+)
approach. For more details about FAVOR+, please refer to [5].


With the efficient attention mechanism by FAVOR+, we
propose the Performer encoder, which is a stack of several
identical reversible layers described in Figure 3. The reversible
layer is based on the Reversible Transformer
architecture to further improve memory efficiency in backpropagation.
An input of the reversible layer $x \in \mathbb{R}^{L \times d_{hidden}}$
is first chunked to $x_1, x_2 \in \mathbb{R}^{L \times d_{hidden}/2}$. Then, scaled $l_2$
normalization (ScaleNorm) and the FAVOR+-based multi-head
attention layer (MultiHeadAttn) are applied to $x_2$, and
the result is added to $x_1$ to compute $y_1 \in \mathbb{R}^{L \times d_{hidden}/2}$:


$$y_1 = x_1 + \mathrm{MultiHeadAttn}(\mathrm{ScaleNorm}(x_2)).$$


After that, the scaled $l_2$ normalization and the point-wise
feed-forward layer (FeedForward) are applied to $y_1$, and the result
is added to $x_2$, computing $y_2 \in \mathbb{R}^{L \times d_{hidden}/2}$:


$$y_2 = x_2 + \mathrm{FeedForward}(\mathrm{ScaleNorm}(y_1)).$$


An output of the reversible layer $y \in \mathbb{R}^{L \times d_{hidden}}$ is a concatenation
of $y_1$ and $y_2$. We stack the reversible layer multiple
times to allow the final model to sufficiently represent the
underlying data distribution.
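
The memory saving of a reversible layer comes from the fact that its inputs can be recomputed exactly from its outputs, so intermediate activations need not be stored. The sketch below illustrates this property with arbitrary functions `f` and `g` standing in for MultiHeadAttn(ScaleNorm(·)) and FeedForward(ScaleNorm(·)); it is a generic illustration of the coupling, not the Performer implementation.

```python
def reversible_forward(x1, x2, f, g):
    """y1 = x1 + f(x2); y2 = x2 + g(y1)."""
    y1 = [a + b for a, b in zip(x1, f(x2))]
    y2 = [a + b for a, b in zip(x2, g(y1))]
    return y1, y2


def reversible_inverse(y1, y2, f, g):
    """Recover the inputs from the outputs by undoing the two additions."""
    x2 = [a - b for a, b in zip(y2, g(y1))]
    x1 = [a - b for a, b in zip(y1, f(x2))]
    return x1, x2
```

Because `reversible_inverse` needs only the layer outputs and the functions themselves, the backward pass can reconstruct activations layer by layer instead of caching them.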


4.3 Generator 

The generator computes hidden representations $[h_1^G, \dots, h_T^G]$
by feeding the masked interaction sequence $I^M$ to a series
of the interaction embedding layer (InterEmbedding), a
point-wise feed-forward layer (GenFeedForward1), the Performer
encoder (GenPerformerEncoder), and another point-wise
feed-forward layer (GenFeedForward2):


$$[I_1^{ME}, \dots, I_T^{ME}] = \mathrm{InterEmbedding}([I_1^M, \dots, I_T^M])$$
$$[h_1^{GF}, \dots, h_T^{GF}] = \mathrm{GenFeedForward1}([I_1^{ME}, \dots, I_T^{ME}])$$
$$[h_1^{GP}, \dots, h_T^{GP}] = \mathrm{GenPerformerEncoder}([h_1^{GF}, \dots, h_T^{GF}])$$
$$[h_1^{G}, \dots, h_T^{G}] = \mathrm{GenFeedForward2}([h_1^{GP}, \dots, h_T^{GP}]),$$


where $I_t^{ME}, h_t^G \in \mathbb{R}^{d_{emb}}$ and $h_t^{GF}, h_t^{GP} \in \mathbb{R}^{d_{gen\_hidden}}$. Then,
depending on whether the masked features are categorical
or continuous variables, generator outputs are computed differently.
If the masked features are categorical variables, the
outputs are sampled from a probability distribution defined
by the following softmax layer:


$$O_{M_i}^{G,j} \sim P_G(f_{M_i}^j \mid I^M) = \mathrm{softmax}(E_j h_{M_i}^G).$$


If the masked features are continuous variables, the outputs
are computed by the following sigmoid layer:


$$O_{M_i}^{G,j} = \mathrm{sigmoid}(E_j^\top h_{M_i}^G).$$


Similar to the case of categorical masked features, one can
sample the outputs from a probability distribution defined
by $I^M$ and parameters of the generator when the masked features
are continuous variables. For instance, the outputs can
be sampled from the Gaussian distribution where the mean
and the variance are determined by $I^M$ and the generator’s
parameters. However, we make the outputs deterministic
because sampling the outputs underperforms in our preliminary
experiments when the masked features are continuous
variables.


4.4 Discriminator

In pre-training, outputs of the discriminator $O^D = [O_1^D, \dots, O_T^D]$
are computed by applying a series of the interaction embedding
layer (InterEmbedding), a point-wise feed-forward
layer (DisFeedForward1), the Performer encoder (DisPerformerEncoder),
and another point-wise feed-forward layer
(DisFeedForward2) to the replaced interaction sequence $I^R$:


$$[I_1^{RE}, \dots, I_T^{RE}] = \mathrm{InterEmbedding}([I_1^R, \dots, I_T^R])$$
$$[h_1^{DF}, \dots, h_T^{DF}] = \mathrm{DisFeedForward1}([I_1^{RE}, \dots, I_T^{RE}])$$
$$[h_1^{DP}, \dots, h_T^{DP}] = \mathrm{DisPerformerEncoder}([h_1^{DF}, \dots, h_T^{DF}])$$
$$[O_1^{D}, \dots, O_T^{D}] = \mathrm{DisFeedForward2}([h_1^{DP}, \dots, h_T^{DP}]),$$


where $I_t^{RE} \in \mathbb{R}^{d_{emb}}$, $h_t^{DF}, h_t^{DP} \in \mathbb{R}^{d_{dis\_hidden}}$, $O_t^D \in \mathbb{R}$, and
the sigmoid is applied to the last layer of the discriminator.
After the pre-training, we slightly modify the discriminator
by replacing the last layer with a layer having appropriate
dimension for academic test performance prediction.


4.5 Training Objectives 

The objective for pre-training is to minimize the following
loss function:

$$\sum_{i=1}^{m} \sum_{j=1}^{n} \mathrm{GenLoss}(O_{M_i}^{G,j}, f_{M_i}^j) + \lambda \sum_{t=1}^{T} \mathrm{DisLoss}(O_t^D, \mathbb{1}(I_t^R = I_t)),$$

where GenLoss is the cross entropy (or mean squared error)
loss function if the masked features are categorical (or continuous)
variables, DisLoss is the binary cross entropy loss
function, and $\mathbb{1}$ is the indicator function. For ease of notation,
we omit an index for each input sample in the above
equation. If there is more than one masked feature in
each time-step ($n > 1$), the generator is trained under the
multi-task learning scheme. The objective for fine-tuning is
to minimize the mean squared error loss between the model’s
predictions and score labels.
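
For categorical masked features, the pre-training objective can be sketched as a sum of a generator cross-entropy term and a discriminator binary cross-entropy term. The code below is a hedged illustration: the function names are ours, the inputs are assumed to be probabilities that have already been computed, and the `dis_weight` parameter that scales the discriminator term is an assumption whose value is not given in the text.

```python
import math


def cross_entropy(probs, target_index):
    """Generator loss for one masked categorical feature."""
    return -math.log(probs[target_index])


def binary_cross_entropy(p, label):
    """Discriminator loss for one token; label is 1 (original) or 0 (replaced)."""
    return -(label * math.log(p) + (1 - label) * math.log(1 - p))


def pretraining_loss(gen_probs, gen_targets, dis_probs, dis_labels, dis_weight=1.0):
    """Sum generator losses over masked features and discriminator losses over tokens."""
    gen_loss = sum(cross_entropy(p, t) for p, t in zip(gen_probs, gen_targets))
    dis_loss = sum(binary_cross_entropy(p, y) for p, y in zip(dis_probs, dis_labels))
    return gen_loss + dis_weight * dis_loss
```

The generator term runs only over masked positions, whereas the discriminator term runs over every token in the sequence, which is one reason the discriminative objective extracts more training signal from each sequence.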


5. EXPERIMENTS 


5.1 Effects of Generator’s Pre-training Tasks 
There are multiple interactive features to be masked in each
token of the interaction sequence, which raises a question
of how to construct a set of masked interactive features,
and accordingly, which pre-training task for the generator is
the most effective for academic test performance prediction.
By default, all interactive features listed in Section 3 are
taken as inputs for both the generator and discriminator.
However, if response (or elapsed_time) is masked, correctness
(or timeliness) is excluded from the inputs and vice versa
since there is an overlap of information that the features
represent. For example, when both response and correctness
are taken as inputs, and correctness is masked, the generator
can predict the masked correctness by only looking at eid
and response without considering other interactions, which
leads to poor pre-training. The results are described in Table 1.


The best result was obtained under the pre-training task of
predicting response alone, which is slightly better than that
of predicting correctness, and both response and correctness.
Predicting correctness of student response is an important
task in AIEd, as can be seen from the large volume of studies
about Knowledge Tracing. It has also been shown empirically
that student response correctness is the most pedagogical
interactive feature for academic test performance prediction.
However, rather than pre-training a model to predict
whether a student correctly responded to a given exercise,
the pre-training task of predicting student response itself
injects more fine-grained information into the model, which
leads to more effective pre-training for academic test
performance prediction. Interestingly, worse
results were obtained when predicting elapsed_time or timeliness
in pre-training despite the benefits their information
brings to several AIEd tasks [10] [30] [22]. We hypothesize that
elapsed_time and timeliness may introduce irrelevant noise
and thus guide the model towards a direction inappropriate
for academic test performance prediction. In the case of
exp_time and inactive_time, we observed that the generator
failed to learn to predict their values when only given the
interactive features listed in Section 3, which leads to unstable
pre-training. From these observations, in the following
subsections, we conduct experimental studies based on the
pre-training task of predicting response alone.


Table 1: Comparison between different pre-training tasks.

Pre-training task                           MAE
response                                    50.65 ± 1.26
response + elapsed_time                     54.86 ± 1.64
response + timeliness                       52.91 ± 1.38
response + exp_time                         57.54 ± 1.47
response + inactive_time                    60.69 ± 1.74
correctness                                 51.36 ± 0.97
correctness + elapsed_time                  53.36 ± 1.43
correctness + timeliness                    52.60 ± 1.20
correctness + exp_time                      54.36 ± 1.62
correctness + inactive_time                 55.04 ± 1.58
response + correctness                      51.13 ± 1.60
response + correctness + elapsed_time       52.15 ± 1.43
response + correctness + timeliness         53.05 ± 1.81
response + correctness + exp_time           53.09 ± 1.25
response + correctness + inactive_time      56.41 ± 1.72


5.2 DPA vs. Baseline Methods 
We compare DPA with the following pre-training methods: 


• No pre-training: We train the fine-tuning models only
on the fine-tuning dataset.


• Autoencoding: Autoencoding (AE) is a generative
pre-training method widely used across different domains
of machine learning including AIEd [3]. Given an
unmasked interaction sequence, AE pre-trains a model
to reconstruct the input interaction sequence.


• Assessment Modeling: Assessment Modeling (AM)
is the previous state-of-the-art generative pre-training
method for academic test performance prediction. In
AM, a model takes a masked interaction sequence as
an input and is pre-trained to predict masked features.
AM is exactly the same as fine-tuning the pre-trained
generator in DPA.


Table 2: Comparison of DPA with baseline methods.

Pre-training method | Fine-tuning model   | MAE
No pre-training     | MLP                 | 82.89 ± 3.23
                    | BiLSTM              | 84.05 ± 2.06
                    | Transformer encoder | 107.06 ± 2.52
                    | Performer encoder   | 81.76 ± 1.24
AE                  | MLP                 | 79.46 ± 1.15
                    | BiLSTM              | 85.64 ± 1.89
                    | Transformer encoder | 75.13 ± 3.10
                    | Performer encoder   | 64.80 ± 1.43
AM                  | MLP                 | 77.17 ± 2.14
                    | BiLSTM              | 58.16 ± 1.28
                    | Transformer encoder | 57.16 ± 2.08
                    | Performer encoder   | 52.79 ± 1.39
DPA                 | MLP                 | 77.24 ± 1.59
                    | BiLSTM              | 57.59 ± 1.76
                    | Transformer encoder | 55.99 ± 1.62
                    | Performer encoder   | 50.65 ± 1.26


We also investigate whether DPA is effective with the following
fine-tuning models:


• MLP: A Multi-Layer Perceptron (MLP) is a stack of simple
fully-connected layers. Given an interaction sequence, the
interaction embedding vectors of all time-steps are summed
together to compute a fixed-dimensional vector, which is fed
to a series of fully-connected layers.


• BiLSTM: Bi-directional Long Short-Term Memory (BiLSTM)
is a model widely used for time series prediction tasks. A
global max pooling layer is applied on top of the BiLSTM
layer to obtain a fixed-dimensional intermediate representation
from an input sequence of varying length.


• Transformer Encoder: The Transformer Encoder is a series
of several identical layers, each composed of a multi-head
self-attention layer with the softmax attention kernel
and a point-wise feed-forward layer. We set the Transformer
encoder’s attention window size to 512 because training the
Transformer encoder with an attention window size of 1024
ran out of GPU memory on our single-GPU machine.
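

As a rough illustration of how the MLP and BiLSTM descriptions above turn a variable-length interaction sequence into a fixed-dimensional vector, the following sketch applies sum pooling and global max pooling to toy per-time-step vectors. It is not the authors' implementation: the function names and toy values are assumptions, and in the BiLSTM case the pooled vectors would be the BiLSTM layer outputs rather than raw embeddings.

```python
from typing import List

Vector = List[float]


def sum_pool(sequence: List[Vector]) -> Vector:
    """Sum per-time-step vectors element-wise (the MLP input described above)."""
    return [sum(values) for values in zip(*sequence)]


def max_pool(sequence: List[Vector]) -> Vector:
    """Element-wise maximum over time steps (global max pooling)."""
    return [max(values) for values in zip(*sequence)]


# Two toy sequences of different lengths; each step is a 3-dimensional vector.
short_seq = [[0.1, 0.5, -0.2], [0.4, -0.3, 0.9]]
long_seq = short_seq + [[-0.7, 0.2, 0.0], [0.3, 0.3, 0.3]]

for seq in (short_seq, long_seq):
    # The pooled vectors have the same dimension regardless of sequence length.
    print(len(seq), len(sum_pool(seq)), len(max_pool(seq)))
```

Both pooled vectors keep the embedding dimension whatever the sequence length, which is what allows a fixed-size prediction head to follow.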


As described in Table 2, transferring the pre-trained knowledge
brings better results in most cases, and the best result is
obtained from DPA. In particular, when the Performer encoder, the
best-performing fine-tuning model, is used, DPA reduces MAE by
4.05%, 21.84%, and 38.05% compared to AM, AE, and No pre-training,
respectively. Among the baseline pre-training methods excluding
No pre-training, the worst result is obtained from AE because the
pre-training task of AE is much easier than that of AM and DPA.
We observed that the loss curve of AE converged to near zero
within the first pre-training evaluation.
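

The percentages quoted above are relative MAE reductions. A minimal sketch of the arithmetic, using the Performer encoder rows of Table 2 (the helper name `relative_reduction` is an illustrative assumption, not something defined in the paper):

```python
def relative_reduction(baseline_mae: float, new_mae: float) -> float:
    """Percentage by which new_mae is lower than baseline_mae."""
    return (baseline_mae - new_mae) / baseline_mae * 100.0


# Mean MAE values from Table 2 with the Performer encoder.
performer_mae = {"No pre-training": 81.76, "AE": 64.80, "AM": 52.79, "DPA": 50.65}

for method in ("AM", "AE", "No pre-training"):
    pct = relative_reduction(performer_mae[method], performer_mae["DPA"])
    print(f"DPA vs {method}: {pct:.2f}% lower MAE")
```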


5.3 Robustness to Increased Label-scarcity

Since the motivation behind our proposal of DPA is the
label-scarcity problem, we investigate how MAE changes at varying
degrees of label-scarcity. Figure 4 and Table 3 describe the
results when using 1/2, 1/4, and 1/8 of the total number of
fine-tuning training samples. At all degrees of label-scarcity,
DPA consistently outperforms AM. Also, DPA fine-tuned on 1/2,
1/4, and 1/8 of the dataset outperforms AM fine-tuned on the
entire dataset, 1/2, and 1/4 of the dataset, respectively, which
shows that DPA is more robust to label-scarcity than AM. The gap
between No pre-training and the two pre-training methods widens
as labels become scarcer. Furthermore, both pre-training methods
fine-tuned on 1/8 of the dataset outperform No pre-training
fine-tuned on the entire dataset.


Figure 4: The black, blue, and red lines represent MAEs
for No pre-training, AM, and DPA, respectively, when the
number of fine-tuning training samples becomes 1/2, 1/4,
and 1/8 of the entire dataset.


Table 3: Comparison of DPA with AM and No pre-training
at varying degrees of label-scarcity.

N    | No pre-training | AM           | DPA
1/8  | 94.21 ± 8.40    | 60.22 ± 1.86 | 55.90 ± 1.97
1/4  | 89.01 ± 2.14    | 57.08 ± 1.75 | 53.46 ± 1.45
1/2  | 85.37 ± 1.15    | 54.29 ± 1.50 | 51.38 ± 1.16
Full | 81.76 ± 1.24    | 52.79 ± 1.39 | 50.65 ± 1.26


6. CONCLUSION 


In this paper, we proposed DPA, a transfer learning frame- 
work with discriminative pre-training tasks for academic 
performance prediction. Our experimental results showed 
the effectiveness of DPA for the label-scarce academic per- 
formance prediction task over the previous state-of-the-art 
generative pre-training method. Avenues of future research 
include investigating more effective pre-training tasks for 
academic performance prediction and pre-train/fine-tune
relations in AIEd.

ABSTRACT


Student modeling is useful in educational research and technology
development because of its ability to estimate latent student
attributes. Widely used approaches, such as the Additive Factors
Model (AFM), have shown satisfactory results, but they can only
handle binary outcomes, which may lose information. In this work,
we propose
a new partial credit modeling approach, PC-AFM, to sup- 
port multi-valued outcomes. We focus particularly on the 
amount of assistance, that is, the number of error feedback 
and hint messages, a student needs to get a problem step 
correct. Because errors and hint requests may not only de- 
rive from student ability, but also from non-cognitive fac- 
tors (e.g., students may game the system), we first test PC- 
AFM on synthetic data where this source of variation is not 
present. We confirm that PC-AFM is indeed better than 
AFM in recovering the true student and knowledge com- 
ponent (KC) parameters and even predicts student error 
rates better than a model fit to error rates. We then ap- 
ply the approach to six real-world datasets and find that 
PC-AFM outperforms AFM in reliable estimation of KC 
parameters and produces better generalization to new stu- 
dents, which requires better KC estimates. However, con- 
sistent with the hypothesis that student assistance behavior 
is driven by motivational or meta-cognitive factors beyond 
their ability, we found that PC-AFM was not better in reli- 
able estimation of student parameters nor in generalization 
across items, which requires accurate student estimates. We 
propose cross-measure cross-validation as a general method 
for comparing alternative measurement models for the same 
desired latent outcome.

1. INTRODUCTION


Student modeling has been an important tool that researchers 
can use to estimate latent student abilities. Similarly, in- 
telligent tutoring systems also depend on how accurately 
we can predict student mastery to deliver efficient adaptive
learning. Current popular approaches, such as the Additive
Factors Model (AFM) [4, 18, 13] and Bayesian Knowledge Tracing
(BKT) [5, 13], perform reasonably well by including growth
factors in their models. However, they are restricted to binary
student performance (e.g., correct/incorrect response), which can
lose information because of its dichotomized nature.


For example, many existing intelligent tutoring systems (ITS)
support step-by-step interactions [22], which usually allow
students to make multiple attempts or request hints until they
are able to complete the step correctly. These interactions are
important for an ITS because they allow the system to provide
immediate feedback or support an adaptive experience, while
collecting a rich interaction dataset on student actions.
However, since AFM and BKT can only handle binary outcomes, the
student data must be aggregated through a rollup procedure before
we can use it in student modeling. This means that only success
on students’ first attempt on each step is included in the data,
and the rest of the actions (e.g., other attempts or hint
requests) are ignored. To illustrate how this could be
problematic, imagine student A, who had one incorrect attempt on
a step before correctly completing it, and student B, who had
multiple incorrect attempts and asked for multiple hints on the
same step before getting it right. A dichotomous model like AFM
or BKT would treat both students as the same on this particular
step, but it is more likely that student A has demonstrated
better knowledge than student B.


In our case, we are concerned with having a raw measure
of student success at each assessment opportunity. There
are different functions for producing or deriving an outcome
measure from the data available in a tutoring system. Perhaps
the most typical function is: first transaction correct = 1;
otherwise = 0, where hints and incorrect responses are both
counted as a failure. While there are multiple ways to elicit
polytomous outcomes from ITS student data, in this work we focus
on an assistance score, which is the total number of incorrect
attempts and hint requests combined for each step. From our
preliminary analysis, we found that there are correlations
between assistance scores and AFM’s predicted error rate, which
suggests that there could be extra information in assistance
scores compared to a binary correctness outcome.


In this work, we are interested in whether an assistance score
model could be a better predictor of a student’s change in
performance than a dichotomous model like AFM. In particular,
our research questions are: (1) How can we develop an effective
statistical measurement model that uses assistance scores? and
(2) How do we compare two different response models?


A popular approach to comparing different cognitive models
in Educational Data Mining is to use goodness-of-fit measures
(e.g., the Bayesian Information Criterion), but this is not
applicable in our scenario because our models are based on
different outcomes (correctness vs. assistance score).
Alternative versions of predictor variables can be contrasted
through cross-validation, but this becomes inadequate when the
outcome variables are different. We therefore also discuss a set
of strategies for addressing the general problem of how to
compare alternative measurement models for the same desired
latent outcome, and in particular how to compare a binary
correctness model with a polytomous Assistance Score model.


We propose a new cognitive modeling approach to support
polytomous outcomes and demonstrate its ability to recover
parameters and predict student error rates better than AFM on
synthetic data. We then evaluate our model on six real-world
datasets spanning five different domains from the DataShop
repository [10]. We found that our model outperforms AFM in most
Student-blocked CVs and in estimating KC parameters, but it falls
short at estimating student intercepts. We hypothesize that our
model struggles to estimate student parameters in the real-world
datasets because of variance in students’ help-seeking behavior,
such as gaming the system, which adds variance to Assistance
Scores above and beyond the variance associated with student
ability.


2. RELATED WORK 

2.1 Item Response Theory with Partial Credit
Item Response Theory (IRT) models [6] are the preferred
method used in several state assessments in the United States
and in international assessments [8]. The goal of an IRT model
is to estimate the latent construct (e.g., student ability) and
item characteristics (e.g., item difficulty) based only on a
collection of responses.


The simplest variation of IRT is the Rasch model (1PL
model) [19], which is characterized by a single parameter
representing item difficulty ($d_j$) and a single parameter
representing student ability ($a_i$). As Eq. 1 is equivalent to a
logistic function, the Rasch model is essentially a logistic
regression model.


$$P(r = 1) = \frac{1}{1 + e^{-(a_i - d_j)}} \qquad (1)$$
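

To make Eq. 1 concrete, the sketch below evaluates the Rasch probability for a few ability and difficulty values. The function name and the example values are illustrative assumptions.

```python
import math


def rasch_probability(ability: float, difficulty: float) -> float:
    """P(r = 1) under the Rasch (1PL) model: logistic of (ability - difficulty)."""
    return 1.0 / (1.0 + math.exp(-(ability - difficulty)))


# A student whose ability equals the item difficulty succeeds half the time;
# higher ability (or an easier item) raises the probability.
for ability, difficulty in [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0)]:
    print(ability, difficulty, round(rasch_probability(ability, difficulty), 3))
```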


Other variations increase the complexity by introducing extra
parameters. For example, the 2PL model adds a discrimination
parameter for each item that controls the slope of the logistic
function, and the 3PL model also includes a pseudo-guessing
parameter for each item. Even though these models are
characterized by different numbers of parameters, they are all
based on dichotomous response data (e.g., correctness). There is
another class of IRT models that can be applied to polytomous
outcomes, where each response can take one of several values
[17, 21]. Likert-scale responses are an example of data to which
this class of models applies. There are different variations of
polytomous IRT models, such as the Partial Credit Model (PCM)
[14], the Generalized Partial Credit Model (GPCM) [15], and the
Graded Response Model (GRM) [20]. These polytomous models
generalize the dichotomous IRT models and reduce to them when
there are only two response categories. Our model extends the
polytomous model to include a growth factor by applying an
approach similar to PCM to AFM.


2.2 Knowledge Tracing Approaches 

Intelligent tutoring systems (ITS) have been shown to be 
effective in improving student learning outcomes across dif- 
ferent domains [2, 9], and mastery learning strategies have 
been an important component in these systems. To im- 
plement mastery learning, knowledge tracing techniques are 
regularly utilized by ITSs [7] to adaptively assess students’ 
knowledge states, which is used to decide when students have 
mastered skills and are ready to move on to other skills. 


In many existing ITSs, such as Cognitive Tutor Authoring
Tools (CTAT) [1], students are given a number of practice
opportunities for each skill, and students are usually allowed
to make multiple attempts or request hints until they are able
to successfully complete the step on each practice opportunity.
The goal of a knowledge tracing algorithm when used for mastery
learning is to determine when to stop giving students practice
opportunities for the given skill.


Knowledge tracing is often performed by a statistical model 
of student learning that could be fit to data. There are 
two popular families of methods [12]: Bayesian Knowledge 
Tracing (BKT) [5, 13] and Additive Factors Model (AFM) 
[4, 18, 13]. Both methods include growth factors in order to 
estimate students’ performance as it is changing with learn- 
ing. BKT models student knowledge as a latent variable 
in a Hidden Markov Model. AFM is an extension of the 
IRT model that includes learning opportunity counts in the 
model. Even though these methods have been proven to 
work well in many scenarios, they are based on the binary 
error measurement model (correct or incorrect) and thus do 
not make use of potential added information from the num- 
ber of error and hint messages a student may receive. Our 
approach explores this opportunity by extending AFM to 
use such multi-valued or polytomous outcomes in hopes of 
better estimating student knowledge. While other variations 
on AFM, such as Performance Factor Analysis (PFA) [18] 
and individualized AFM (iAFM) [13], have been shown in 
some cases to produce better prediction fit than AFM, we 
chose to use AFM to simplify the contrast between binary 
and polytomous measurement models and with the goal of 
producing more parsimonious and interpretable parameter 
estimates. Future work can explore alternatives.

2.3 DataShop Data Features


In this work, we use a variety of real world datasets across 
different domains from the DataShop repository [10]. Learn- 
Lab’s DataShop (http://learnlab.org/datashop) is an open 
data repository of educational data with associated visual- 
ization and analysis tools, which has data from thousands of 
students derived from interactions with on-line course ma- 
terials and intelligent tutoring systems. 


In DataShop terminology, Knowledge Components (KCs)
are used to represent pieces of knowledge, concepts, or skills
that students need to solve problems [11]. When a specific
set of KCs is mapped to a set of instructional tasks (usually
steps in problems), they form a KC Model, which is a specific
kind of student model.


Each dataset in DataShop consists of a set of student
transactions, which is a collection of students’ interactions
with ITSs. The collected student actions include (but are not
limited to) correct attempts, incorrect attempts, and hint
requests. The transactions that belong to the same practice
opportunity are aggregated into a single student step through
the rollup procedure. The correctness of the step depends on the
result of the student’s first response for the practice
opportunity, and the total number of incorrect attempts and hint
requests is reported as the Assistance Score of the step. Most
existing knowledge tracing algorithms use student steps, rather
than transactions, in their models.
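

The rollup described above can be sketched as follows. This is a simplified illustration, not DataShop's actual export code: the function name and the outcome labels "CORRECT", "INCORRECT", and "HINT" are assumptions made for the example.

```python
from typing import List, Tuple


def roll_up(outcomes: List[str]) -> Tuple[int, int]:
    """Aggregate one practice opportunity's transactions into a student step.

    `outcomes` is the ordered list of transaction outcomes for the step.
    Returns (first_attempt_correct, assistance_score).
    """
    if not outcomes:
        raise ValueError("a step needs at least one transaction")
    # Step correctness depends only on the first response.
    first_attempt_correct = 1 if outcomes[0] == "CORRECT" else 0
    # Assistance Score = incorrect attempts + hint requests.
    assistance_score = sum(1 for o in outcomes if o in ("INCORRECT", "HINT"))
    return first_attempt_correct, assistance_score


# Students A and B from the introduction: same binary outcome,
# different Assistance Scores.
student_a = ["INCORRECT", "CORRECT"]
student_b = ["INCORRECT", "HINT", "HINT", "INCORRECT", "CORRECT"]
print(roll_up(student_a), roll_up(student_b))
```

Both students receive the same binary outcome (first attempt incorrect), while their Assistance Scores differ, which is the information a dichotomous model discards.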


3. METHOD 

The Additive Factors Model (AFM) [4] is a logistic regression
that extends Item Response Theory by incorporating a growth or
learning term. The model gives the probability $p_{ij}$ that a
student $i$ will get a problem step $j$ correct based on the
student’s baseline ability ($\theta_i$), the baseline difficulty
of the related KCs on the problem step ($\beta_k$), and the
learning rate of the KCs ($\gamma_k$). The learning rate
represents the improvement on a KC with each additional practice
opportunity, so it is multiplied by the number of practice
opportunities ($T_{ik}$) that the student already had on the KC.


$$\log\left(\frac{p_{ij}}{1 - p_{ij}}\right) = \theta_i + \sum_k \left(q_{jk}\beta_k + q_{jk}\gamma_k T_{ik}\right) \qquad (2)$$

Here $q_{jk}$ indicates whether step $j$ involves KC $k$.
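

As a hedged illustration of Eq. 2, the sketch below computes the AFM success probability for one student step. The function signature, KC names, and parameter values are assumptions for the example; `step_kcs` lists the KCs with $q_{jk} = 1$.

```python
import math
from typing import Dict, List


def afm_probability(theta: float, step_kcs: List[str],
                    beta: Dict[str, float], gamma: Dict[str, float],
                    opportunities: Dict[str, int]) -> float:
    """Probability of a correct first attempt under AFM (Eq. 2)."""
    # Only KCs attached to the step (q_jk = 1) contribute to the sum.
    logit = theta + sum(beta[k] + gamma[k] * opportunities.get(k, 0)
                        for k in step_kcs)
    return 1.0 / (1.0 + math.exp(-logit))


# Hypothetical parameters: one KC with a positive learning rate.
beta = {"kc_add": -0.5}
gamma = {"kc_add": 0.2}
for prior_opportunities in (0, 3, 6):
    p = afm_probability(0.1, ["kc_add"], beta, gamma,
                        {"kc_add": prior_opportunities})
    print(prior_opportunities, round(p, 3))
```

With a positive learning rate, the predicted probability of success rises with the number of prior practice opportunities.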


Our extension of AFM to support a polytomous outcome 
measure, like Assistance Score, is inspired by the Partial 
Credit Model (PCM) [14], which is an adjacent-categories 
logit model [21]. The model was designed to work with or- 
dered polytomous response categories with a specific order 
or ranking of responses, which is the case for Assistance 
Score. It is widely applied in aptitude testing to allow for 
partial credit for near correctness of a response. In adjacent- 
categories logit models, we model the odds of a higher cat- 
egory relative to the adjacent lower one, and this paired 
comparison creates the ordering of the categories. 


Assistance Score can be interpreted in the partial credit
framework as follows. A student who gets a problem step correct
on their first try or after fewer errors or hint requests is
more likely to have the associated competence than a student who
makes many errors or requests multiple hints before getting the
step correct. Thus, students making no errors and needing no
hints get full credit (Assistance Score = 0), and students with
errors and/or hint requests get partial credit in rough
proportion to the number of hints and errors.


The Partial Credit Additive Factors Model (PC-AFM) builds
upon these two statistical models, AFM and PCM. For a student
$i$ and a step $j$, there is a set of probabilities
$P_{ij} = \{p_{ija};\ a = 0, 1, \ldots, A\}$ describing the
chance for student $i$ to get Assistance Score $a$ on step $j$,
where $A$ is the maximum Assistance Score. In this work, we cap
the Assistance Score at 5 because values above this are rare and
tend not to be meaningful, while extreme outliers (e.g., an
Assistance Score over 20 or even 140) would significantly bias
the model. 98% of our data have an Assistance Score of 5 or
less. We extend AFM to a multivariate generalized linear mixed
model, in which the link function of logistic regression takes a
vector-valued form.


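$$f_{\mathrm{link}}(P_{ij}) = \begin{pmatrix} f_{\mathrm{link},1}(P_{ij}) \\ \vdots \\ f_{\mathrm{link},A}(P_{ij}) \end{pmatrix} = \begin{pmatrix} \log\left(\frac{p_{ij1}}{p_{ij0}}\right) \\ \vdots \\ \log\left(\frac{p_{ijA}}{p_{ij(A-1)}}\right) \end{pmatrix} \qquad (3)$$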


Note that $f_{\mathrm{link},0}$ is not included because only $A$
of the $A + 1$ probabilities are non-redundant. PC-AFM uses
adjacent-categories logits as a link function based on PCM. The
$a$th adjacent-categories logit is the logit of getting an
Assistance Score $a$ versus $a - 1$. Each link function is an
extended version of AFM’s linear model (Eq. 2) with a level
parameter ($\alpha_a$), which represents the difficulty to
improve from an Assistance Score $a$ to $a - 1$.


$$f_{\mathrm{link},a}(P_{ij}) = \theta_i + \alpha_a + \sum_k \left(q_{jk}\beta_k + q_{jk}\gamma_k T_{ik}\right) \qquad (4)$$


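Inverting this function gives an expression for the
probabilities of student $i$ completing a problem step $j$ with
each of the possible Assistance Scores $a$.


$$p_{ija} = \begin{cases} \dfrac{1}{Z_{ij}} & \text{if } a = 0 \\[2ex] \dfrac{1}{Z_{ij}} \prod_{l=1}^{a} \exp\left(f_{\mathrm{link},l}(P_{ij})\right) & \text{otherwise} \end{cases} \qquad (5)$$

where $Z_{ij} = 1 + \sum_{a=1}^{A} \prod_{l=1}^{a} \exp\left(f_{\mathrm{link},l}(P_{ij})\right)$
normalizes the probabilities so that they sum to 1.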
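

The inversion step can be sketched numerically as follows, assuming the adjacent-categories logits are defined as the log-odds of Assistance Score $a$ versus $a - 1$, as stated above. The function names and all parameter values (including the level parameters) are hypothetical, and the sketch makes no claim about how the authors fit the model.

```python
import math
from typing import Dict, List


def adjacent_logits(theta: float, alphas: List[float], step_kcs: List[str],
                    beta: Dict[str, float], gamma: Dict[str, float],
                    opportunities: Dict[str, int]) -> List[float]:
    """One linear predictor per Assistance Score a = 1..A (Eq. 4 as written)."""
    shared = theta + sum(beta[k] + gamma[k] * opportunities.get(k, 0)
                         for k in step_kcs)
    return [shared + alpha for alpha in alphas]


def score_probabilities(logits: List[float]) -> List[float]:
    """Invert adjacent-categories logits f_1..f_A into p_0..p_A."""
    # log(p_a / p_0) is the running sum f_1 + ... + f_a (0 for a = 0).
    log_ratios = [0.0]
    for f in logits:
        log_ratios.append(log_ratios[-1] + f)
    shift = max(log_ratios)  # subtract the max for numerical stability
    weights = [math.exp(v - shift) for v in log_ratios]
    total = sum(weights)
    return [w / total for w in weights]


# Hypothetical parameter values, chosen only for illustration (A = 5).
logits = adjacent_logits(theta=0.3, alphas=[-0.5, -1.0, -1.5, -2.0, -2.5],
                         step_kcs=["kc1"], beta={"kc1": -0.2},
                         gamma={"kc1": 0.1}, opportunities={"kc1": 2})
probs = score_probabilities(logits)
print([round(p, 3) for p in probs], round(sum(probs), 6))
```

By construction, the log-ratio of each pair of adjacent probabilities equals the corresponding logit, and the A + 1 probabilities sum to 1.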


4. EXPERIMENT 

We conduct experiments on both synthetic data and real 
student data to evaluate the performance of PC-AFM. We 
used the synthetic data to validate PC-AFM’s parameter re- 
covery capability and examine our evaluation strategy in a 
synthetic environment in which Assistance Score is stochas- 
tically derived from student ability alone. In particular, As- 
sistance Scores in the synthetic data are not confounded by 
other student variations, such as their motivational state. 
We hypothesized that PC-AFM would work less effectively 
with the real student data because of non-ability effects on 
Assistance Score, such as students’ help seeking strategies 
or propensity to game the system. 


While goodness-of-fit metrics, such as BIC, are widely used
to compare different cognitive models [16], such as knowledge
tracing algorithms, they are not applicable in our case because
AFM and PC-AFM use different outcome measures. The challenge is
how to compare models that are based on different outcomes
(error rate vs. Assistance Score) while targeting the same
desired latent measure (e.g., student’s ability).


We explore two strategies to tackle this comparison problem.
The first is to use parameter estimate reliability in split-half
comparisons. Since AFM and PC-AFM share the majority of their
parameters (student intercepts, KC intercepts, and KC slopes),
we can compare their parameter recovery capability. However,
unlike in synthetic data, the true parameters are not known in
real data, so we use the reliability of parameter estimates in
split-half comparisons instead. The second strategy is to
compare cross-measure predictions. The assumption is that if a
model based on polytomous outcomes (Assistance Score) yields
better accuracy than a model based on binary outcomes (error
rate) in predicting both polytomous and binary outcomes, the
polytomous model is demonstrated to be a better measurement
model. This strategy is applicable in our scenario because the
two outcomes are connected. Since a student step is considered
correct only when there is no assistance, the error rate can be
derived from the probability of Assistance Score = 0 (as one
minus that probability). Conversely, we can convert an error
rate to a probability of an Assistance Score: given an error
rate $p$, the probability of having an Assistance Score $a$ is
$(1 - p)p^a$. We can then use CVs on both measures to compare
the models.
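

The two cross-measure conversions described above can be sketched as follows. The function names and the example numbers are illustrative assumptions.

```python
from typing import List


def error_rate_from_score_probs(score_probs: List[float]) -> float:
    """Polytomous -> binary: a step is an error unless Assistance Score = 0."""
    return 1.0 - score_probs[0]


def score_prob_from_error_rate(error_rate: float, score: int) -> float:
    """Binary -> polytomous: probability (1 - p) * p**a of Assistance Score a."""
    return (1.0 - error_rate) * error_rate ** score


# Hypothetical predicted distribution over Assistance Scores 0..5.
predicted_scores = [0.60, 0.20, 0.10, 0.05, 0.03, 0.02]
print(round(error_rate_from_score_probs(predicted_scores), 3))

# Hypothetical predicted error rate converted to Assistance Score probabilities.
print([round(score_prob_from_error_rate(0.4, a), 4) for a in range(6)])
```

Note that the converted probabilities over scores 0 to 5 sum to slightly less than 1, because the expression $(1 - p)p^a$ also assigns mass to scores above the cap.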


4.1 Experiment 1: Synthetic Data 

In order to validate PC-AFM’s capability to recover student
and KC parameters, we synthetically generate datasets of
student steps based on a logistic regression model. Given a
set of student and KC parameters together with an opportunity
count, a distribution over Assistance Scores is determined. We
then sample once from the distribution to generate the
Assistance Score of that student step. We generated 6 datasets
with varying numbers of students and KCs, for which the true
student and KC parameters are known, to examine the parameter
recovery capacity of PC-AFM in comparison to AFM. In each
generated dataset, student intercepts range from -2 to 2, KC
intercepts range from -1 to 1, and KC slopes range from 0 to
0.5. The number of KCs ranges from 8 to 32, and the number of
students ranges from 25 to 200.


Table 1: Correlation between true and estimated parameters
in synthetic data.

Dataset  | Stu Intercept | KC Intercept  | KC Slope
         | PC     AFM    | PC     AFM    | PC     AFM
KC8_S25  | 0.932  0.828  | 0.990  0.895  | 0.912  0.498
KC8_S50  | 0.963  0.906  | 0.998  0.931  | 0.972  0.945
KC8_S100 | 0.980  0.941  | 0.998  0.850  | 0.969  0.888
KC8_S200 | 0.871  0.790  | 0.999  0.955  | 0.910  0.894
KC16_S50 | 0.947  0.857  | 0.997  0.947  | 0.927  0.843
KC32_S50 | 0.967  0.942  | 1.000  0.883  | 0.997  -0.345


Table 2: Correlation between split-halves parameters in
synthetic data.

Dataset  | Stu Intercept | KC Intercept  | KC Slope
         | PC     AFM    | PC     AFM    | PC     AFM
KC8_S25  | 0.978  0.954  | 0.996  0.802  | 0.914  0.675
KC8_S50  | 0.973  0.936  | 0.998  0.985  | 0.972  0.964
KC8_S100 | 0.973  0.931  | 1.000  0.984  | 0.952  0.909
KC8_S200 | 0.975  0.936  | 1.000  0.979  | 0.975  0.735
KC16_S50 | 0.990  0.977  | 0.998  0.780  | 0.962  0.933
KC32_S50 | 0.996  0.988  | 0.995  0.799  | 0.929  0.543


We also evaluate both models with three types of
cross-validation (CV): Random (data points are split randomly),
Student-blocked (data points are split by student), and
Item-blocked (data points are split by item). These evaluations
test whether our model, trained on Assistance Score, can
outperform a dichotomous model trained on error rate in
predicting dichotomous outcomes.
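

The three cross-validation schemes differ only in how rows are assigned to folds. A minimal sketch under assumed names and a toy row layout (student id, item id, Assistance Score); it is not the authors' evaluation code:

```python
import random
from typing import List, Optional, Tuple

Row = Tuple[str, str, int]  # (student_id, item_id, assistance_score)
STUDENT, ITEM = 0, 1        # column to block on


def assign_folds(rows: List[Row], n_folds: int,
                 block_by: Optional[int] = None, seed: int = 0) -> List[int]:
    """Return a fold index per row.

    block_by=None    -> Random CV (rows are shuffled individually)
    block_by=STUDENT -> Student-blocked CV (a student's rows share a fold)
    block_by=ITEM    -> Item-blocked CV (an item's rows share a fold)
    """
    rng = random.Random(seed)
    if block_by is None:
        order = list(range(len(rows)))
        rng.shuffle(order)
        folds = [0] * len(rows)
        for position, row_index in enumerate(order):
            folds[row_index] = position % n_folds
        return folds
    groups = sorted({row[block_by] for row in rows})
    rng.shuffle(groups)
    fold_of_group = {g: idx % n_folds for idx, g in enumerate(groups)}
    return [fold_of_group[row[block_by]] for row in rows]


# Toy data: 6 students x 4 items with made-up Assistance Scores.
rows = [(f"s{s}", f"i{i}", (s + i) % 3) for s in range(6) for i in range(4)]
for name, key in (("random", None), ("student", STUDENT), ("item", ITEM)):
    print(name, assign_folds(rows, n_folds=3, block_by=key))
```

In the Student-blocked case every test fold contains only unseen students, so predictions must rely on the KC parameters; in the Item-blocked case they must rely on the student parameters.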


We report on results for each of six different synthetic datasets 
by comparing PC-AFM and AFM. We found that PC-AFM 
better recovers the true student and KC parameters than 
AFM in almost all comparisons using correlation (Table 1). 
All contrasts are the same using mean absolute error. As 
the number of students goes up, both models tend to better 
recover the true parameters. The correlations of parameters 
in split-half comparison are reported in Table 2, which show 
a similar pattern to the correlation between estimated and 
true parameters. This demonstrates that the parameter cor- 
relation in split-half comparisons, which can be computed in 
real data, is a reasonable proxy for true parameter recovery, 
which cannot be computed in real data. 
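

A split-half reliability check of this kind reduces to correlating the estimates obtained from two halves of the data. The sketch below assumes the two sets of estimates are already available (model fitting is out of scope here); the function names and the KC intercept values are hypothetical.

```python
import math
from typing import Dict, List


def pearson(xs: List[float], ys: List[float]) -> float:
    """Pearson correlation; undefined if either list has zero variance."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two equally long lists with at least 2 values")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


def split_half_correlation(half_a: Dict[str, float],
                           half_b: Dict[str, float]) -> float:
    """Correlate estimates of the parameters present in both halves."""
    shared = sorted(set(half_a) & set(half_b))
    return pearson([half_a[k] for k in shared], [half_b[k] for k in shared])


# Hypothetical KC intercepts estimated on two disjoint halves of the students.
kc_intercepts_half_a = {"kc1": -0.8, "kc2": 0.1, "kc3": 0.6, "kc4": 0.9}
kc_intercepts_half_b = {"kc1": -0.7, "kc2": 0.0, "kc3": 0.7, "kc4": 0.8}
print(round(split_half_correlation(kc_intercepts_half_a, kc_intercepts_half_b), 3))
```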


Figure 1 illustrates better true parameter recovery using
Assistance Score and PC-AFM than using error rate and AFM.
PC-AFM parameter estimates (red x’s) are generally accurate
across the spectrum of known parameter values (x-axis), as can
be seen by their closeness to the line, which is the identity
function (intercept of 0, slope of 1). AFM estimates (blue dots)
are generally biased toward the extremes. For student intercepts
(Figure 1a), low prior knowledge students are estimated by error
rate/AFM to be worse than they are, and high prior knowledge
students are estimated to be better than they are. For KC
intercepts (Figure 1b), hard KCs (on the left) are estimated by
error rate/AFM to be even harder than they are. For hard KCs,
most responses are errors, yielding quite low estimates by error
rate/AFM. However, these same steps show more variance in
Assistance Score/PC-AFM, as somewhat better students and higher
opportunity counts will produce lower but non-zero Assistance
Scores (i.e., with no change in error rate).


In error rate CV results, except Item-blocked CV where 
both models perform similarly, PC-AFM outperforms AFM 
in all other CVs (Table 4). Recall that these CV evalua- 
tions require PC-AFM, while fit to Assistance Score (poly- 
tomous outcome), to predict error rate (dichotomous out- 
come). When we turn the tables and compare methods on 
predicting Assistance Score, we find a similar pattern where 
PC-AFM yields better accuracy in most CVs (Table 3). 


4.2 Experiment 2: Real student data 
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Figure 1: Using Assistance Score and PC-AFM on synthetic data produces better estimates of the true parameters, for all three 
of student intercepts, KC intercepts, and KC slopes than does using error rate and AFM. 


Table 3: Cross-validation results (RSME) in synthetic data 


Table 5: Real Student Dataset. 


predicting Assistance Score in the test set by estimating pa- Dataset Domain #Stu | #Item | #KC 
rameters based on Assistance Score (PC-AFM) or on Error ds308 College Statistics 52 113 9 
Rate (AFM) in the training set. ds313 English articles 120 85 26 
Dataset Random Stu-Blocked = Item-Blocked ds372 English articles 99 84 15 
PC AFM PC AFM PC AFM ds388 | Middle School math | 318 64 64 
KC8_S25 0.546 0.598 0.542 0.600 0.586 0.634 ds392 Geometry 123 2035 43 
KC8_S50 0.544 «(0.599 0.541 0.601 0.575 0.610 ds394 English articles 97 180 13 
KC8_S100 0.536 0.596 0.532 0.599 0.550 0.602 
KC8_S200 0.541 0.597 0.537 0.600 0.541 0.597 
KC16_S50 0.540 0.600 0.537 0.601 0.566 0.604 rate CVs in most datasets, which suggests that PC-AFM can 
KC32_S50 0.540 0.587 0.539 0.590 0.579 0.626 achieve better estimates of KC parameters. To validate the 


In the second experiment, we examine PC-AFM across a 
variety of real world datasets. We used 6 datasets across 
different domains (statistics, English articles, algebra, and 
geometry) from the DataShop repository. Table 5 shows 
the number of students, items, KCs, total transactions for 
each dataset. For each dataset, we use the KC model that 
achieves the best BIC reported on the DataShop repository. 
All KC models coded a single KC per step. The number of 
KCs ranges from 9 to 64, and the number of students ranges 
from 52 to 318. 


For each dataset, we evaluated both PC-AFM and AFM on 
5 independent runs of 3-fold CVs of each type predicting 
both Assistance Score and error rate. We report the result 
of Assistance Score CVs in Table 6 and the results of error 
rate CVs in Table 7. We found that PC-AFM outperforms 
AFM in Student-blocked in both Assistance Score and error 


Table 4: Cross-validation results (RSME) in synthetic data 
predicting Error Rate in the test set by estimating parame- 
ters based on Assistance Score (PC-AFM) or on Error Rate 
(AFM) in the training set. 


hypothesis, we investigated split-halves parameters correla- 
tion of both models. We splitted the datasets on students to 
evaluate KC slopes and intercepts correlation, and we split- 
ted the datasets on KCs to evaluate students’ intercepts (Ta- 
ble 8). On average, PC-AFM yields better correlations of 
both KC intercepts (0.954 vs 0.946) and KC slopes (0.600 vs 
0.563), but correlations of student intercepts is significantly 
higher for AFM (0.784 vs 0.495). 


5. DISCUSSION 


Assistance score should, in principle, improve model param- 
eter estimates and predictions based on them. A student 
who gets a step correct after just one error or one hint (As- 
sistance Score = 1) is likely to be closer to full acquisition 
of a KC than a student who makes an error and requests 3 
hints (Assistance Score = 4). However, the error rate metric 
commonly used with BKT and AFM treats these the same, 
since the student was not correct on their first attempt at 
the step without a hint. Thus, there is potentially extra in- 


Table 6: Cross-validation results (RMSE) in real data
predicting Assistance Score in the test set by estimating param-
eters based on Assistance Score (PC-AFM) or on Error Rate 
(AFM) in the training set. 


Table 4 data (synthetic datasets; PC = PC-AFM):

Dataset    Random         Stu-Blocked    Item-Blocked
           PC     AFM     PC     AFM     PC     AFM
KC8_S25    0.275  0.278   0.310  0.306   0.370  0.430
KC8_S50    0.273  0.280   0.282  0.304   0.356  0.297
KC8_S100   0.273  0.277   0.283  0.300   0.387  0.449
KC8_S200   0.271  0.275   0.278  0.295   0.278  0.282
KC16_S50   0.277  0.281   0.278  0.311   0.301  0.294
KC32_S50   0.287  0.291   0.292  0.320   0.358  0.347

Table 6 data (real datasets; PC = PC-AFM):

Dataset    Random         Stu-Blocked    Item-Blocked
           PC     AFM     PC     AFM     PC     AFM
ds308      0.376  0.376   0.381  0.378   0.384  0.388
ds313      0.541  0.528   0.551  0.554   0.549  0.555
ds372      0.478  0.463   0.480  0.481   0.484  0.487
ds388      0.672  0.649   0.682  0.703   0.702  0.703
ds392      0.385  0.354   0.386  0.387   0.385  0.390
ds394      0.499  0.486   0.499  0.499   0.504  0.510

Table 7: Cross-validation results (RMSE) in real data predicting
Error Rate in the test set by estimating parameters based
on Assistance Score (PC-AFM) or on Error Rate (AFM) in
the training set.

Dataset    Random         Stu-Blocked    Item-Blocked
           PC     AFM     PC     AFM     PC     AFM
ds308      0.336  0.326   0.332  0.328   0.341  0.339
ds313      0.417  0.408   0.413  0.440   0.435  0.424
ds372      0.379  0.377   0.383  0.402   0.388  0.387
ds388      0.454  0.421   0.439  0.470   0.501  0.456
ds392      0.324  0.324   0.325  0.333   0.325  0.325
ds394      0.395  0.391   0.388  0.418   0.403  0.403


formation about students’ level of knowledge acquisition in 
the Assistance Score not present in error rate. On the other 
hand, prior research, for example on gaming the system [3], 
suggests there are other reasons students may produce re- 
peated incorrect entries or hint requests. These may pro- 
duce enough confounding variance to make using Assistance 
Score worse at accurate latent parameter estimation than 
using error rate. 
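To make the two outcome measures concrete, the minimal sketch below derives both from the ordered actions a student takes on a single step. The action labels and function names are illustrative assumptions and do not reflect the DataShop schema.

```python
from typing import List


def assistance_score(actions: List[str]) -> int:
    """Count the incorrect attempts plus hint requests made on a step."""
    return sum(1 for action in actions if action in ("incorrect", "hint"))


def error_indicator(actions: List[str]) -> int:
    """Return 0 only if the first action on the step is a correct attempt."""
    return 0 if actions and actions[0] == "correct" else 1


# The two students contrasted in the discussion above.
one_hint = ["hint", "correct"]
error_and_three_hints = ["incorrect", "hint", "hint", "hint", "correct"]
```

Both students receive the same error indicator (1), while their Assistance Scores differ (1 versus 4), which is the extra information PC-AFM tries to exploit.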


In developing a statistical model, PC-AFM, to convert As- 
sistance Scores to knowledge acquisition estimates, we first 
wanted to confirm that PC-AFM works as intended and is 
able to benefit from extra information in Assistance Score 
when no confounding sources for Assistance Score variation 
are present. Indeed, when we generate synthetic data where 
Assistance Scores are stochastically produced from known 
latent parameters, we demonstrate better parameter recov- 
ery using Assistance Score and PC-AFM than using error 
rate and AFM. As shown in Figure 1, PC-AFM estimates 
of student parameters are better correlated with true parameters
and the AFM estimates are biased at the extremes.


This parameter recovery method for comparing these two
different measurement models cannot be applied to real datasets
because the true parameters are unknown. Thus, we
explored two other approaches: parameter estimate
reliability and our novel cross-measure cross-validation
approach. We demonstrated better parameter estimate
reliability (in split-halves comparisons) using PC-AFM than
AFM. We also show how it is possible to use cross-measure
predictions to evaluate which of two different measurement
models, call them M1 and M2, works better. We show that
estimating based on M1 (e.g., Assistance Score) can predict
M2 (e.g., error rate) on held-out data better than estimating
based on M2 itself. We believe this
cross-measure cross-validation is a novel approach for
comparing measurement models.


Assessing whether Assistance Score is a better measure than 
Error Rate in real student data is complicated in two ways. 
First, we do not have access to the true parameters in real 
datasets, so we turn to measures of reliability and predictive 
validity. Second, we know from models of gaming the sys- 
tem and help seeking that students may produce Assistance 
Scores for motivational and metacognitive reasons that are 
potentially independent of a mastery source. In other words, 
Assistance Scores have a student-driven source of variation 
that may reduce their effectiveness in estimating student 


Table 8: Split-halves parameters correlation in real data.

Dataset    Stu Intercept   KC Intercept   KC Slope
           PC     AFM      PC     AFM     PC     AFM
ds308      0.113  0.486    0.971  0.955   0.745  0.583
ds313      0.490  0.830    0.948  0.937   0.865  0.905
ds372      0.427  0.803    0.985  0.968   0.433  0.639
ds388      0.567  0.873    0.946  0.945   0.225  0.354
ds392      0.830  0.901    0.973  0.964   0.494  0.485
ds394      0.541  0.809    0.904  0.906   0.838  0.413


mastery. We hypothesize that our model is struggling to 
estimate student parameters in the real-world datasets due 
to variance in students’ help seeking behavior. 


We found that in real-world datasets PC-AFM can estimate
KC parameters better than AFM, which results in PC-AFM
outperforming AFM in Student-blocked CVs. KC parameter
estimates strongly affect Student-blocked CVs because
they are the sole driver of these predictions. Poor student
estimates do not affect Student-blocked CVs because
they are not carried from training to test: blocking
means the students in the test set differ from those in the
training set. Poor student estimates do affect Random CVs
and Item-blocked CVs because some students are likely to
appear in both the test and training sets.
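The blocking logic described above can be sketched as a fold assignment in which every row sharing a key (a student or an item) lands in the same fold. This is an illustrative sketch only: the function name, the row layout, and the round-robin assignment are assumptions, not the procedure used in the study.

```python
import random
from typing import List, Sequence, Tuple


def blocked_folds(rows: Sequence[Tuple[str, ...]], key_index: int,
                  n_folds: int = 3, seed: int = 0) -> List[int]:
    """Return one fold id per row; rows sharing row[key_index] share a fold."""
    keys = sorted({row[key_index] for row in rows})
    random.Random(seed).shuffle(keys)
    # Deal the shuffled keys out to folds in round-robin order.
    fold_of = {key: i % n_folds for i, key in enumerate(keys)}
    return [fold_of[row[key_index]] for row in rows]


# Hypothetical (student, item) transactions.
rows = [("s1", "i1"), ("s1", "i2"), ("s2", "i1"),
        ("s2", "i2"), ("s3", "i1"), ("s3", "i2")]
student_folds = blocked_folds(rows, key_index=0)  # Student-blocked
item_folds = blocked_folds(rows, key_index=1)     # Item-blocked
```

With Student-blocked folds, no held-out student appears in training, so only KC parameters carry over to the test predictions. With Item-blocked folds, the same students appear on both sides, so student intercepts matter.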


6. CONCLUSION AND FUTURE WORK 


We investigated whether or not Assistance Score provides 
a better measurement model than error rate for estimating 
students’ ability. To pursue this question, we developed a
statistical model, PC-AFM, that utilizes Assistance Score. 
We also faced the more general problem of how to compare 
alternative measurement models for the same desired latent 
outcome. In typical model comparison the predicted out- 
come measure stays the same, but such comparison does not 
work when the outcome measures are different. We proposed 
two strategies to tackle this problem: parameter estimate re- 
liability in split-halves comparisons and a new approach we 
call cross-measure cross-validation. We demonstrated that
these strategies work well by using synthetic data to show 
that a model that better recovers parameters will also yield 
better results with these strategies. 


We demonstrated that PC-AFM outperforms AFM when 
Assistance Scores are synthesized to be meaningful, but its 
performance is hindered by non-ability variance in students’ 
behavior in the real-world datasets. Future work can explore 
this finding by synthesizing Assistance Scores that derive 
from both ability and motivational factors. 


Future work can also test our measurement model compar- 
ison strategies. For example, while it has been standard 
practice in many tutoring systems to count hints as errors 
(M1), some have wondered whether it would be better to not 
count hints as errors (M2). Our measurement model com- 
parison techniques, split-half reliability and cross-measure 
cross-validation, can be used to compare M1 and M2 to in- 
fer which provides better estimates of student ability. 
ABSTRACT


There has been great progress towards Reinforcement Learn- 
ing (RL) approaches that can achieve expert performance 
across a wide range of domains. However, researchers have 
not yet applied these models to learn expert models for ed- 
ucationally relevant tasks, such as those taught within tu- 
toring systems and educational games. In this paper we 
explore the use of Proximal Policy Optimization (PPO) 
for learning expert models for tutoring system tasks. We 
explore two alternative state and action space representa- 
tions for this RL approach in the context of two intelligent 
tutors (a fraction arithmetic tutor and a multicolumn addi- 
tion tutor). We compare the performance of these models to 
a computational model of learning built using the Appren- 
tice Learner architecture. To evaluate these models, we look 
at whether they achieve mastery and how many training op- 
portunities they take to do so. Our analysis shows that at 
least one PPO model is able to successfully achieve mas- 
tery within both tutors, suggesting that RL models might 
be successfully applied to learn expert models for educa- 
tionally relevant tasks. We find that the Apprentice model 
also achieves mastery, but requires substantially less training
(thousands of times fewer examples) than PPO. Finally,
we find that there is an interaction between the PPO rep- 
resentation and task (one representation is better for one 
tutor and the other representation is better for the other 
tutor), suggesting that the design of the state and action 
representations for RL is important for success. Our work 
showcases the promise of RL for expert model discovery in 
educationally relevant tasks and highlights limitations and 
challenges that need further research to overcome. 
Keywords


Reinforcement Learning, Simulated Students, Expert Model
1. INTRODUCTION


Researchers have made great progress towards developing
Reinforcement Learning (RL) models that can meet or ex- 
ceed human skill at complex tasks across a broad range of 
domains. For example, the recently developed Proximal Pol- 
icy Optimization (PPO) algorithm can learn to play a 
broad range of Atari games at an expert level through trial 
and error. A team of five PPO-trained models can beat a 
team of five human professional champions at DOTA2, a 
collaborative online multiplayer battle arena game [4]. Re- 
searchers have applied a related RL approach called A3C 
to develop agents that can beat top human experts at 
StarCraft2, a multiplayer real-time strategy game [28].
Finally, RL has been applied widely to the area of robotics
and autonomous systems; e.g., RL models can fly an F16 to 
beat expert human pilots in simulated 1v1 dogfights [7].


Despite these successes, there has been surprisingly little 
work exploring the applicability of RL to educationally rel- 
evant tasks, such as those found in K12 or higher educa- 
tion. We do not mean that RL has not been applied to sup- 
port learning and education; in fact, there is a large amount 
of work exploring how RL can be applied to optimize students’
instructional sequences [9]. However, we assert that
there has been very little exploration of how emerging RL 
approaches perform on the kinds of educationally relevant 
tasks that humans often engage in; e.g., learning math. 


Given this gap we might ask, what are the benefits of ap- 
plying RL to educationally relevant tasks? The recent work 
on computational models of learning highlights many pos- 
sible benefits. First, machine learning agents can support 
researchers and instructional designers in authoring cogni- 
tive models and discovering knowledge component 
models that can drive personalized learning technolo- 
gies. Although RL models utilize different representations 
than more traditional expert-system models (e.g., statistical 
representations), learned models do still represent an expert 
model.¹ Thus, tutors might apply these models to provide
feedback on student behaviors. Researchers and designers 
might also use machine learning agents to cognitively crash 
test instruction before more costly human trials [15] [31]. 


Given these benefits, why has more work not explored the 
use of recent RL methods for these tasks? One possibility is 
that applying RL to educational tasks is not straightforward. 
There exist toolkits, like GymAI [5], MuJoCoEnv [23], and
PyBullets [6], for interfacing RL algorithms with simulation


¹An expert model maps states to correct next action(s).


environments and game platforms like Atari, StarCraft2,
and DOTA2. These toolkits have powered the explosion of 
RL research. Unfortunately, no such interfaces exist for ed- 
ucational tasks, such as those found in intelligent tutoring 
systems or educational games. Also, many educational tasks 
do not fit cleanly into the standard RL paradigm of observ- 
ing the state, choosing an action, receiving a reward, and 
repeating; e.g., many tutors let learners request hints and 
worked examples, which RL systems cannot leverage.²


Beyond these challenges, it is possible that the tasks them- 
selves are less attractive for RL research. For example, 
educationally relevant tasks, such as fraction arithmetic, 
seem simple when compared to tasks such as flying an F16 
or playing DOTA2. It is possible that these tasks do not 
challenge current RL systems. However, it is also possible 
that these tasks present challenges that have prevented re- 
searchers from successfully applying RL to them. For exam- 
ple, tutor tasks often have much larger action spaces than 
game-based tasks where RL has been successfully applied 
(e.g., actions for inputting the numbers 1-50 in any inter- 
face field vs. six buttons on an Atari controller). 


In this paper, we set out to investigate these ideas and to 
lay a foundation for future research programs to apply RL to 
the kinds of educationally relevant tasks that humans regu- 
larly engage with. Our goal is not to show that RL provides 
a good model of human learning and behavior (we do not 
think that it does). Instead, we simply aim to show how RL 
methods might be applied to tasks relevant to human learn- 
ing. Our hope is that RL approaches might offer new means 
for authoring and evaluating educational technologies. 


To support these investigations we present TutorGym, an 
open-source toolkit for interfacing RL agents (as well as 
other kinds of machine learning agents) with intelligent tu- 
toring system tasks. This toolkit lets us apply RL models 
to two educational environments: a fraction arithmetic tu- 
tor and a multicolumn addition tutor. We developed two 
PPO models that vary in their features and action repre- 
sentations. We also compared these models to a previously 
developed Apprentice Learner model [16], which is a more 
cognitively inspired model of how people learn from exam- 
ples and feedback within intelligent tutors. We conducted a 
factorial study design where we applied these three models 
to our two educational tasks. Our key findings are: 


1. The PPO models are able to achieve mastery at these 
tasks, suggesting that they do generalize from games 
and robotics tasks to educationally relevant tasks; 


2. The PPO models require much more training than our 
Apprentice Learner model to achieve mastery (thou- 
sands of times more training), even when we provide 
PPO with the same background knowledge as Appren- 
tice (the PPO-Operator variant). This suggests that 
human-like models, such as Apprentice, are more effi- 
cient than PPO. 


3. We find that there is an interaction between PPO’s 
representation and the task, suggesting that represen- 
tation is central to RL performance and that it needs 
to be tailored for each task. 


²Inverse RL can learn from expert examples, but typi-
cally this is done offline in batch rather than interactively 
interleaved with RL. 


We claim that there is a synergistic opportunity to do
research at the intersection of RL and education that has not
yet been fully explored, and this paper aims to lay the
foundation for these future explorations. There are many potential
ways that the educational data mining and learning analytics
communities might benefit from the development and use of
RL models, such as PPO. Similarly, there is an opportunity
to improve RL by exploring its application to the kinds of
educationally relevant learning tasks that humans engage in
during K12 and higher education.


2. BACKGROUND 


2.1 OpenAI Gym 


OpenAI Gym is an open-source toolkit for RL development
[5]. Gym provides a standardized interface for applying RL
to tasks. An environment created with Gym has standardized
state and action descriptions and supports methods
for querying the state, taking an action, and collecting rewards.
Gym currently supports multiple environments such
as robot simulations or Atari games. Our research builds on 
Gym, so that we can directly interface existing RL imple- 
mentations with educationally relevant tasks without having 
to create custom implementations. 
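The interaction pattern described above (observe the state, take an action, collect a reward, repeat) can be sketched without any dependencies. The class below is a hypothetical stand-in that only mirrors the reset/step convention of the classic Gym interface; it is not part of Gym or TutorGym, and the task, reward values, and names are assumptions made for illustration.

```python
import random
from typing import Callable, Dict, Tuple

Observation = Tuple[int, int]


class ToyAdditionEnv:
    """Hypothetical single-digit addition task with a Gym-like interface."""

    def __init__(self, seed: int = 0) -> None:
        self._rng = random.Random(seed)
        self._a = 0
        self._b = 0

    def reset(self) -> Observation:
        """Start a new problem and return the observation (the two addends)."""
        self._a = self._rng.randint(0, 4)
        self._b = self._rng.randint(0, 4)
        return (self._a, self._b)

    def step(self, action: int) -> Tuple[Observation, float, bool, Dict]:
        """Check an answer: reward +1 and finish if correct, otherwise -1."""
        correct = action == self._a + self._b
        reward = 1.0 if correct else -1.0
        return (self._a, self._b), reward, correct, {}


def run_episode(env: ToyAdditionEnv, policy: Callable[[Observation], int],
                max_steps: int = 20) -> float:
    """Run one episode of the observe/act/reward loop; return total reward."""
    obs = env.reset()
    total = 0.0
    for _ in range(max_steps):
        obs, reward, done, _info = env.step(policy(obs))
        total += reward
        if done:
            break
    return total


expert_return = run_episode(ToyAdditionEnv(), lambda obs: obs[0] + obs[1])
```

Because any agent only needs `reset` and `step`, an existing RL implementation written against this convention can be pointed at a new task without custom glue code, which is the property the paragraph above relies on.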


2.2 Proximal Policy Optimization (PPO) 

PPO is a deep RL algorithm that was recently developed by 
OpenAI [25]. It is a policy gradient method that achieves 
state-of-the-art performance across many tasks. We chose 
PPO over alternatives, such as TRPO and ACER [29],
because it supports a broad range of state and action rep- 
resentations and is much easier to tune than alternatives. 
For this work, we use the stable-baselines3 implementation 
of the PPO algorithm, which has verified performance on 
multiple RL benchmarks [22]. 


2.3 Apprentice Learner Architecture 

The open-source Apprentice Learner Architecture gen- 
eralizes prior simulated student models and provides 
a platform for investigating and comparing alternative sim- 
ulated student models. Apprentice models have been suc- 
cessfully applied to learn expert models for 8 different tutor 
tasks spanning multiple domains (math, language, chem- 
istry, and engineering) [17]. Emerging work explores the use 
of Apprentice models for supporting domain experts, such as 
teachers, in authoring tutors through teaching rather than 
programming [30]. In this work, we use one of the stan- 
dard Apprentice models as a baseline for evaluating the PPO 
models because it has been successfully applied in previous
work to learn expert models for educationally relevant tasks. 
For a complete description of this model see [17]. 


3. TUTORGYM 


To support the development of machine learning agents we
created TutorGym, a toolkit that provides machine interfaces
for multiple tutor environments.³ TutorGym leverages
Gym to enable existing RL implementations (that
support Gym) to interface with these environments.


Our toolkit extends Gym to enable agents to request worked 
examples. Tutors generate both next-step hints and feed- 
back, so the examples are automatically generated by the 


³We have open-sourced TutorGym under an MIT license
and made it publicly available here: https://tutorgym.ai 
Initial State (fractions tutor)

{ initial_num_left: “5”,
  initial_denom_left: “4”,
  initial_operator: “+”,
  initial_num_right: “14”,
  initial_denom_right: “10”,
  check_convert: “”,
  convert_num_left: “”,
  convert_denom_left: “”,
  convert_num_right: “”,
  convert_denom_right: “”,
  answer_num: “”,
  answer_denom: “” }

Final State (fractions tutor)

{ initial_num_left: “5”,
  initial_denom_left: “4”,
  initial_operator: “+”,
  initial_num_right: “14”,
  initial_denom_right: “10”,
  check_convert: “x”,
  convert_num_left: “50”,
  convert_denom_left: “40”,
  convert_num_right: “56”,
  convert_denom_right: “40”,
  answer_num: “106”,
  answer_denom: “40” }

Initial State (multicolumn tutor; answer fields are listed with Figure 2)

{ thousands_carry: “”,
  hundreds_carry: “”,
  tens_carry: “”,
  upper_hundreds: “2”,
  upper_tens: “7”,
  upper_ones: “1”,
  lower_hundreds: “7”,
  lower_tens: “6”,
  lower_ones: “2”,
  operator: “+”,

Final State (multicolumn tutor; answer fields are listed with Figure 2)

{ thousands_carry: “1”,
  hundreds_carry: “1”,
  tens_carry: “”,
  upper_hundreds: “2”,
  upper_tens: “7”,
  upper_ones: “1”,
  lower_hundreds: “7”,
  lower_tens: “6”,
  lower_ones: “2”,
  operator: “+”,


Figure 1: Fractions tutor, as rendered within TutorGym with 
its underlying base feature representation. 


underlying tutor. As RL models only learn from feedback on 
actions, these interactions are mainly used by the Appren- 
tice models, which learn from both examples and feedback. 


TutorGym logs agent interactions in DataShop format [12], 
which is a common educational data format. Outputting 
data in this format lets us analyze it using the same tech- 
niques used to analyze human tutor data. In particular, it 
lets us conduct learning curve analysis to investigate agents’ 
first attempt correctness as they receive more practice. 


We implemented two tutors within TutorGym: a fraction 
arithmetic tutor and a multicolumn addition tutor [30]. 
We chose these tutors because they exhibit interesting state 
and action space characteristics that are relevant to our
analysis of emerging RL approaches. 


3.1 Fraction Arithmetic Tutor 

This tutor was used to study both human and agent 
learning. It presents students with three kinds of fractions 
problems: addition with the same denominator, addition 
with different denominators, and multiplication. The stu- 
dents check a box indicating whether they need to convert 
to common denominators before solving. If the fractions 
need to be converted, then they input values into the con- 
version fields. The tutor requires students to convert frac- 
tions to common denominators using cross multiplication. 
If students do not need to convert, then they directly enter
values into the final fraction fields. Figure 1 shows the
visual representation of the tutor state generated by Tutor- 
Gym along with the underlying attribute-value representa- 
tion that it maintains internally. The tutor gives students 
randomly generated problems where the initial numerators 
range from 1-15, the initial denominators range from 2-15, 
and the type of problem can be either addition or multipli- 
cation. There is also an “easy” version of the tutor that gen- 


Final State answer fields (multicolumn tutor):

  answer_thousands: “1”,
  answer_hundreds: “0”,
  answer_tens: “3”,
  answer_ones: “3” }


Initial State answer fields (multicolumn tutor):

  answer_thousands: “”,
  answer_hundreds: “”,
  answer_tens: “”,
  answer_ones: “” }


Figure 2: Multicolumn tutor, as rendered within TutorGym 
with its underlying base feature representation. 


erates a much smaller range of numbers (numerators range 
from 1-5 and denominators range from 2-5). The tutoring 
system also has a done button (not shown) that the agent 
can select and it can provide worked examples on request. 
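The cross-multiplication procedure that the fractions tutor requires can be sketched as follows, using the problem shown in Figure 1 (5/4 + 14/10). The function names and the tuple representation are assumptions for illustration; the tutor itself works on interface fields rather than Python values.

```python
from typing import Tuple

Frac = Tuple[int, int]  # (numerator, denominator)


def cross_multiply(left: Frac, right: Frac) -> Tuple[Frac, Frac]:
    """Convert two fractions to a common denominator by cross multiplication."""
    (ln, ld), (rn, rd) = left, right
    return (ln * rd, ld * rd), (rn * ld, rd * ld)


def add_fractions(left: Frac, right: Frac) -> Frac:
    """Add two fractions, converting first only if the denominators differ."""
    if left[1] == right[1]:
        return (left[0] + right[0], left[1])
    (ln, denom), (rn, _) = cross_multiply(left, right)
    return (ln + rn, denom)


converted = cross_multiply((5, 4), (14, 10))  # conversion fields in Figure 1
answer = add_fractions((5, 4), (14, 10))      # answer fields in Figure 1
```

The converted values (50/40 and 56/40) and the unreduced answer (106/40) match the final state listed with Figure 1; note that the result is not simplified.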


3.2 Multicolumn Addition Tutor


The second tutor was used in previous research on simulated 
students [30]. It presents students with two numbers to add, 
with each digit presented in its own field. To compute the 
solution, students need to add and carry values where
necessary. The tutor requires students to enter the answer
values right to left, carrying where necessary. The tutor will
mark an answer incorrect if they have not yet filled in the
answer field to the right or they have not yet carried over
a value from the previous column (if required). Figure 2
shows a simple visual output created by TutorGym along
with the underlying attribute-value representation. The tutor
also has a done button (not shown) and can provide
worked examples on demand.


4. LEARNING MODELS 


4.1 Apprentice Learner 

We created three alternative learning models to train within 
the TutorGym tutors. We built our first model using the 
Apprentice Learner architecture [30]. From this architecture,
we used an Apprentice model developed in prior
work [17]. For each tutor, we provided the apprentice model 
with background relational knowledge (for augmenting the 
state description) and primitive operators (for explaining 
demonstrations). For the fractions tutor, we provided equal- 
ity knowledge, which adds features to the state description 
for each pair of fields denoting whether they have equal val- 
ues. We also provided three primitive operators: copy, add, 
and multiply, which give the agent the ability to copy, add, 
and multiply values from the interface. 
For the multicolumn tutor, the knowledge was slightly more
complicated. We added four relational knowledge operators: 
add2-ones, add2-tens, add3-ones, and add3-tens. The first 
two add the values from every pair of fields in the interface 
and add features to the state denoting the ones component 
and tens component of each sum. Add3-ones and add3-tens 
do the same, but for every triplet of fields. This provides the 
agent with the ability to determine if a column of numbers 
(either of length two or three) will generate a value that 
needs to be carried or not. We also added these exact same 
operators as primitive operators, so the agent can use them 
to explain and perform the actual steps of computing each 
column sum and generating the appropriate carry values.
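The relational knowledge described above can be sketched as a function that augments a state with the ones and tens components of every pair and triple of filled fields. This is an illustrative sketch under assumptions: the feature naming scheme, the use of unordered combinations, and the skipping of empty fields are choices made here, not details confirmed by the Apprentice Learner implementation.

```python
from itertools import combinations
from typing import Dict


def relational_features(state: Dict[str, str]) -> Dict[str, str]:
    """Add ones/tens features for the sum of every pair and triple of digits."""
    filled = {name: int(value) for name, value in state.items()
              if value.isdigit()}
    features = dict(state)
    for size, prefix in ((2, "add2"), (3, "add3")):
        for fields in combinations(sorted(filled), size):
            total = sum(filled[name] for name in fields)
            args = ",".join(fields)
            features[f"{prefix}-ones({args})"] = str(total % 10)
            features[f"{prefix}-tens({args})"] = str(total // 10)
    return features


# Part of the initial state from Figure 2 (271 + 762).
state = {"upper_tens": "7", "lower_tens": "6",
         "upper_ones": "1", "lower_ones": "2", "tens_carry": ""}
augmented = relational_features(state)
```

In this example the tens column sums to 13, so the added tens feature for that pair is 1, which tells the agent that a carry is needed before it has entered anything.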


4.2 PPO-Number 

A PPO model is defined in terms of its state (input) and 
action (output) representations. For the fractions and
multicolumn tutors, the PPO-Number model makes use of the
base state representations shown in Figures 1 and 2. To
convert these representations into a format that is acceptable
to PPO (a fixed-length feature vector), we used an approach 
called one-hot encoding. Under this scheme, every unique 
attribute-value pair from the state is mapped to a particular 
feature in a feature vector. If the attribute-value is present 
in the state, then the feature is a 1, otherwise it is a 0. 


Unfortunately, precomputing all possible attribute-values is 
non-trivial. To address this issue, we created an online 
one-hot encoder that always outputs a vector with a fixed 
length preselected by the user. Whenever the encoder en- 
counters a new attribute-value, it maps it to a previously 
unused feature within the vector. After a mapping between 
an attribute-value and a feature has been made, that fea- 
ture is only ever used to represent that particular attribute- 
value. This scheme enables the use of RL approaches that 
expect fixed-length vectors (PPO) even though the system 
might encounter a large number of sparse features that are 
not known in advance. The end result is that states from
the fraction and multicolumn tutors are mapped to fixed-length
feature vectors, where every feature is either a 0 or a
1 (e.g., the initial state from Figure 2 would have a feature
for upper_tens = 7 with a value of 1). For the fractions
tutor, 2000 features were sufficient to describe states in the
standard tutor and 900 features were sufficient to describe
the states in the easy tutor (with a smaller set of problems).
For the multicolumn tutor, 110 features were sufficient.
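The online one-hot encoding scheme described above can be sketched in a few lines. The class and method names are hypothetical, and raising an error when the vector is full is an assumption; the text does not say how overflow is handled.

```python
from typing import Dict, List, Tuple


class OnlineOneHotEncoder:
    """Map attribute-value pairs to slots of a fixed-length 0/1 vector."""

    def __init__(self, size: int) -> None:
        self.size = size
        self._index: Dict[Tuple[str, str], int] = {}

    def encode(self, state: Dict[str, str]) -> List[int]:
        """Return the 0/1 vector for a state, assigning new slots on demand."""
        vector = [0] * self.size
        for pair in state.items():
            if pair not in self._index:
                if len(self._index) >= self.size:
                    # Assumed behavior: fail loudly instead of reusing a slot.
                    raise ValueError("feature vector full; use a larger size")
                self._index[pair] = len(self._index)
            vector[self._index[pair]] = 1
        return vector


encoder = OnlineOneHotEncoder(size=6)
first = encoder.encode({"upper_tens": "7", "tens_carry": ""})
second = encoder.encode({"upper_tens": "7", "tens_carry": "1"})
```

Once a pair such as upper_tens = 7 is assigned a slot, that slot keeps the same meaning for every later state, so the vectors remain comparable across the whole training run.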


Given this state representation, PPO-Number utilizes a mul- 
tidiscrete output. This type of output has multiple indepen- 
dent discrete action outputs; e.g., in Atari it might have an 
output for the arrow pad (left, right, up, down) and another
output for the action buttons (A, B, or None).
PPO-Number also has two outputs: one that outputs a field 
to enter a value into (e.g., answer_num) and a second that 
outputs a number to enter into that field (e.g., 1). 


For the fractions tutor, there are eight fields that can be 
selected for input and there are 450 possible numbers that 
can be entered into one of these fields (1-450). For the easy 
version of the tutor, there are only 50 possible numbers (1- 
50). Taken together, this means that the standard tutor has 
3,600 unique actions (8 x 450 = 3600) and the easy tutor has 
400 unique actions (8 x 50 = 400). There are slightly fewer
actions in practice because outputs are ignored in certain
cases; if the system selects the done or the check_convert
fields, then the number component is ignored.


For the multicolumn tutor, there are also eight possible fields 
that can be selected. Each field represents a single digit, 
so there are only 10 numbers that can be input into each 
field (0-9). This yields a total action space of 80 actions 
(8 x 10 = 80). Similar to fractions, the total is slightly less 
in practice because the number component of the output is 
ignored when the done button is selected. 


4.3 PPO-Operator 


This model uses a different state and action representation 
from PPO-Number. The representation aims to mirror the 
representation used by the Apprentice model. We apply 
the relational knowledge used by Apprentice to augment the 
base state representation from each tutor. In the fractions 
tutor, we apply the equality relation to add an additional 
feature describing which pairs of fields are equal. For the 
multicolumn tutor, we apply the add2-ones, add2-tens, add3-
ones, and add3-tens relations to compute the ones and tens
values for the sums of every unique pair and triple of values 
from the tutor fields. We applied the same one-hot encoding 
approach used for PPO-Number to convert attribute-values 
into fixed-length feature vectors. We increased the size of 
the feature vectors to support the combinatorial number of 
additional relational features (2000 for fractions and 5000 
for multicolumn). 


The action space is multidiscrete, but the number and type
of outputs are slightly different from PPO-Number. For the
fractions tutor, the model has four outputs. The first is
similar to PPO-Number’s selection output: it identifies the
field to update with a result. There are eight possible fields
that might be updated by an action. The second output
corresponds to an operator to apply. The operators are the
same as those available to Apprentice: copy, add, or multiply.
The remaining two outputs correspond to fields in the
interface that provide the two arguments for each operator,
and there are ten possible fields that can be used for either of
these arguments. Using this scheme, an agent might choose
to update the answer_num field using the add operator, and
it might provide initial_num_left and initial_num_right
as arguments. There are 2400 possible unique actions under
this representation (8 x 3 x 10 x 10 = 2400). However, this
number is smaller in practice. If the model chooses to update
the done or check_convert fields, then the rest of the action
outputs are ignored. Additionally, if the model chooses to
use the copy operator, then only the first argument is used
(the second is ignored).


For the multicolumn tutor there are five outputs instead 
of four. The first corresponds to a field to update (there 
are eight possible fields). The second corresponds to the 
operator to apply. There are five operators corresponding to 
those used by Apprentice: copy, add2-tens, add2-ones, add3- 
tens, add3-ones. Finally, there are three argument fields 
because some of the operators (add3-tens and add3-ones) 
take three arguments. There are 13 possible options for 
each argument. With these outputs, there are 87,880 unique
actions (8 x 5 x 13 x 13 x 13 = 87880). In practice this number
is much smaller because if the done field is updated, then all
the other outputs are ignored. Similarly, if the copy operator
is selected, then only the first argument is used (second and




Model      Domain           # Inputs   # Discrete Outputs
Number     Fractions        2000       8, 450
Number     Fractions-Easy   900        8, 50
Number     Multicolumn      110        8, 10
Operator   Fractions        2000       8, 3, 10, 10
Operator   Multicolumn      5000       8, 5, 13, 13, 13


Table 1: Size of PPO model input/output for each task. 


third arguments are ignored). Finally, if the add2-tens or
add2-ones operators are selected, then the third argument is
ignored. Table 1 shows a summary of the number of inputs
and outputs for each model and tutor.
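As a worked check of the action-space arithmetic, the nominal number of unique actions for each model and tutor is the product of its discrete output sizes as stated in the text. The dictionary layout below is only an illustrative assumption.

```python
from math import prod

# Discrete output sizes per (model, domain), as described in Sections 4.2-4.3.
OUTPUT_SIZES = {
    ("Number", "Fractions"): (8, 450),
    ("Number", "Fractions-Easy"): (8, 50),
    ("Number", "Multicolumn"): (8, 10),
    ("Operator", "Fractions"): (8, 3, 10, 10),
    ("Operator", "Multicolumn"): (8, 5, 13, 13, 13),
}

# Nominal action count: every combination of the independent outputs.
action_counts = {key: prod(sizes) for key, sizes in OUTPUT_SIZES.items()}
```

These are upper bounds: as noted above, the effective counts are smaller because some outputs are ignored (for example, arguments after the first when the copy operator is chosen).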


5. SIMULATION STUDY 

5.1 Tuning and Training Models


We conducted a simulation study with a factorial design, 
where every agent (Apprentice, PPO-Number, and PPO-Operator)
was trained in each environment (fractions and
multicolumn). The hyperparameters used by PPO greatly
affect its performance, and they must be tuned independently
for each model and task. We used Optuna, an
open-source hyperparameter optimization framework, to automate
hyperparameter search [2]. Using Optuna, we ran
approximately 100 iterations of hyperparameter tuning for 
each PPO model and task pair. Tuning one model for one 
task took approximately 38 hours. The Apprentice model 
does not have any hyperparameters that need to be tuned. 


We trained each model in each environment using the best 
hyperparameters. We trained Apprentice on 500 fractions 
problems and 5000 multicolumn problems. These amounts 
provided enough practice to reach mastery while minimizing 
unnecessary computation. We trained each PPO model for 
1 million steps, which translates into a varying number of 
problems depending on the number of incorrect steps. To
analyze the simulation logs, we assigned knowledge component
labels to each field for each problem type (e.g., answer
ones place for multicolumn), computed the first-attempt


[Plot for Figure 3: first-attempt error (y-axis) against log10(opportunity) (x-axis); legend: AL, PPO-Number-Easy, PPO-Operator.]


Figure 3: Fraction arithmetic learning curves. 


correctness on each knowledge component for each problem,
and plotted this correctness on a log scale with values
smoothed using binomial Gaussian additive smoothing (to
account for the 0/1 nature of the correctness values).
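As an illustration of the log analysis described above, the following minimal sketch computes the mean first-attempt error at each practice opportunity. The log format (a chronological list of knowledge-component label and correctness pairs) and the function name are assumptions made for this example, and the smoothing step is omitted.

```python
from collections import defaultdict


def error_by_opportunity(log):
    """Mean first-attempt error per practice opportunity.

    log: chronological list of (kc_label, first_attempt_correct) pairs
    (hypothetical format). Opportunities are counted per knowledge
    component, starting at 1.
    """
    seen = defaultdict(int)                # kc -> opportunities so far
    totals = defaultdict(lambda: [0, 0])   # opportunity -> [errors, count]
    for kc, correct in log:
        seen[kc] += 1
        bucket = totals[seen[kc]]
        bucket[0] += 0 if correct else 1
        bucket[1] += 1
    return {opp: errs / n for opp, (errs, n) in sorted(totals.items())}


log = [("ones", False), ("tens", False), ("ones", True),
       ("tens", False), ("ones", True)]
curve = error_by_opportunity(log)
print(curve)  # {1: 1.0, 2: 0.5, 3: 0.0}
```

Plotting these values against log10 of the opportunity number would give an unsmoothed counterpart of the learning curves in the figures.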


5.2 Results 


See Appendix A for the results of hyperparameter tuning.
During tuning, we were unable to get PPO-Number to converge
to a correct model in the fractions tutor. We hypothesized
this was due to the large number of actions for
this model/task. To test this hypothesis, we trained
PPO-Number on the easy fractions tutor, which has substantially
fewer actions. PPO-Number converged to correct behavior on
this tutor, supporting our hypothesis.


Figures 3 and 4 show the learning curves for the different
models in the two tutor environments. We find that Apprentice
converges to mastery after 10 opportunities for each
knowledge component in the fractions tutor and 125 in the
multicolumn tutor. In contrast, PPO-Number requires over
10,000 opportunities to reach mastery on the easy fractions
tutor and over 10,000 practice opportunities to reach mastery
in the multicolumn tutor. PPO-Operator requires fewer
opportunities (3,000) within the fractions tutor, but never
quite reaches mastery within the multicolumn tutor, even
after 10,000 opportunities. Even though both PPO-Number
and PPO-Operator receive the same number of training steps (1
million), PPO-Operator makes more mistakes per problem
and receives fewer problems as a result.


5.3 Discussion

At least one PPO model was able to achieve mastery in each 
tutor. PPO-Operator achieved mastery in the fractions tu- 
tor and PPO-Number achieved mastery in the multicolumn 
tutor. This suggests that PPO can generalize from game 
and robotics tasks to tutor tasks. However, the finding that 
no single representation is best suggests that the represen- 
tations must be customized for each task. 


Figure 4: Multicolumn addition learning curves.


PPO-Number was unable to master the standard fractions
tutor. We suspect this is due to the single output channel
with 450 actions. Based on our experience, model performance
degrades when the number of actions on one of the
multidiscrete outputs gets large. Future work should explore
replacing the 450 action output with three outputs: a
hundreds digit output (0-4), a tens digit output (0-9), and a
ones digit output (0-9). We also found that PPO-Operator
was unable to achieve mastery in the multicolumn tutor.
It may achieve mastery with more training (e.g., 1.5 million
training steps rather than 1 million). Assuming this is true,
then PPO-Operator, which uses the same relational and operator
knowledge as Apprentice, seems to be more generally
applicable than PPO-Number. Apprentice achieves mastery
in both tasks with substantially less practice (thousands of
times less), suggesting Apprentice has more efficient learning.
Apprentice models have been shown to have similar
learning curves to human students for the fractions tutor.
This implies that PPO models require substantially
more training than human learners.
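The digit decomposition proposed above for future work can be sketched as follows. This is only an illustration: it assumes, hypothetically, that the 450 actions index the integers 0 to 449, and the function names are invented for this example.

```python
# Sizes of the three proposed digit heads: hundreds (0-4), tens (0-9), ones (0-9).
DIGIT_HEAD_SIZES = (5, 10, 10)


def to_digits(value):
    """Split a number into (hundreds, tens, ones) head outputs."""
    if not 0 <= value <= 499:
        raise ValueError("value must fit in three digits with hundreds <= 4")
    return value // 100, (value // 10) % 10, value % 10


def from_digits(hundreds, tens, ones):
    """Recombine the three head outputs into the number they encode."""
    return 100 * hundreds + 10 * tens + ones


print(to_digits(449))        # (4, 4, 9)
print(from_digits(4, 4, 9))  # 449
```

With this encoding the largest single head has 10 options rather than 450, at the cost of the three heads jointly covering 500 combinations, a few of which would be unused.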


One limitation of the current study is that PPO may be more 
efficient when trained with multiple simultaneous environ- 
ments (e.g., 8 tutors in parallel). Parallel training provides 
more diversity and improves learning. We tested this idea 
by training PPO models on 8 parallel environments for both 
the fractions and multicolumn tutors. We found that paral- 
lel PPO required an equivalent amount of practice to achieve 
mastery as non-parallel PPO; however, parallel PPO took 
less total wall time to train (e.g., 12 instead of 48 hours). 
Future work should explore the benefits and trade-offs of 
parallel training. Also, PPO is an on-policy RL approach, 
as opposed to an off-policy approach like Deep Q-Networks 
(DQN) [20]. As such, PPO only trains on the data that is 
immediately sampled from the environment; it discards old 
training data because it will cause the model to diverge. In 
contrast, DQN saves all experiences and continues to train 
on them over the course of learning. We would have liked to 
compare PPO to DQN, but DQN does not support multi-discrete
action outputs, so it could not be evaluated using the
current Number and Operator representations. Future work 
should explore modifications of DQN (or other off-policy 
models) to see how they perform on these tasks. 


6. RELATED WORK 


There has been substantial work exploring the use of RL to
optimally sequence students’ practice [9]. Unfortunately,
this approach requires a large amount of data. One solution
is to train models using simulated student data. However,
simulated student models are often simplistic and not
representative of real student behavior [3]. As a result, sequencing
models built from synthetic data typically perform
poorly with human students. The RL models we propose 
might serve as better simulated student models. Adopting 
a rational analysis perspective 1a], we hypothesize that 
agents that face the same task and processing constraints as 
humans will have similar behavior; i.e., sequencing models 
that are best for agents should be best for humans. However, 
future work is needed to investigate this hypothesis. 


A similar parallel hypothesis is that when the task and processing
constraints between RL and humans differ, then their
behavior is likely to differ. To investigate this idea, Stamper
et al. explore differences in human vs. RL expertise for 
two games: Connect Four and Space Invaders. We view our
work as complementary to this research, and future work
should compare the behavior of expert models learned in 
this work to the behavior of human experts. 


Simulated students have been used for a wide range of applications
including theory testing, expert model authoring
[17, 30], and teachable agents. However, we are unaware
of previously developed simulated students that make use of 
RL. Some of this prior work aims to model human learn- 
ing and behavior. In contrast, this work makes little effort 
to model humans. We view this as a shortcoming of our 
current study, due to its preliminary nature. Future work 
should explore how RL approaches, such as those explored 
here, might be integrated within human-like simulated stu- 
dent models, such as Apprentice. 


7. FUTURE WORK 


TutorGym and our initial PPO models lay the foundation 
for a number of novel research directions. One promising
direction we hope to explore concerns the use of RL to discover
buggy student knowledge. During learning, RL agents
make many mistakes. We should explore how these mistakes 
relate to the kinds of mistakes that humans make. VanLehn 
investigated the “mind bugs” that human students ex- 
hibit in multicolumn arithmetic. Future work should explore 
how RL bugs compare to human bugs and if RL can support 
the discovery of bug knowledge for tutor tasks. 


8. CONCLUSIONS 


We explore the application of PPO, an emerging RL
approach, to educationally relevant tasks. While RL has
been successfully applied to learn expert models across many 
tasks and domains, it has not yet been applied in the con- 
text of educationally relevant tasks. To support this explo- 
ration, we created TutorGym, a toolkit for interfacing RL 
models with educational training environments. We created 
two tutor-based environments within TutorGym: a fraction 
arithmetic tutor and a multicolumn addition tutor. 


We created two PPO models that differ in their state and 
action representations (PPO-Number and PPO-Operator). 
For comparison purposes, we created a simulated student 
model using the Apprentice Architecture that has a similar 
state and action representation to the PPO-Operator model, 
but uses different (non-RL) learning mechanisms that are 
specifically designed to model human learning. We con- 
ducted a factorial study that varied the model and task. 
We found that at least one PPO model is able to achieve 
mastery within each tutor, suggesting that PPO is appli- 
cable to educationally relevant tasks. Despite this success, 
we found that both PPO models require substantially more 
training to reach mastery than Apprentice. This suggests 
that educationally relevant tasks present an interesting use 
case for the study and advancement of RL research. We also 
found an interaction between the type of PPO model and the 
task (PPO-Operator is best for fractions, but PPO-Number 
is best for multicolumn). This suggests that PPO’s repre- 
sentation affects its performance and must be customized 
specifically for each task. 


This work lays the foundation for future research to study 
and develop RL approaches for educationally relevant tasks. 
Our hope is that TutorGym and our initial models enable 
new research into how RL can support human education.


Hyperparameter | PPO-Number     | PPO-Operator | PPO-Number  | PPO-Operator
               | Fractions-Easy | Fractions    | Multicolumn | Multicolumn
n_steps        | 1024           | 256          | 32          | 128
batch_size     | 512            | 32           | 32          | 64
gamma          | 0.0            | 0.0          | 0.0         | 0.0
learning_rate  | 4.75e-3        | 4.56e-5      | 1.43e-3     | 7.13e-4
lr_schedule    | constant       | constant     | linear      | constant
entropy_coef   | 6.27e-3        | 3.28e-2      | 4.21e-2     | 2.91e-3
clip_range     | 0.2            | 0.1          | 0.2         | 0.4
n_epochs       | 1              | 10           | 5           | 1
gae_lambda     | 0.98           | 0.99         | 0.92        | 1.0
max_grad_norm  | 0.8            | 0.5          | 0.7         | 0.3
vf_coef        | 0.915          | 0.240        | 0.401       | 0.568
net_arch       | small          | tiny         | medium      | small
shared_arch    | False          | True         | False       | True
activation_fn  | tanh           | tanh         | relu        | tanh


Table 2: PPO hyperparameters identified using hyperparameter optimization. 


APPENDIX 
A. HYPERPARAMETER TUNING 


For each tuning trial, Optuna selects hyperparameters from
a prior sampling distribution, trains the model using these
values, and measures the resulting performance. Within a
trial, the PPO model is trained for 350,000 steps. The final
model performance is used to update the hyperparameter
sampling distribution, so subsequent iterations sample
more promising hyperparameters. Optuna also implements
a sample pruner, which detects PPO trials that are
underperforming (e.g., if PPO performance gets worse with training
rather than better) and prunes these samples early.


Table 2 shows the hyperparameters that were identified by
Optuna for each PPO model and domain. The hyperparameter
values are not particularly interpretable, but we report
them here so other researchers can replicate our results. It
is worth noting that we manually fixed the gamma value at
0.0, since tutoring systems provide immediate reward and
future rewards do not need to be factored into decision making.
Additionally, the tiny net architecture used a neural
network with two layers and 32 nodes per layer, the small
network used 64 nodes per layer, and the medium network
used 128 nodes. If the architecture was shared, then the second
layer of the network was shared by both the value and
the policy head of the network. However, if they were not
shared, then there were separate second layers for the value
and policy heads. Finally, a constant lr_schedule means that
the learning rate is held constant over the course of training,
whereas a linear schedule means that the learning rate is
decreased linearly towards 0 over the course of training.
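The following sketch restates three of the settings described above in code: the architecture labels, the two learning-rate schedules, and the effect of fixing gamma at 0.0. It is not tied to any particular RL library; the function names are hypothetical, and the assumption that the small and medium networks also have two layers is ours (the text states the layer count only for the tiny network).

```python
# Hidden-layer widths per architecture label. Two layers for "small" and
# "medium" is an assumption; the text gives the layer count only for "tiny".
NET_ARCH = {"tiny": [32, 32], "small": [64, 64], "medium": [128, 128]}


def learning_rate(initial_lr, schedule, progress):
    """Learning rate after a fraction `progress` (0.0 to 1.0) of training."""
    if schedule == "constant":
        return initial_lr
    if schedule == "linear":
        return initial_lr * (1.0 - progress)
    raise ValueError(f"unknown schedule: {schedule}")


def discounted_return(rewards, gamma):
    """Sum of gamma**t * reward_t over the remaining steps."""
    return sum((gamma ** t) * r for t, r in enumerate(rewards))


# PPO-Number on multicolumn uses a linear schedule starting at 1.43e-3.
halfway = learning_rate(1.43e-3, "linear", 0.5)
# With gamma = 0.0 only the immediate reward contributes to the return.
myopic = discounted_return([1.0, -1.0, 1.0], gamma=0.0)
print(myopic)  # 1.0
```

With gamma set to 0.0 the return reduces to the immediate reward, which matches the rationale given above for tutoring systems that provide feedback on every step.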
ABSTRACT

In Intelligent Tutoring Systems (ITS), methods to choose
the next exercise for a student are inspired by generic
recommender systems, used, for instance, in online shopping
or multimedia recommendation. As such, collaborative fil- 
tering, especially matrix factorization, is often included as a 
part of recommendation algorithms in ITS. 


One notable difference in ITS is the rapid evolution of users, 
who improve their performance, as opposed to multimedia 
recommendation where preferences are more static. This 
raises the following question: how reliably can we use matrix 
factorization, a tool tried and tested in a static environment, 
in a context where timelines seem to be of importance?


In this article we tried to quantify empirically how much in- 
formation can be extracted statically from datasets in edu- 
cation versus datasets in multimedia, as the quality of such 
information is critical to be able to accurately make pre- 
dictions and recommendations. We found that educational 
datasets contain less static information compared to multi- 
media datasets, to the extent that vectors of higher dimen- 
sions only marginally increase the precision of the matrix 
factorization compared to a 1-dimensional characterization. 
These results show that educational datasets must be used 
with time information, and warn against the dangers of di- 
rectly trying to use existing algorithms developed for static 
datasets. 


Keywords 
Knowledge tracing, Recommender systems, collaborative fil- 
tering, static models, matrix factorization 


1. INTRODUCTION 


Knowledge tracing tries to model the knowledge of students 
as they learn, and is a key component of Intelligent Tutoring 
Systems (ITS). In such systems, the aim is to recommend resources,
such as exercises (or “problems”), to students in the
most effective way, that is, to recommend resources which 
correspond to their learning needs. These resources can be 
of various forms, but in this article we focus solely on rec- 
ommending new problems for the student to solve. In order 
to perform any recommendation, we believe that we should 
be able to predict the outcome of one particular student try- 
ing to solve one particular problem; we call this a (student, 
problem) pair. Ideally, we have perfect information for all 
such (student, problem) pairs, whether that information is 
actual (extracted from observation) or deduced (based on 
previous observation). This would allow us, for instance, to 
skip problems that are predicted as being too easy or too 
hard for a student. 


Each (student, problem) pair could reflect a level of “diffi- 
culty” indicating the student’s proficiency. In such a system, 
one would derive existing difficulty levels from known inter- 
actions, for instance through how much time was required for 
a student to solve a problem, or how many attempts it took 
to successfully solve it. A “good” system would then predict 
difficulty levels for interactions that did not happen, possi- 
bly with a confidence measure of the outcome prediction. It 
would also provide an understanding of the structure of the 
problem set. For example, it would enable the recognition 
of problems that train similar skills or use similar knowl- 
edge, without relying on expert knowledge components that 
require human expertise. 


Historically, the field of knowledge tracing has been independent
of recommender systems. With expert knowledge
components, one can explicitly measure student proficiency
with simple models like Item Response Theory and Bayesian
Knowledge Tracing [2]. Using data mining on large datasets,
it is possible to relax the knowledge components to be latent
features that do not require human experts to partition the
domain into explicit student skills. Techniques in educational
data mining are inspired by techniques used in
collaborative filtering [i], such as factorization methods
[18], but also by techniques used in deep learning such as
deep knowledge tracing [13].


In this article, we will focus on matrix factorization methods. 
These are traditionally used in contexts where the available 
data is not very sensitive to time, for instance movie tastes 
and shopping habits. In contrast, students learn each time
they practice and should normally improve with time, so it
would make sense to take history into account when analyz- 
ing datasets, making predictions and doing structural stud- 
ies. However, we do not know how much impact this has on 
the results. The question that we would like to raise here is 
whether taking history into account is that important or if 
it is still possible to make good predictions when consider- 
ing datasets as timeless. Recommending a good problem in 
terms of teaching is not an easy task, but it is even more dif- 
ficult when we cannot reliably predict whether the student 
will succeed or fail, and how long it will take them to do so. 
In the rest of this article, we will study how the matrix fac- 
torization algorithm behaves in three datasets from the ed- 
ucational data mining community compared to one dataset 
from the traditional collaborative filtering community. We 
will often deliberately leave out chronological information in 
the educational datasets to see how much information can 
still be extracted, compared to a traditional dataset. 


This article makes two main claims: 

• Educational datasets contain much less static information
than the usual datasets found in multimedia recommendation.
Hence, treating educational datasets without any
dynamic method should be avoided;

• The little static information they contain amounts to a
one-dimensional value per student or problem.


We also propose a pre-processing procedure
for educational datasets meant to facilitate prediction, even
though we could not find any variable that is accurately
predicted among all tested datasets, as well as a filtering
procedure to try to find clusters among students and problems
that are particularly accurately predicted.


2. RELATED WORK 


2.1 Data Pre-processing 

It is sometimes necessary or advantageous to perform pre-processing
of the data before trying to extract information.
For example, it was possible to reduce the classification
error in the MNIST database from 12% to 8%, keeping the
same linear classifier, by using deskewing pre-processing [3].
Regarding our knowledge tracing problem, many corrections
to the ASSISTment dataset are proposed by Xiong et
al. [19]. They are now included in the public dataset that
we use later. We also propose some more pre-processing in
Section 3.


2.2 Matrix Factorization 

Matrix factorization (MF) is a widely used technique in recommender
systems, as illustrated by its extensive usage in
the 2009 Netflix Prize Competition [7]. We consider a set
$U$ of $N$ users, a set $I$ of $M$ items, and a set of ratings $R$.
These sets are usually given as records $(u, i, r_{u,i})$, representing
how much ($r_{u,i}$) a given user $u$ likes item $i$. From these
we can build a sparse rating matrix $X \in \mathbb{R}^{N \times M}$. The goal
of matrix factorization is to find two matrices $W \in \mathbb{R}^{N \times k}$
and $H \in \mathbb{R}^{M \times k}$ (usually with low rank $k < N, M$) such
that $X$ is close to $WH^\top$. This is an optimization problem
written as:


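$$\operatorname*{argmin}_{W \in \mathbb{R}^{N \times k},\; H \in \mathbb{R}^{M \times k}} \; \sum_{(i,j) \in \Omega} \left( X_{ij} - w_i h_j^\top \right)^2 + \lambda \left( \|W\|_F + \|H\|_F \right) \tag{1}$$


where $\lambda$ is a regularization meta-parameter, $\|\cdot\|_F$ is the
Frobenius norm [21], $\Omega$ is the set of observed entries of $X$,
and $w_i$ and $h_j$ are rows of $W$ and $H$. Variants differ in their regularization
terms (bias, sparsity penalty...) and can incorporate a loss
function between $X_{ij}$ and $w_i h_j^\top$. We can now estimate unknown
ratings within the product $WH^\top$. In other words,
we look for signatures for users and items in the same latent
space of dimension $k$ (i.e., vectors of rank $k$), such that
the outcome of the user rating an item is close to the dot
product of these signatures.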
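To make the idea of user and item signatures concrete, here is a minimal sketch of a factorization fitted on observed entries only. It uses plain stochastic gradient descent with an L2 penalty on the signatures, chosen for brevity; it is not the coordinate descent solver used in our experiments, and the toy data, function names, and hyperparameter values are illustrative assumptions.

```python
import random
from math import sqrt


def predict(W, H, u, i):
    """Estimated rating: dot product of the user and item signatures."""
    return sum(w * h for w, h in zip(W[u], H[i]))


def factorize(records, n_users, n_items, k=2, lam=0.01, lr=0.05,
              epochs=500, seed=0):
    """Fit signatures W (users) and H (items) by SGD on observed records.

    records: list of (user, item, value) triples; unobserved pairs are
    simply absent. lam is the L2 regularization weight.
    """
    rng = random.Random(seed)
    W = [[rng.uniform(-0.1, 0.1) for _ in range(k)] for _ in range(n_users)]
    H = [[rng.uniform(-0.1, 0.1) for _ in range(k)] for _ in range(n_items)]
    for _ in range(epochs):
        for u, i, x in records:
            err = x - predict(W, H, u, i)
            for f in range(k):
                w_uf, h_if = W[u][f], H[i][f]
                W[u][f] += lr * (err * h_if - lam * w_uf)
                H[i][f] += lr * (err * w_uf - lam * h_if)
    return W, H


def train_rmse(records, W, H):
    """RMSE on the factorized entries themselves (not a held-out set)."""
    sq = [(x - predict(W, H, u, i)) ** 2 for u, i, x in records]
    return sqrt(sum(sq) / len(sq))


# Toy data: users 0 and 1 do well on items 0 and 1, user 2 only on item 2.
# The pair (1, 1) is unobserved.
records = [(0, 0, 1.0), (0, 1, 1.0), (0, 2, 0.0), (1, 0, 1.0),
           (1, 2, 0.0), (2, 0, 0.0), (2, 1, 0.0), (2, 2, 1.0)]
W, H = factorize(records, n_users=3, n_items=3)
fit_error = train_rmse(records, W, H)
guess = predict(W, H, 1, 1)  # estimate for the unobserved pair
```

Here `guess` plays the role of an estimated unknown rating read from the product of the two signature matrices.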


This optimization problem is non-convex in general, but
different methods exist [7]. The Alternating Least Squares
(ALS) method is the most popular method as it converges
better than the Stochastic Gradient Descent (SGD) method
due to non-convexity. When large-scale data is involved,
ALS is not easy to parallelize, and coordinate descent is
preferred [6, 21]. In a knowledge tracing setting, users are
students and items are problems [18]. Problems are
usually split into smaller components that are the problem
steps. We will see in Subsection 3.1 why we recommend a
first regrouping pass in order to work with whole problems.


2.3 Cold Start Problem and Online Settings 


The cold start problem is a typical problem in recommender
systems that corresponds to the initial phase of a “nude”
system (no data collected yet). The lack of data makes the
prediction accuracy unreliable at that early stage. MF techniques
are not designed to tackle the cold start problem, but
some extensions seek to solve it partially [10]. As we
are not focused on prediction accuracy, we will not consider
these extensions in this article, but we will try to evaluate
when the cold start problem ends, that is, when
there is enough data for MF to start giving results. Trivedi
et al. try to solve a cold start problem in an ITS environment
with spectral clustering to help refine raw prediction,
but they work on the raw features of datasets without student
or item signatures.


Even after the cold start, the system usually benefits from
new data. This is referred to as online recommendation,
and MF is widely studied in such a context
[9, 22]. These works consider extremely large datasets, on
the order of millions of users and items, but it is still feasible
to redo a factorization after adding a few elements, as
we will see for instance in Sections 4 and 5.


3. FLAT PREDICTION AND AGGREGATION 


In this section we study matrix factorization (MF) on four
different datasets, and find that not all datasets are directly
usable without some pre-processing, compared to classical
datasets. We use the basic version (L2 regularization)
of Equation 1 with a fixed rank of 20 (apart from the last
section where we measure the impact of rank variation).
We use coordinate descent for the optimization because
some experiments in Sections 4 and 6 require numerous
factorizations.


3.1 Educational Datasets and Pre-processing 
We will use three common educational datasets for the rest
of the article: Algebra I 2006-2007 [14], Bridge to Algebra I
2006-2007 (both of these come from the Cognitive Tutor
problem set) and ASSISTment09 (we use the corrected
and collapsed version of the dataset). All three datasets
record scaffolding problem statistics (also called steps in
Cognitive Tutor datasets; we will use both terms here).
For each record (also named sample), we extract:


1. A student and a main problem ID;

2. A scaffolding problem ID;

3. A timestamp when a student starts a step and the
duration to complete the step;

4. Whether the student succeeded at their first attempt:
Correct-First-Attempt (CFA);

5. The number of hints and errors of the student for this
step.


Table 1: Raw data sets overview

Data set                      | Users | Problems | Steps   | Steps occurring once | Mean samples per step | Samples
Algebra I 2006-2007           | 1338  | 5644     | 418 060 | 314 198              | 5.4                   | 2 270 384
Bridge to Algebra I 2006-2007 | 1146  | 14 787   | 202 672 | 46 935               | 18.1                  | 3 679 199
ASSISTment09                  | 4217  | 17725    | 26 688  | 3123                 | 13.0                  | 346 660
ML-1M                         | 6040  | 3706     | N/A     | N/A                  | 269.9                 | 1 000 209


Table 2: Preprocessed data sets overview

Data set                      | Users | Problems | Samples   | Density | Mean samples per problem | Success percentage
Algebra I 2006-2007           | 1147  | 3111     | 152 709   | 0.043   | 49.1                     | 0.79
Bridge to Algebra I 2006-2007 | 1068  | 8736     | 235 147   | 0.025   | 26.9                     | 0.91
ASSISTment09                  | 2025  | 12587    | 238 746   | 0.009   | 19.0                     | 0.98
ML-1M                         | 6040  | 3706     | 1 000 209 | 0.045   | 269.9                    | N/A


To our surprise, these datasets are not very usable without 
pre-processing in comparison with well-established recom- 
mender system datasets like the MovieLens dataset. We will 
compare most experiments with the ML-1M version of this 
dataset, which will act as a “control” dataset: multimedia 
recommendation datasets being the canonical use of matrix 
factorization for recommender systems. The two main rea- 
sons for this poor usability that we try to mitigate with 
pre-processing are the following: 


• The notion of scaffolding problem is not standardized
between the datasets and is hard to use as is. Some
scaffolding problems are optional, which makes the number
of steps for a main problem vary between students. The step
order may also change between users, which makes matching
between users more difficult at the problem level.

• There is no guarantee of a minimum number of occurrences
for a student or a step. Moreover, many steps
are done by a single student across the whole dataset,
as seen in Table 1 (especially for the Algebra I dataset,
where steps can be generated for a student from a template,
and are thus unique; these constitute up to three
quarters of the steps).


Our first pre-processing pass, which is motivated by the 
very low number of samples per step on average, corre- 
sponds to aggregating all the steps of a common main (stu- 
dent/problem) pair together. Aggregating timestamps and 
durations is straightforward (the beginning of a problem is 
the beginning of the first step and the total duration is the 
sum of the steps’ durations). To aggregate “Correct-First- 
Attempt” we take the mean across a (student/problem) pair 
so we obtain a floating point value between 0 and 1 instead 
of a boolean value. 
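The aggregation pass described above can be sketched as follows. The record layout (dictionaries with `student`, `problem`, `start`, `duration`, and `cfa` keys) is a hypothetical format chosen for the example, not the schema of the actual datasets.

```python
from collections import defaultdict


def aggregate_steps(step_records):
    """Collapse step-level samples into one record per (student, problem).

    step_records: list of dicts with hypothetical keys 'student',
    'problem', 'start', 'duration' and 'cfa' (0 or 1).
    """
    groups = defaultdict(list)
    for rec in step_records:
        groups[(rec["student"], rec["problem"])].append(rec)

    aggregated = {}
    for key, recs in groups.items():
        aggregated[key] = {
            # the problem starts when its first step starts
            "start": min(r["start"] for r in recs),
            # total duration is the sum of the step durations
            "duration": sum(r["duration"] for r in recs),
            # mean CFA gives a value between 0 and 1
            "cfa": sum(r["cfa"] for r in recs) / len(recs),
            "n_steps": len(recs),
        }
    return aggregated


steps = [
    {"student": "s1", "problem": "p1", "start": 10, "duration": 5, "cfa": 1},
    {"student": "s1", "problem": "p1", "start": 15, "duration": 7, "cfa": 0},
    {"student": "s2", "problem": "p1", "start": 30, "duration": 4, "cfa": 1},
]
aggregated = aggregate_steps(steps)
print(aggregated[("s1", "p1")])
# {'start': 10, 'duration': 12, 'cfa': 0.5, 'n_steps': 2}
```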


Simply aggregating hint and error counts by summing them 
is not satisfactory because ultimately we want to have an 
idea of how much a student struggled on a problem. Sum- 
ming these quantities is not sufficient to access some basic 
information such as “Has the given student reached the end 
of the problem or given up?” This information is not pro- 
vided in the datasets, so we had to build a proxy variable. 
To answer this question, we need to know, for a given prob- 
lem, the number of basic steps it decomposes into. To find 
this quantity, which we call the problem size, we counted 
for each problem/student pair the number of samples. For a 
student who succeeded (possibly with hints and intermediate 
mistakes), the problem size and this number should match. 
We assumed that for a given problem, the most represented 
number (among all students) was the actual problem size. 
We believe that this high representativeness comes from the 
fact that the ITS providing the datasets give enough hints 
for most students to reach the end of the problem before 
giving up. This makes the number of hints and errors valu- 
able information to measure the difficulty of a problem for 
a given student. 
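A minimal sketch of the problem-size proxy follows. The input format and function names are assumptions for illustration, and treating "reached the end" as an exact match between a student's sample count and the problem size is our reading of the description above, not a detail stated in it.

```python
from collections import Counter, defaultdict


def problem_sizes(step_counts):
    """Most frequent number of samples per problem, used as its size.

    step_counts: {(student, problem): number of step samples}
    (hypothetical format).
    """
    per_problem = defaultdict(list)
    for (_student, problem), n in step_counts.items():
        per_problem[problem].append(n)
    return {p: Counter(ns).most_common(1)[0][0] for p, ns in per_problem.items()}


def success_reached(step_counts):
    """Boolean proxy: did the student's sample count match the problem size?"""
    sizes = problem_sizes(step_counts)
    return {key: n == sizes[key[1]] for key, n in step_counts.items()}


counts = {("a", "p"): 4, ("b", "p"): 4, ("c", "p"): 2}
sizes = problem_sizes(counts)
reached = success_reached(counts)
print(sizes)    # {'p': 4}
print(reached)  # {('a', 'p'): True, ('b', 'p'): True, ('c', 'p'): False}
```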


Once we have a boolean proxy indicating success by reaching 
the end of a problem, we can derive two variables: reaching 
the end without errors and reaching the end without hints. 
We can also build a difficulty variable to aggregate the hint 
and error counts: we sum the two counts with a 0.5 coeffi- 
cient for hints. We represent failure by assigning a difficulty 
value of twice the maximum value. 
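The difficulty variable can be sketched in a few lines. The text does not say over which set the maximum is taken; the sketch below assumes, as a labeled guess, the maximum among students who reached the end of the same problem. The input format is hypothetical.

```python
def difficulty_scores(attempts):
    """Difficulty per attempt on one problem: errors + 0.5 * hints.

    attempts: list of dicts with hypothetical keys 'hints', 'errors'
    and 'reached' (the success proxy). Failed attempts receive twice the
    maximum value, here assumed to be the maximum among successful
    attempts on the same problem.
    """
    raw = [a["errors"] + 0.5 * a["hints"] for a in attempts]
    reached_raw = [r for r, a in zip(raw, attempts) if a["reached"]]
    max_value = max(reached_raw, default=0.0)
    return [r if a["reached"] else 2.0 * max_value
            for r, a in zip(raw, attempts)]


attempts = [
    {"hints": 2, "errors": 1, "reached": True},
    {"hints": 0, "errors": 0, "reached": True},
    {"hints": 4, "errors": 3, "reached": False},
]
scores = difficulty_scores(attempts)
print(scores)  # [2.0, 0.0, 4.0]
```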


After aggregation, we have six variables (called target variables
or simply targets from now on) of interest for each
student/problem interaction: duration (0-1 scale value), difficulty
(0-1 scale value), correct-first-attempt (0-1 scale value),
success-reached (boolean value), success-no-error (boolean
value), and success-no-hint (boolean value). The first three are
normalized per problem so that for each problem, the “worst”
student gets a value of 0, and the best one a value of 1 (giving
rise to what we called above a 0-1 scale value). ML-1M has
a single target which is the movie rating (also normalized
for comparison). After aggregation, we filter out users who
have done fewer than 20 problems and problems that are
done fewer than 5 times (same threshold as for ML-1M).
Table 2 shows the size of the datasets after pre-processing.
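The normalization, filtering, and density computations can be illustrated as follows. Function names are hypothetical; applying both filters in a single pass on the original counts, and mapping a problem where all students tie to 1.0, are assumptions of this sketch. The density formula (samples divided by users times problems) reproduces the values reported in Table 2.

```python
from collections import Counter


def normalize_per_problem(values, higher_is_better):
    """Rescale one problem's values so the worst student gets 0 and the best 1."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [1.0] * len(values)  # assumption: ties all map to 1.0
    if higher_is_better:
        return [(v - lo) / (hi - lo) for v in values]
    return [(hi - v) / (hi - lo) for v in values]


def filter_pairs(pairs, min_per_user=20, min_per_problem=5):
    """Keep (user, problem) pairs whose user and problem are frequent enough."""
    user_counts = Counter(u for u, _ in pairs)
    problem_counts = Counter(p for _, p in pairs)
    return [(u, p) for u, p in pairs
            if user_counts[u] >= min_per_user
            and problem_counts[p] >= min_per_problem]


def density(n_samples, n_users, n_problems):
    """Fraction of the user x problem matrix that is observed."""
    return n_samples / (n_users * n_problems)


durations = normalize_per_problem([30.0, 60.0, 90.0], higher_is_better=False)
print(durations)                                # [1.0, 0.5, 0.0]
print(round(density(152709, 1147, 3111), 3))    # 0.043 (Algebra I)
print(round(density(1000209, 6040, 3706), 3))   # 0.045 (ML-1M)
```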


3.2 Influence of Aggregation on Datasets 
The figures in this section (up to Figure 3) report the ability of a factorization to
accurately model the different target variables on the four
datasets and compare the effect of aggregation. For 0-1 scale
variables we use the root mean square error (RMSE; the
smaller the better) as the error metric, and for boolean variables
we use the area under the Receiver Operating Characteristic
curve (ROC AUC; the closer to 1 the better). For ML-1M, the
only possible target variable is the movie rating. We report
it as a horizontal line.
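For reference, both metrics can be written in a few lines. This is a plain illustrative implementation, not the code used for the experiments; the ROC AUC is computed through its rank interpretation (the probability that a positive example is scored above a negative one, with ties counting one half).

```python
from math import sqrt


def rmse(y_true, y_pred):
    """Root mean square error between targets and predictions."""
    return sqrt(sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true))


def roc_auc(labels, scores):
    """Probability that a random positive is scored above a random negative."""
    pos = [s for l, s in zip(labels, scores) if l]
    neg = [s for l, s in zip(labels, scores) if not l]
    if not pos or not neg:
        raise ValueError("both classes must be present")
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


print(rmse([0.0, 1.0], [0.5, 0.5]))                  # 0.5
print(roc_auc([1, 1, 0, 0], [0.9, 0.4, 0.5, 0.1]))   # 0.75
```

The pairwise loop is quadratic in the number of samples, which is fine for a sketch but would be replaced by a rank-based computation on datasets of the size used here.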


We see that the aggregation and filtering procedure
improves the prediction quality of the Cognitive Tutor
datasets in a notable way, but only by a small margin for the
ASSISTment dataset. We believe this is because, even though
the pre-processing removed about one third of the problems and
half of the students, the density is still very low compared to
the others. In all the remaining experiments, we will use
the aggregated versions of the datasets.


It is hard to find any trend regarding the RMSE differences
in Figure 2. Variations seem to indicate that different
datasets favor different target variables. MF can have about
the same prediction capability for educational datasets and 
multimedia datasets if the target variables are chosen care- 
fully, which suggests that situations call for pre-analyses in 
order to select the target variable which will be the most 
accurately predicted. 


In Figure 3, we can see that accuracy on success classification
is reasonably good. However, we cannot explain the
difference in ASSISTment between success-reached and the
two other success target variables. This might stem from the 
aggregation procedure that relies on approximated methods 
to obtain the number of steps in a problem. We will not do 
any further experiments on these target variables (which are 
boolean), as they are barely comparable with the 0-1 scale 
variable of the ML-1M dataset. 


4. ONLINE PREDICTION 


In this section, we will try to evaluate the point at which 
there is enough information to predict reasonably well with 
MF techniques. This allows the system to stop using what- 
ever bootstrapping technique it was using to solve the cold 
start problem. 


To evaluate this we start with either the full student set
(and no problems) or the full problem set (and no students).
We then progressively add new problems in the first case and
new students in the second case, adding 20 new elements at
each iteration. At each iteration we redo a full factorization
and evaluation as if the system was complete.


Figure 2: RMSE for duration, correct first attempt and difficulty


We independently measure the RMSE of the correct-first-attempt
and difficulty variables. To evaluate whether the order in
which new elements are added makes a difference, we
considered three orders: (i) elements sorted by their number
of occurrences in decreasing order (high density first); (ii)
in increasing order (low density first); and (iii) in
chronological order (chrono). Only the chronological order
makes sense in an online context, but we still use the
occurrence-based orders to evaluate whether or not we benefit
from a higher density.
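The three insertion orders and the batches of 20 elements can be sketched as follows. The sample format and function names are hypothetical, and reading "chronological order" as the order of first appearance in the log is an assumption of this example.

```python
from collections import Counter


def growth_order(samples, axis, order):
    """Order in which elements are added to the growing system.

    samples: chronological list of (student, problem) pairs.
    axis: 0 to order students, 1 to order problems.
    order: 'chrono', 'high_density_first' or 'low_density_first'.
    """
    first_seen = {}
    counts = Counter()
    for t, pair in enumerate(samples):
        element = pair[axis]
        counts[element] += 1
        first_seen.setdefault(element, t)  # assumption: chrono = first appearance
    elements = list(first_seen)
    if order == "chrono":
        return sorted(elements, key=first_seen.get)
    if order == "high_density_first":
        return sorted(elements, key=lambda e: (-counts[e], first_seen[e]))
    if order == "low_density_first":
        return sorted(elements, key=lambda e: (counts[e], first_seen[e]))
    raise ValueError(f"unknown order: {order}")


def batches(ordered, size=20):
    """Split the ordered elements into the groups added at each iteration."""
    return [ordered[i:i + size] for i in range(0, len(ordered), size)]


samples = [("s1", "a"), ("s1", "b"), ("s2", "b"),
           ("s3", "b"), ("s2", "c"), ("s3", "c")]
print(growth_order(samples, 1, "chrono"))              # ['a', 'b', 'c']
print(growth_order(samples, 1, "high_density_first"))  # ['b', 'c', 'a']
print(growth_order(samples, 1, "low_density_first"))   # ['a', 'c', 'b']
```

At each iteration, the factorization would then be recomputed on the samples restricted to the elements added so far.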


We report in Figure 4 the results of this experiment. We can
see that for three out of four datasets (not ASSISTment), 
adding elements by highest density makes the system con- 
verge really fast (about 200 elements for Bridge), which was 
to be expected as those elements carry the most informa- 
tion. For all datasets, adding elements by lowest density, as 
we might expect, makes the system converge really slowly. 
We believe that the extremely accurate prediction on some 
of the curves for the first few iterations of the growing pro- 
cess is due to overfitting (recall that the factorization uses a 
rank of k = 20 in those experiments). 


Still, there are some artifacts in these results: the
previous claims are sometimes reversed for the difficulty
target. Maybe this is a hint that this aggregated variable may
not be robust enough on all systems. Our advice is to
systematically test target variables on a system to make sure
that the ones we choose are consistent and can be trusted.


Finally, we do not observe any “dramatic” drop of the RMSE 
in curves representing the chronological order that we could 
clearly label as the “cold start” (although it sometimes takes 
a few “adds” to stabilize). Of course, the highest accuracy is 
obtained whenever all the data is used, but this suggests that 
MF accuracy starts to get close to the maximum early in the 
process. However, bear in mind that we only evaluated our 
ability to model existing data (we evaluate on the matrix 
we factorize), but did not evaluate our ability to predict 
(by evaluating on the remaining, not factorized, part of the 
matrix). 
Figure 4: RMSE evolution for student and problem set growing on the four data sets.
If not specified the legend is the same as (b).


5. STUDENT OR PROBLEM PREDICTION KERNELS


In this section we search for a subset of students and prob- 
lems where the prediction is more accurate than on the rest 
of the dataset. Having such a subset can be of interest in 
various ways: further analysis like signature clustering might 
work better on a subset with high accuracy prediction, or 
this can be a first step towards building a confidence mea- 
sure for new predictions using a similarity measure with this 
accurate subset. 


Note that we briefly tried some clustering algorithms on the 
student and problem signatures given by MF, but they were 
not promising. We will explain in Section 6 how the signatures
we can obtain with MF may not be appropriate to
such a study. 


5.1 Iterative Filtering 
We describe an iterative procedure to filter students and 
problems that have the least accurate predictions. 


We alternately remove students and problems: at each iteration
we remove the 8% of the considered set that are the
least accurately predicted in terms of RMSE (or 15 elements
if 8% is lower than 15). We report in Figure 5 and Figure 6
the evolution of the density of the rating matrix and RMSE
for the difficulty target variable.
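A minimal sketch of this alternating filter is given below. Several points are assumptions made for illustration only: the callback `rmse_by_element` (which would refactorize the remaining matrix and return one RMSE per student or per problem) is hypothetical, the 8% is truncated to an integer, and students are filtered first.

```python
from typing import Callable, Dict, List, Set, Tuple

Ratings = Dict[Tuple[str, str], float]  # (student, problem) -> observed value


def density(ratings: Ratings, students: Set[str], problems: Set[str]) -> float:
    """Fraction of (student, problem) cells that hold an observation."""
    kept = sum(1 for (s, p) in ratings if s in students and p in problems)
    cells = len(students) * len(problems)
    return kept / cells if cells else 0.0


def drop_worst(errors: Dict[str, float], frac: float = 0.08, floor: int = 15) -> Set[str]:
    """Remove the worst-predicted `frac` of elements, but at least `floor`."""
    n_remove = min(len(errors), max(int(frac * len(errors)), floor))
    worst = sorted(errors, key=errors.get, reverse=True)[:n_remove]
    return set(errors) - set(worst)


def iterative_filter(
    ratings: Ratings,
    students: Set[str],
    problems: Set[str],
    rmse_by_element: Callable[[Ratings, Set[str], Set[str], str], Dict[str, float]],
    n_iter: int,
) -> Tuple[Set[str], Set[str], List[Tuple[str, float]]]:
    """Alternate between filtering students and problems; record the density."""
    history = []
    for it in range(n_iter):
        side = "students" if it % 2 == 0 else "problems"  # assumed starting side
        errors = rmse_by_element(ratings, students, problems, side)
        if side == "students":
            students = drop_worst(errors)
        else:
            problems = drop_worst(errors)
        history.append((side, density(ratings, students, problems)))
    return students, problems, history
```

The recorded `history` corresponds to the kind of density curve discussed next; the RMSE curve would come from the same callback.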


Figure 5 presents the variations in density as we progressively
remove students and problems. Interestingly, removing
items usually increases the density while removing students
decreases it in the three educational datasets, meaning
that the students that solved many problems are viewed
as “problematic” by the system. This behavior is not observed
in the reference ML-1M dataset. Figure 6 confirms
the tendency that removing students in the beginning tends
to improve prediction accuracy. This result is disturbing
as it means that, for the educational datasets, MF prefers
less dense matrices with regard to the users, i.e., less in- 
formation for a given student. This suggests that MF per- 
forms best when a problem was done by many students, 
but when the students have done few problems. What is 
interesting here is that this scenario is the one that resem- 
bles most closely the ML-1M dataset: by having students 
that did fewer problems, we are indeed eliminating students 
that likely progressed during the experiment, hence whose 
behavior cannot be represented by a single vector across all 
their interactions. This is a first solid hint that MF alone 
does not seem suited to educational datasets, as it shuns 
chronological subtleties. 


6. INFLUENCE OF RANK VARIATION 


In this section we repeat the experiments from previous sec- 
tions with different rank values. In addition to rank k = 20 
that we already measured, we use rank 5 and a rank of 1. We 
deliberately choose a rank of 1 to mimic a Whole History 
Rating (WHR) [3]. Even though it is not an exact corre- 
spondence, we believe that the information extracted by a 
MF with rank 1 can also be extracted by a WHR. 
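To make the rank-1 case concrete, the sketch below fits one scalar per student and one scalar per problem by alternating least squares on the observed entries. It is not the factorization used in the experiments: the function names are hypothetical, and the absence of regularization and bias terms is a simplifying assumption.

```python
from typing import Dict, Tuple

Ratings = Dict[Tuple[str, str], float]  # (student, problem) -> observed value


def rank1_factorize(ratings: Ratings, n_sweeps: int = 50):
    """Fit r[s, p] ~ skill[s] * difficulty[p] on the observed entries."""
    skill = {s: 1.0 for s, _ in ratings}
    difficulty = {p: 1.0 for _, p in ratings}
    for _ in range(n_sweeps):
        for s in skill:  # closed-form least-squares update of one student scalar
            num = sum(r * difficulty[p] for (s2, p), r in ratings.items() if s2 == s)
            den = sum(difficulty[p] ** 2 for (s2, p) in ratings if s2 == s)
            skill[s] = num / den if den else 0.0
        for p in difficulty:  # closed-form update of one problem scalar
            num = sum(r * skill[s] for (s, p2), r in ratings.items() if p2 == p)
            den = sum(skill[s] ** 2 for (s, p2) in ratings if p2 == p)
            difficulty[p] = num / den if den else 0.0
    return skill, difficulty


def rmse(ratings: Ratings, skill: Dict[str, float], difficulty: Dict[str, float]) -> float:
    """Root mean squared error on the observed entries."""
    sq = [(r - skill[s] * difficulty[p]) ** 2 for (s, p), r in ratings.items()]
    return (sum(sq) / len(sq)) ** 0.5
```

With rank 1, every student and every problem is summarized by a single float, which is the sense in which the factorization resembles a rating system.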


We see in Figures [9] and a clear difference between 
the educational datasets and ML-1M regarding the influ- 
ence of ranks. This benefit from rank increase agrees with 
the intensive use of MF techniques in multimedia recom- 
mender systems. However, the benefit of such an increase 
for educational datasets is almost negligible. This is particularly
apparent where the ML-1M RMSE
curves get lower with increasing rank while all other curves
are nearly indistinguishable by rank. This shows that, when
the chronological information is not used, vectors of size 5
or 20 do not improve accuracy compared to a simple vector
of size 1, i.e., a single float. This suggests that we cannot
do better than assign a single number to problems and
students, which could be interpreted as having a “difficulty”
rating for problems and a “skill” rating for students, mimicking
a WHR rating system. This is a second strong hint that
educational datasets do not have the same structural properties
as datasets from multimedia recommenders, and that
if we want to extract more information and discover a better
characterization of students and problems, it is necessary to
consider chronological information.


Figure 5: Evolution of density during filtering with target Difficulty

Figure 6: Evolution of RMSE during filtering with target Difficulty

Figure 7: Evolution of flat factorization RMSE depending on rank relatively to k = 1

Figure 8: Evolution of RMSE while adding problems with target Difficulty

Figure 9: Evolution of RMSE while adding students with target Difficulty

Figure 10: Evolution of RMSE during filtering with various targets: (a) Duration, (b) Difficulty


Proceedings of The 14th International Conference on Educational Data Mining (EDM 2021)


7. CONCLUSION


We applied preprocessing to common educational datasets
to try to improve the accuracy of MF techniques. While
this did improve the results, we also showed that when
MF techniques from the collaborative filtering community
are directly applied, they do not benefit from having ranks
higher than 1, meaning that the attribution of a single value
to students and problems is about as effective as we can get.
This seems to indicate that MF techniques might not be
the most efficient model to extract static information from
these datasets, or, more probably, that static information
is scarce. We believe that this stems from the fact that,
unlike users in multimedia recommender systems, students
change over time as they are faced with new problems but
also from outside interactions not recorded in the ITS; hence
chronological information needs to be taken into account
in order to improve accuracy and make predictions. Still,
in the absence of more sophisticated analyses in a
recommender system, MF can be used to extract a crude
measure of what could be labeled as a level of difficulty of a
problem and a level of proficiency or skill of a student.
ABSTRACT


Finding the optimal topic sequence of online courses requires
experts with extensive knowledge of the taught topics. A good
order is necessary for a good learning experience. When
educational recommender systems are used across different platforms,
the connection to an ontology sometimes does not exist. Thus,
state-of-the-art recommenders can suggest courses in an optimal
order within a platform, but on a more global view, a
recommendation across different platforms in an optimal order
does not exist as long as no ontology has been defined or
courses are not connected to an existing ontology. Current
experimental approaches manipulate the learning paths to find the
optimum. As this can impact the learning experience of
participants, this approach is ethically unacceptable. To overcome
this problem, we propose a data-driven approach using the search
engine result pages (SERPs) of Google. In our experiment, we used
pair-wise search queries to retrieve web pages; the resulting 38,000
texts were used to test several NLP metrics. Ten different metrics
were examined to create an order that was compared to the
optimal sequence defined by experts. We observed that the
Gunning Fog Index is a good estimator to determine the optimal
order within a cluster of topics.


Keywords 


Course Sequencing, educational recommender system, web search, 
adaptive courseware, personalization. 


1. INTRODUCTION 


Providing the optimal sequence of topics in online courses is of
high interest because it influences the learning outcome as well as
motivation. Many MOOCs exist, but the order in which they
should be taken is defined by experts, which is a time-consuming
procedure. Large-scale educational recommender systems [1]
suggest online courses across different platforms. Creating an
optimal sequence based on an ontology is an easy solution, as an
ontology includes the optimal order, defined by human experts.
This can be done within single platforms, but an ontology spanning
courses across several platforms does not exist. McCrae
et al. [2] state that it “is difficult to link to ontologies”. The
willingness to connect one's own online courses to an
existing public ontology is low: on the one hand, this is expensive
due to manual work; on the other hand, it can result in course
recommendations of other suppliers, which does
not meet the interests of the suppliers.


The optimal sequence is missing in recommender systems as long 
as no manually created large-scale ontology or optimal sequence 
exists. Recommender systems only provide a ranking based on how 
well the suggested courses fit into the user's learning situation. 
There are existing approaches, e.g. linked data to create a structured 
semantic web [3]. Their idea is to create a network that contains the 
meaning of the data. But there is the problem that the semantic web 
is limited to specific domains. If the networks have not been created 
for the topics that we need, we cannot use them. Besides, the 
structure in the semantic web is designed to understand 
relationships between objects, not whether there is a dependency 
from the educational perspective. Further on, there is the problem 
that topics for online courses often consist of multiple words to 
describe the topic or concept. Finding the correct corresponding 
concept within the semantic web can be challenging. 


Having an optimal order of online courses is of high interest in 
online education as many topics require the knowledge of 
subtopics. Knowledge dependencies can be modeled by experts 
manually on the one hand, but this is a cost-intense procedure that 
requires lots of knowledge about the taught topics and provided 
courses as well. On the other hand, the world wide web is full of 
contents of different quality. Every topic that can be taught can be 
found there, but the contents of web pages are still not used for topic 
sequencing in education. Crawlers get access to all the texts and 
companies like Google define an order of pages related to a search 
query. Within a search engine, we get access to all pages that they 
define to satisfy the user intent [4]. Using this large number of 
pages for each topic could be beneficial in creating optimal topic 
sequences for online courses. 


An optimal order is very important for a good learning experience 
in online courses. We define an optimal order as the sequence of 
course topics where each topic should be taught when all pre- 
requirements are fulfilled based on the previous courses. As long 
as topics are taught where the requirements are missing, the dropout 
rate will be high. Using courseware (single parts of a course) [5] to 
generate a new online course it is important to have an optimal 
order. Otherwise, the participant cannot understand the topic 
because of missing knowledge. The same problem exists in AI- 
generated learning paths of online courses, which must be 
consistent according to the fundamental didactical method of 
starting teaching basics, not with specialized knowledge. 


Observing the world wide web, we can get a variety of texts on any
topic. We want to use this already existing large set of pages to find 
the optimal topic order of online courses with an experimental 
approach. To access all the pages with corresponding texts that we 
need, we use the search engine Google, especially the Search 
Engine Result Pages (SERPs) [6]. It is known that Google uses the
semantic web in the background, depending on the search query.
Using the search engine as a proxy therefore helps to overcome the
challenge of finding the corresponding concept in the semantic web
[7]. This is beneficial when no linked data exists for a specific
domain, as the search engine tries to provide related resources
even if queries are misspelled, and it can give results for queries
that it has never seen before.


Using topics as keywords results in a list of pages that satisfy the 
user intent according to Google [4]. This can be used as a base for 
having access to different features for ordering course topics. 
SERPs indicate the popularity of a topic through the number of pages
indexed by the search engine, which could be an indicator for good
sequencing, as fewer pages exist for specialized topics than for
general basic topics. Besides, when looking up two topics in one
search, we get pages that contain both keywords, whose frequency
or deviation could be an indicator for finding the optimal sequence.
If we observe online courses, then we usually have an increasing 
difficulty level. Using the complexity of texts could help to estimate 
the optimal order based on the difficulty. 


In this paper, we concentrate on the three research questions: 


1) Is the SERP popularity of all topics a good indicator to find an 
optimal topic sequence for online courses? 


2) Is the topic frequency of page texts that are listed within the 
SERPs an estimator to determine the optimal topic sequence in 
online courses? 


3) Does ordering topics’ texts by text difficulty metrics result in a 
sequence that is appropriate to be used as a sequence in online 
courses? 


2. RELATED WORK 


Brusilovsky et al. [5] define this problem as “sequencing of 
lessons” where each lesson is connected to a topic. This contains 
numerous chunks of educational material, ranging from videos and 
texts to different interactive tasks. The authors use a domain 
concept structure, that is stored independently from teaching 
materials. Each concept needs to be linked to the teaching material. 
It has the advantage of being able to use the courseware to generate 
a personalized online course according to the interests and 
knowledge gaps of a learner. This approach is comparable to using 
an ontology that needs to be defined by experts, based on rules and 
graph representation. It is the fundamental model to define an 
optimal sequence of online courses but requires the creation of the 
ontology by experts. 


S. Fischer [8] uses an ontology knowledge base, namely a 
“knowledge library” to create an optimal course sequencing. 
Therefore, they use modularized media content as courseware 
together with metadata that describes the link to the ontology 
model. With that, they have access to a taxonomy that can be used 
to create a good ordering of topics as well as generating questions 
with right and wrong answers (depending on the granularity of the 
ontology). The modular resources can be used to generate courses, 
according to the knowledge gaps of learners. 


Xu et al. [9] propose to learn from users providing specific course 
sequences for testing and use their performance to create an optimal 
sequence for new users. While this approach works, it has the
disadvantage that it requires real test users, who may perform
badly within the scenario. Doing this in a field study is acceptable,
but using real students is not sustainable from an ethical point of
view. We want to emphasize that we do not want to use this
experimental user behavior data as this is ethically not acceptable.


Cucuringu et al. [10] used already captured student participation in 
courses to create pair-wise comparisons using ranking aggregation 
to create a global ranking. This ranking proposes an order of how 
courses should be taken by students. One major problem is 
incomplete data as some pairings are not existing for a comparison. 


S. Morsy [11] states that a global ranking of online courses cannot 
be used for personalized recommendations. But having a global 
ranking can be helpful to determine which courses should be done 
in which order. Combining this knowledge with personalized 
courses or topic recommendations is helpful as the course 
dependencies (e.g. what knowledge is necessary to understand a 
topic) are the same for personalized recommendations, which are 
filtered by topics/concepts that the learner is already aware of. Thus 
having a global ranking can be beneficial for personalization as 
well. 


Using the information of chosen courses by students and their 
performance is a good way to determine an optimal course 
sequence. A major limitation with that approach is the limitation of 
data and to have access to chosen courses and the resulting 
performance. This approach does not comply with the GDPR as the 
information on whether students passed or failed an exam is 
classified to be sensitive personal data, that cannot be accessed for 
course sequencing in general [12]. Thus, their application does not 
work in a real-world scenario in the EU. Based on the limitations 
of being dependent on user performance or manually created 
ontologies, we propose a new methodology to create an optimal 
order of online courses, based on their topic. 


3. METHODOLOGY 


As we learned from Rüdian et al. [13], even if experts score
the same results of educational tasks, their scores vary among each
other. If we observe the order of topics, then we know that there is
not always a perfect solution regarding the whole sequence because
of ambiguous expert opinions. In the pre-study, four experts (AI
instructors) had the task to create the optimal order of 20 AI-related
topics to be taught within online courses. We used the following
topics: neural networks, voice recognition, chatbots, Linux, data
visualization, Python, statistic basics, part-of-speech tagging,
LSTM, data preparation, deep learning, TensorFlow, object
recognition, Naive Bayes, natural language processing, ethical
principles, clustering, reinforcement learning, cross-validation,
and regression. The resulting sequences are then used to make a
pair-wise comparison to understand the overlap across instructors
and to see where we have a high overlap. The pair-wise sequence
score S is defined as follows: for every pair of topics with positions
A and B in the expert sequence (A ≠ B) and positions C and D in the
sequence derived by the algorithms or another expert (C ≠ D), we count
all hits where (A < B and C < D) or (A > B and C > D), i.e.,
the topics have the same order within both sequences. This number,
divided by the number of possible combinations, is defined as S.
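The score S can be sketched as follows. The function name is hypothetical, and the sketch assumes that both sequences contain the same distinct topics. It counts unordered pairs; counting ordered pairs would double both the hits and the number of combinations and therefore give the same ratio.

```python
from itertools import combinations
from typing import Sequence


def pairwise_sequence_score(reference: Sequence[str], candidate: Sequence[str]) -> float:
    """Share of topic pairs that appear in the same relative order."""
    if len(reference) < 2 or len(set(reference)) != len(reference):
        raise ValueError("need at least two distinct topics")
    if len(candidate) != len(reference) or set(candidate) != set(reference):
        raise ValueError("both sequences must order the same topics")
    ref_pos = {topic: i for i, topic in enumerate(reference)}
    cand_pos = {topic: i for i, topic in enumerate(candidate)}
    pairs = list(combinations(reference, 2))
    # A hit: the pair has the same relative order in both sequences.
    hits = sum(
        (ref_pos[a] < ref_pos[b]) == (cand_pos[a] < cand_pos[b]) for a, b in pairs
    )
    return hits / len(pairs)
```

Identical sequences yield S = 1 and fully reversed sequences yield S = 0.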


Some topics have dependencies; e.g. neural networks should be 
introduced before teaching LSTM or natural language processing 
should be taught before starting with part-of-speech tagging; others 
do not have strong relations and can be taught somewhere, e.g. 
ethical principles or Linux. 


The idea of the main study is to compare the sequences created by
instructors with algorithmic ones. Therefore, it is a good foundation
to measure tutors’ decisions among each other first to define
accuracy as a gold standard that we want to achieve with our 
methodology. Thus, we need a realistic generalizable accuracy that 
we should achieve instead of over-optimizing our approach with 
high accuracy that is the optimum for a sequence of one expert only. 
Besides, the pre-study identifies the optimal order of the topic 
subsets that are the same for all experts. The order of these topic 
subsets will be compared to the order that we get from our 
algorithms to see how well the algorithms perform in a real-world 
scenario. 


Our approach is to use the Search Engine Result Pages of Google 
(SERPs) and we derive different metrics based on the results. A 
search engine can be used to find web pages that are related to given 
keywords. One of the main purposes of the search engine Google 
is to satisfy the user intent by providing a list of web pages that are
related to the search query [4]. Thus, using it allows us to get pages 
that have a high authority according to the Google ranking 
algorithm, which is, according to them, a metric of high quality. We 
use this list of pages with our topics as search queries to understand 
the popularity, the number of user searches, the complexity of 
topics, and which topics have a semantic connection. These metrics 
are then used to create a sequence, based on a linear order of the 
observed data. These sequences are compared to the experts’ ones 
to understand whether there is a connection between our metrics 
and the optimal order, defined by experts. 


Figure 1. Pipeline for pair-wise search, extraction, and
separation of features.


The approach of using a search engine as a basis has the advantage 
that we do not need to do experiments with students where they 
may be badly influenced due to bad testing sequences. Thus we are 
independent and can use our approach on a larger scale. That makes 
our approach more practically usable. We use different data as a 
basis, use them to rank our 20 topics, and compare the order with 
the instructors’ ones. Our approaches are the following: 


1) We use the number of topic results that are estimated by the 
search engine by searching for every keyword separately. 


2) We search for pair-wise keyword combinations and observe the
number of estimated results.


3) We use the keyword search estimator and rank our topics 
according to the estimated search amount. 


4) We use the first 100 results of all pair-wise keyword 
combinations and count how often both keywords within the 100 
listed pages exist. 


¹ https://seorld.com/crawler


5) We use the first 100 result pages as in 4), search for both
keywords on the pages, and count how often each keyword
occurs first in the text.


We use the 100 result page texts and apply three algorithms to
estimate the text complexity, namely 6) Flesch-Reading-Ease
(FRE) [14], 7) RIX [15], and 8) Gunning Fog Index (GFI) [16].
Then we use the 100 result page texts with basic NLP metrics: 9)
the type-token ratio (TTR) and 10) the number of words per
sentence (NoW). We assume that observing how many pages
exist for a combination helps to identify topics that have a semantic
connection. The information on how many pages exist
gives hints about the popularity (1, 3), as for complex topics
there is usually less content than for basics. Observing the complexity
of the contents (6-10) could help to identify the difficulty level of
topics to find the optimal sequence. Figure 1 visualizes the method
for 4)-10) to get features based on a pair-wise topic search.


The Flesch-Reading-Ease Index is based on the “Standard Text 
Lessons in Reading” [17] and is calculated from the average 
sentence and word length [14]. The main idea of the Gunning Fog 
Index is to reduce the complexity in newspapers as a kind of 
warning system for authors that texts are not “unnecessary 
complex”. Therefore, the author uses the sentence lengths, the 
number of syllables, easy words, and hyphenated words to estimate 
the complexity of a text [18]. The “Regensburger Index“ (RIX) uses 
difficulty parameters like passive, sentence complexity, and 
predications to derive the complexity [15]. All approaches differ in 
the selection of features that are used to create the indexes. 
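Some of these metrics can be sketched in a few lines. The sketch uses the commonly cited textbook formulas for the Gunning Fog Index and the Flesch Reading Ease (with the constants for English). It is not necessarily the exact variant used in this study: the sentence splitter and the vowel-group syllable counter are crude heuristics chosen for illustration, and all function names are hypothetical. RIX is omitted because its definition is given in [15] and not reproduced here.

```python
import re
from typing import List


def sentences(text: str) -> List[str]:
    """Split on '.', '!' or '?' (a crude heuristic)."""
    return [s for s in re.split(r"[.!?]+", text) if s.strip()]


def words(text: str) -> List[str]:
    """Alphabetic tokens, optionally with one internal apostrophe."""
    return re.findall(r"[A-Za-z]+(?:'[A-Za-z]+)?", text)


def syllables(word: str) -> int:
    """Approximate syllables as the number of vowel groups (at least 1)."""
    return max(1, len(re.findall(r"[aeiouy]+", word.lower())))


def words_per_sentence(text: str) -> float:  # metric 10) NoW
    return len(words(text)) / len(sentences(text))


def type_token_ratio(text: str) -> float:  # metric 9) TTR
    tokens = [w.lower() for w in words(text)]
    return len(set(tokens)) / len(tokens)


def gunning_fog(text: str) -> float:  # metric 8) GFI, textbook formula
    w = words(text)
    complex_words = [x for x in w if syllables(x) >= 3]
    return 0.4 * (words_per_sentence(text) + 100 * len(complex_words) / len(w))


def flesch_reading_ease(text: str) -> float:  # metric 6) FRE, English constants
    w = words(text)
    total_syllables = sum(syllables(x) for x in w)
    return 206.835 - 1.015 * words_per_sentence(text) - 84.6 * total_syllables / len(w)
```

Note that the two indexes point in opposite directions: a higher GFI indicates a more complex text, whereas a higher FRE indicates an easier one.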


Finally, we use a random forest regressor [19] to predict the pair- 
wise sequence, using the data of 4)-10) to estimate the feature 
importance to support our findings. To get all the data, including 
the SERPs, all pages, and the estimated search amount, we use a 
commercial web crawler for SERPs¹. This is necessary as the pair-
wise lookup of 20 keywords results in 20 × 20 − 20 = 380
searches, where we need to download 100 web pages each,
resulting in 38,000 files. A simple crawler that we had used in our
lab before was banned after 20 crawls; thus, using a commercial one
is the most efficient option.
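The number of searches and downloads follows directly from the ordered topic pairs, as the short sketch below illustrates. The function name `search_plan` is hypothetical; the crawler itself is not modeled.

```python
from itertools import permutations
from typing import List, Tuple


def search_plan(
    topics: List[str], pages_per_query: int = 100
) -> Tuple[List[Tuple[str, str]], int]:
    """Ordered topic pairs (A + B and B + A) and the number of page downloads."""
    queries = list(permutations(topics, 2))  # n * n - n ordered pairs
    return queries, len(queries) * pages_per_query
```

For 20 topics this yields 380 queries and 38,000 page downloads, matching the figures above.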


Each data source 1) — 10) is then used to create a ranking of topics, 
based on their linear order. These sequences are compared with the 
expert ones to find the optimal feature that can be used in a real- 
world setting. To compare the sequences of the experts with the 
algorithmic ones, we use a pair-wise topic comparison to test 
whether the order is the same in both sequences and summarize the 
hits. Thus we can compute the overlap that represents the accuracy 
in our experiments. 


4. RESULTS 


The overlaps across the expert sequences range from 0.6 to 0.8 
(Table 1). This gives us an orientation for the maximum overlap that
can be achieved with our approach. While the overall
sequences defined by experts are partly different, we identified 
some partial sequences that are identical across all expert-based 
rankings and use them as ground truth. We detected some matching 
sequences of topics: 


A = [“data preparation” → “data visualization” → “clustering”],
B = [“neural networks” → “deep learning” → “LSTM”], and
C = [“natural language processing” → “part-of-speech tagging” →
“voice recognition” → “chatbots”],
where [A → B] means that topic A needs to be explained before
topic B. This makes sense as each topic mostly requires knowledge 
of the previous one(s), e.g. “neural networks” have to be introduced 
first and after that, “LSTM” can be explained. We use the three 
clusters (A, B, and C) to visualize whether our rankings make sense 
in a real-world scenario as the overlap of sequences defined by a 
number only is too abstract. All in all, in our pre-study we can 
conclude that we identified three clusters using sequences of four 
experts. 


Table 1. Pair-wise sequence overlaps of 4 AI experts. 


         | Expert 1 | Expert 2 | Expert 3 | Expert 4
Expert 1 |    -     |   .60    |   .65    |   .80
Expert 2 |    -     |    -     |   .65    |   .65
Expert 3 |    -     |    -     |    -     | [illegible]
Expert 4 |    -     |    -     |    -     |    -


Then we used all the different data points that we got from the 
crawler separately and created a sequence based on their linear 
order. Table 2 shows all the results of our experiments. We 
calculated the pair-wise overlap to compare estimated sequences 
with the expert ones. Also, we tested whether our partial sequences 
of the topic sets in A, B, and C have the same order as defined by 
our experts. 
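Deriving a sequence from one metric and checking a cluster order can be sketched as follows. The function names are hypothetical, and the sort direction is an assumption: ascending order places lower values first, which suits metrics where a lower value means an easier topic, and would have to be flipped for metrics that point the other way.

```python
from typing import Dict, List, Sequence


def sequence_from_metric(values: Dict[str, float], ascending: bool = True) -> List[str]:
    """Order topics by one metric; ties are broken alphabetically."""
    sign = 1 if ascending else -1
    return sorted(values, key=lambda topic: (sign * values[topic], topic))


def respects_cluster(sequence: Sequence[str], cluster: Sequence[str]) -> bool:
    """True if the cluster topics occur in `sequence` in the cluster's order."""
    positions = [sequence.index(topic) for topic in cluster]
    return positions == sorted(positions)
```

A cluster such as B would be passed as `["neural networks", "deep learning", "LSTM"]`; other topics may appear between its members without affecting the check.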


Observing 1) - 3) we can answer the first research question as these 
metrics represent the popularity of topics within the SERPs. The 
ordered list of topic pages is not a good indicator to find an optimal 
topic sequence for online courses. Thus, popularity is not a good 
indicator of course sequencing. 


Table 2. Overlap of sequences with four experts (E1...E4) and 
the information on whether the orders of our clusters A, B, 
and C are the same as defined by experts. 


Approach | E1  | E2  | E3  | E4          | A   | B   | C
1)       | .55 | .40 | .45 | .50         | No  | No  | No
2)       | .60 | .50 | .40 | .50         | Yes | No  | No
3)       | .45 | .45 | .40 | [illegible] | No  | No  | No
4)       | .53 | .63 | .53 | .53         | No  | No  | No
5)       | .58 | .53 | .53 | .40         | No  | Yes | No
6) FRE   | .35 | .50 | .55 | .50         | No  | No  | No
7) RIX   | .50 | .50 | .50 | .5          | No  | Yes | No
8) GFI   | .55 | .65 | .60 | .60         | Yes | Yes | Yes
9) TTR   | .45 | .40 | .40 | .40         | No  | No  | No
10) NoW  | .50 | .50 | .50 | .50         | No  | No  | No


Observing the pair-wise searches in 4) and 5), we can conclude that
topic frequency within the related texts is also not a good indicator 
to get an optimal sequence of topics, which answers the second 
research question. We limited the search to exact matches. Further, 
using n-grams or other methods to detect variants could be 
beneficial. 


We identified the Gunning Fog Index as an estimator to create an 
optimal order. This answers our third research question. Using this
metric for text complexity is the most robust feature to create a
good sequence of topics in our experiment. Also, the order we got 
from our clusters is the same as in the sequence that we got by using 
the GFI. This is very important for a practical educational 
environment as the orders of topics that have a taxonomy with 
knowledge dependencies need to be done correctly. The overlap 
with the expert sequences ranges from 0.55 to 0.65, which is 
acceptable as the overlap of sequences across experts was in the 
range of 0.6 and 0.8. The remaining text complexity metrics (6, 7, 
9, 10) are not as robust as the GFI. 


Figure 2. Relative importance of features according to 
the random forest regressor. 


To gain more insight into the importance of the identified predictor, 
we use the random forest regressor [19]. So far, we have investigated 
linear features only; for future work, we want to identify features 
of high importance that also work for non-linear dependencies to 
predict the optimal sequence. Therefore, we use all pair-wise 
approaches (4-10) to train a random forest regressor. As prediction 
target, we used all pair-wise orderings (e.g., topic “neural 
networks” needs to be taught before topic “LSTM”). This is a 
classical approach to predicting the ordering of items based on 
different features. Figure 2 displays the relative importance of the 
features. The most relevant feature is the Gunning Fog Index (GFI), 
which performs best in our experiments as well. 
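
The text does not spell out how the pair-wise ordering targets were encoded. As a minimal sketch, assuming binary labels (1 if the first topic of a pair is taught before the second, 0 otherwise), the targets could be derived from a reference sequence as follows. The function name and the topic "RNN" are hypothetical and used only for illustration.

```python
from itertools import combinations


def pairwise_order_targets(sequence):
    """Map every ordered topic pair to 1 (first precedes second) or 0."""
    targets = {}
    for earlier, later in combinations(sequence, 2):
        targets[(earlier, later)] = 1  # "earlier" is taught before "later"
        targets[(later, earlier)] = 0  # the reversed pair is the negative case
    return targets


# Hypothetical reference sequence; "RNN" is an illustrative filler topic.
reference_sequence = ["neural networks", "RNN", "LSTM"]
targets = pairwise_order_targets(reference_sequence)
```

For n topics this yields n^2 - n labeled pairs, one per ordered pair, which would then be joined with the pair-wise features before training a regressor.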


5. DISCUSSION 


Automated analysis of the pair-wise SERPs and of text complexity 
using the GFI can assist instructors in planning course sequences. 
From an ethical point of view, experimenting with students is not 
justifiable, as it could harm their learning outcomes. As our 
approach does not depend on experiments with users, it can be 
applied on a large scale. Combining this approach with educational 
recommender systems, we can provide a sequence of topics based on 
the topic set that we get from the recommender system, even if no 
ontology is defined in the background. Using text complexity helps 
to start with topics that can be explained more easily than the 
following ones. For automatically composed online courses based on 
courseware, it can be beneficial to use the third-party data of 
SERPs to find an optimal order. This is an important step toward 
creating personalized online courses that adapt to the learner’s 
knowledge level where no predefined ontology exists. 


There are various fields of application where we can use our 
approach. This method can be used for planning lecture sequences 
at school or university, based on the complexity of taught topics. 
The same applies to preparing new lectures based on existing 
learning material, which can be composed in an optimal order. 
Besides, curricula could be optimized for students who take courses 
at different universities: a recommendation for the order in which 
courses should be taken is beneficial. 


From a practical point of view, it is important to note that the 
number of searches is a cost factor when using a commercial 
crawler. If n is the number of topics, then the number of searches 
is $C = n^2 - n$, since we run the pair-wise searches A + B and 
B + A (with A and B being topics of the list). Both are necessary 
as the SERP list of A + B is not the same as that of B + A. Also, 
the search query A + B returns 100 pages that need to be crawled to 
get the texts. In our experiment, this results in 16 GB of data, 
comprising 38,000 texts for 20 topics with pair-wise searches, where 
the metric has to be derived for each text. The required storage 
grows quadratically with the number of topics. The length of the 
topic list produced by educational recommender systems can generally 
be limited, so this is not a problem, but it is important to limit 
the list before finding the optimal order to avoid the need for 
large storage and high computational capacity. Besides, using the 
first 10 results of the SERPs instead of 100 reduces the crawling 
budget as well as computing time, but makes the approach less robust. 
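
The cost estimate above can be checked with a few lines of arithmetic. This is only an illustrative sketch; the function name and the `results_per_query` parameter are assumptions introduced here, not part of the described pipeline.

```python
def crawl_budget(n_topics, results_per_query=100):
    """Return (number of pair-wise searches, number of pages to crawl)."""
    searches = n_topics**2 - n_topics  # ordered pairs A + B and B + A
    pages = searches * results_per_query
    return searches, pages


print(crawl_budget(20))                        # (380, 38000)
print(crawl_budget(20, results_per_query=10))  # (380, 3800)
```

With 20 topics and 100 results per query, this reproduces the 38,000 texts reported for the experiment; restricting each query to the first 10 results reduces the number of pages to crawl by a factor of ten.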


In our experiment, we conclude that commercial popularity and the 
estimated search volume are not indicators of a good topic 
sequence. Independently of the aim of this paper, popularity is a 
helpful metric for gaining insight into trends in what people are 
searching for. Online course suppliers can use this information to 
create online courses for a large audience, whose size can be 
estimated from the search popularity. As data-driven approaches, 
e.g., AI-related decisions, require many participants, offering 
online courses that are of high interest can help to obtain the 
number of participants required to have enough training data for AI 
methods. From the researchers’ perspective, popular courses are of 
high interest for obtaining AI decisions with high statistical 
significance. Sources like the Semantic Web do not provide this 
additional information. 


As this is ongoing research, the next step is to compare the 
identified cluster sequences with sequences that can be derived 
using the Semantic Web as proposed by Toman & Weddell [20]. This 
real-world experiment can show the applicability in the field of 
education. If this method results in similar sequences, we 
recommend using an already existing semantic network and, in case 
of missing concepts, using our method as a fallback. 


Observing the overlap of expert sequences, we can see that they are 
quite diverse. Finding an “optimum in education” is mostly a 
trade-off between different opinions of experts. We used the 
sequences to detect partial sequences that are similar across all 
experts. In the future, all topics should have a description of the 
taught contents to reduce the variety of sequences. Examining the 
detected partial sequences, we can see that these topics have a 
semantic connection and some topics have knowledge dependencies. In 
a future scenario, we recommend finding clusters of topics first and 
then using a text complexity metric like the GFI to get the optimal 
order. Otherwise, there might be a switch between topics whose order 
is good when looking at the complexity only, but could be confusing 
from a more global view. From the didactical perspective, switching 
between different topics in the learning path that have little 
semantic coherence is not recommended. 


In this paper, we focused on AI-related topics to present our 
research at an early stage. It is of high interest to compare our 
approach to data from another domain. We assume that the GFI as 
a complexity metric can be used as an indicator for a useful order 
there as well. However, it remains possible that the GFI gave a 
good result by chance. Thus, extending the experiment to different 
domains is necessary to give a final and scalable recommendation. 


Besides, we assumed that the text difficulty metrics would result in 
nearly the same order, as their task is identical. Observing the 
results, we can see that there are major differences in the resulting 
order. The GFI is used to estimate how many years of formal 
education the reader needs to understand the text on the first reading 
[16]. In our case, it was the best and most practical metric. Looking 
at Figure 2, the Flesch Reading Ease (FRE) is also of high 
importance but failed to create the optimal order of our three 
clusters (Table 2). Both the GFI and the FRE are based on 
syntactical features. The GFI is enriched with contextual features 
like “easy words”. This enrichment could be a reason why this index 
works best in our experiment. Other textual metrics also need to be 
tested. Semantic features could be used, as well as textual 
entailment. In the future, combining these metrics can be 
beneficial, e.g., by training a neural network with all metrics to 
exploit non-linear dependencies, which were not examined in this 
paper. Textual metrics must be used carefully as they are “just” 
formulas for judging the complexity of texts [21]. They cannot be 
used to judge the appropriateness of content or whether the content 
is correct. Thus, selecting learning material of high quality is 
important, and the metrics are not useful in the selection process. 
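
To make concrete what kind of formula the GFI is, the following minimal sketch applies the commonly cited definition: 0.4 times the sum of the average sentence length and the percentage of words with three or more syllables. The syllable counter is a crude vowel-group heuristic and the usual exclusions (proper nouns, compound words, common suffixes) are ignored, so this is an assumption-laden approximation and not the implementation used in the experiment. Both example texts are invented.

```python
import re


def count_syllables(word):
    """Rough syllable estimate: number of groups of consecutive vowels."""
    return max(1, len(re.findall(r"[aeiouy]+", word.lower())))


def gunning_fog(text):
    """Approximate Gunning Fog Index of an English text."""
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    words = re.findall(r"[A-Za-z]+", text)
    if not sentences or not words:
        raise ValueError("text needs at least one sentence and one word")
    # "Complex" words: three or more (estimated) syllables.
    complex_words = [w for w in words if count_syllables(w) >= 3]
    avg_sentence_length = len(words) / len(sentences)
    percent_complex = 100 * len(complex_words) / len(words)
    return 0.4 * (avg_sentence_length + percent_complex)


# Invented example texts for illustration only.
easy = "The cat sat on the mat. The dog ran to the park."
hard = ("Recurrent architectures propagate information "
        "sequentially across representations.")
easy_score = gunning_fog(easy)
hard_score = gunning_fog(hard)
```

Averaging such scores over the texts retrieved for each topic gives one complexity value per topic, and sorting topics by that value yields a candidate order. As noted above, the score reflects surface features only and says nothing about whether a text is correct.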


The proposed approach depends on Google’s SERPs. High fluctuation 
of rankings within the SERPs could change the importance of 
features. As Google regularly updates its algorithms in core updates 
twice a year, rankings may change [22]. As we use the first 100 
results, we assume that the approach is robust because the set of 
the first 100 pages changes only slightly. It is debatable whether 
high-ranking Google results are web pages of high authority, and 
whether the first 100 resulting pages are a good and trustworthy 
resource for educational purposes. They are likely to be optimized 
for search engines, e.g., by search engine optimizers who create 
content with incoming links from high-authority web pages, aiming 
for high rankings. A further problem is that competitors’ texts are 
often rewritten for new pages to rank for similar terms. Thus, many 
texts with similar content can be found. Besides, the SERPs come 
from multiple contributors and may include low-quality texts from 
commercial sources, and web pages that block search engines are 
systematically excluded. 


We use Google as a proxy to get access to the web pages that 
contain the texts that we are working with. The same can be done 
with other search engines. Alternatively, relying only on resources 
whose contents are created by editors of educational publishing 
houses may introduce bias, as the complexity of texts also depends 
on the writing style of the authors. Using a resource like the first 
100 texts gives a more robust view and avoids this bias because the 
data are averaged. It can be discussed whether Google is a good 
source for characterizing academic terms because SERPs might be too 
inclusive and therefore noisy. Based on our experiment, we could 
see that an average text difficulty of the gathered data can be a 
good indicator. Whether this is the case in general has to be 
examined in further experiments. Besides, we could use educational 
materials from publishing houses. In general, they are not publicly 
accessible, which increases the costs if we want to use them. 
Further research can examine whether resources like Wikipedia or 
Web of Science can be used with similar metrics to determine the 
optimal order. 


Limiting the approach to course headlines could lead to wrong 
conclusions if the courses do not cover the topic given in the 
headline. In our experiments, we used topics as keywords only. 
Using the course description or the course(ware) content itself to 
obtain richer keywords could be beneficial; this will be addressed 
in further experiments. Besides, we did not consider synonyms, which 
should be examined in future studies because using different words 
(even synonyms) results in different SERPs. 


6. CONCLUSION 


In this paper, we propose different strategies to use texts that we 
got using a search engine to find the optimal order of online course 
topics. The pre-study has shown that the optimal topic sequences 
differ among experts. But we can also observe that there are partial 
topics that have the same order in all expert sequences. We 
identified them to define a gold standard and to check for the 
practical usefulness. The sequences derived by our approaches 
were compared to the expert ones and the order of the partial topics. 
Commercial popularity, which can be derived from searches in 
search engines, is not an indicator of a good topic sequence. 
Searching for pair-wise topics and comparing the text complexity 
of the web pages listed in the SERPs can be used as an indicator 
for creating a plausible order of taught topics within online 
courses. 


We identified the Gunning Fog Index as the most robust metric for 
topic sequencing. We can conclude that this feature helps to find 
the optimal sequence for automatically composed online courses, so 
that they can be personalized ethically, without assigning students 
randomized learning paths that could impair their learning 
experience as well as their learning outcome. 


ABSTRACT 


Recent work describes methods for systematic, data-driven 
improvement to instructional content and calls for diverse teams 
of learning engineers to implement and evaluate such 
improvements. Focusing on an approach called “design-loop 
adaptivity,” we consider the problem of how developers might use 
data to target or prioritize particular instructional content for 
improvement processes when faced with large portfolios of 
content and limited engineering resources to implement 
improvements. To do so, we consider two data-driven metrics that 
may capture different facets of how instructional content is 
“working.” The first is a measure of the extent to which learners 
struggle to master target skills, and the second is a metric based 
on the difference in prediction performance between deep learning 
and more “traditional” approaches to knowledge tracing. This 
second metric may point learning engineers to workspaces that 
are, effectively, “too easy.” We illustrate aspects of the diversity 
of learning content and variability in learner performance often 
represented by large educational datasets. We suggest that 
“monolithic” treatment of such datasets in prediction tasks and 
other research endeavors may be missing out on important 
opportunities to drive improved learning within target systems. 


Keywords 


Design-loop adaptivity, deep knowledge tracing, Bayesian 
knowledge tracing, mastery learning, learning engineering. 


1. INTRODUCTION 


Recent work calls on researchers and developers, including teams 
of learning engineers [14, 26], to focus on “explanatory” models 
of learners [25] and “design-loop adaptivity” processes [1, 15] to 
practically improve learning systems. While researchers describe 
specific examples of how explanatory learner models and design- 
loop adaptivity can be used to drive improvements to instruction, 
less (if any) attention has been paid in the literature to the 
practical problem of how content developers and learning 
engineers target and prioritize content for improvement. 


We focus on cases in which a target system has a large portfolio 
of content, elements of which must be prioritized and targeted for 
improvement given finite learning engineering and software 
development resources. We present a case study using a data set 
that is among the largest considered in the literature on knowledge 
tracing and related methods [9, 18, 22], comprised of middle 
school and high school student work over an academic year on 
several hundred mathematics topics, each generally completed by 
thousands of students, generating several hundred million data 
points tracking student actions. We motivate, describe, and 
illustrate two approaches to targeting content for improvement 
within this portfolio, focusing primarily on what Aleven et al. [1] 
call “design-loop adaptation to student knowledge,” relying on 
large-scale data to find similarities amongst learners we might 
leverage to redesign instructional content for better learning. 


One targeting method is based on a measure of the extent to which 
learners tend to struggle with particular pieces of content, and we 
contrast it with an approach based on the relative prediction 
performance of deep learning models (i.e., Deep Knowledge 
Tracing; DKT [18, 22]) compared to traditional Bayesian 
Knowledge Tracing (BKT; [9]) models. 


The first method targets content students struggle to learn, relying 
on measures of knowledge component (KC [19]; or skill) mastery 
that are internal to the target intelligent tutoring system (ITS). In 
contrast, the second method is roughly motivated by the idea that 
identifying content in which there is a large difference in 
performance between deep learning and traditional Bayesian 
approaches may suggest areas in which deep learning can 
leverage statistical regularities in students’ performance that could 
point to improvements in the KC models that are used to drive 
adaptation with BKT. Such performance differences may suggest 
a particular focus area for KC model improvements. Relative 
DKT performance versus BKT performance also provides an 
instance of a metric that is perhaps less dependent on how the run- 
time ITS has “set the bar” for success in terms of KC mastery. 


In exploring these two approaches, we illustrate the variability in 
learning content and experiences within widely deployed systems 
like Carnegie Learning’s MATHia (formerly Cognitive Tutor) 
[23]. While different facets of variation may at times call for 
different approaches to content improvement (e.g., variation in 
student motivation could call for redesigns that discourage 
“gaming the system” [3]), our present work explores how to guide 
learning engineers’ “attention” to particular pieces of content to 
then consider specific improvements via processes for design-loop 
adaptivity [1, 15]. 


Original contributions of this work are two-fold: (1) We describe 
a novel problem in the literature related to how to target 
instructional content improvement or design-loop adaptivity and 
explore two targeting approaches, and (2) we shed light on 
opportunities in treating large-scale educational datasets that may 
be missed by treating such datasets as “monolithic” targets for 
data-intensive approaches. Treating datasets in a “monolithic” 
way, though not a universal practice (e.g., [4-5, 10]) may inhibit 
practical progress in learning engineering. 


In addition to considering one of the largest-scale applications of 
DKT (and BKT) modeling in the literature, we illuminate avenues 
for research at the intersection of educational data science and 
learning engineering at scale in a widely-deployed adaptive 
learning platform for K-12 mathematics. We seek to amplify 
extant calls for a more nuanced approach to work on performance 
prediction [15, 25] while illustrating solutions to practical 
problems in learning engineering and product improvement. 


2. DESIGN-LOOP ADAPTIVITY 
2.1 Background 


A recent survey of adaptive instructional technologies [1] 
describes three categories along which learners’ experience can be 
varied, including “step-loop adaptivity,” “task-loop adaptivity,” 
and “design-loop adaptivity.” Step-loop adaptivity and task-loop 
adaptivity roughly correspond to “inner” and “outer” loop 
adaptive functionality in ITSs distinguished by VanLehn (e.g., 
[28]), respectively. We briefly describe step-loop and task-loop 
adaptivity before considering design-loop adaptivity. 


Step-loop or inner-loop adaptivity enables an adaptive 
instructional system or ITS to provide support to learners within a 
particular learning task based on their performance (e.g., 
providing context-sensitive hints or just-in-time feedback within a 
math problem based on learner responses). Task-loop or outer-loop 
adaptivity enables an instructional system to choose the next 
appropriate task for a learner based on a model of student 
learning and evolving estimates of a learner’s mastery of 
underlying competencies, skills, or KCs [19] based on a learner’s 
performance. Extensive educational data mining (EDM) literature 
considers, for example, variants of and data-driven parameter 
optimizations for BKT (e.g., [18]), which can be used to select 
tasks for learners as their mastery of KCs evolves. 


In their recent survey, Aleven and colleagues describe design-loop 
adaptivity as involving 


data-driven decisions made by course designers before 
and between iterations of system design, in which a... 
system is updated based on data about student learning, 
specifically, data collected with the same system... [1]. 


They go on to describe goals toward which design-loop 
adaptations might be made, including adaptations to student 
knowledge, affect and motivation, student strategies and errors, 
and self-regulated learning, providing examples of each. 
Canonical examples of design-loop adaptivity or adaptation to 
student knowledge, the goal of our present targeting and 
prioritization endeavor, generally involve situations in which 
content within tutoring systems or online courses is improved by 
refining the fine-grained KC models that drive the adaptive 
experience of learners using a combination of data and human 
expertise [17, 20, 27]. 


Design-loop adaptivity for motivation and affect might drive 
content or system design and redesign to discourage off-task 
behavior [4] and “gaming the system” [3], wherein students 
attempt to make progress in a system by taking advantage of 
system features like hints, rather than making genuine attempts to 
master content. Aleven et al. [1] suggest that an approach to 
modeling gaming the system behavior based on a large-scale 
survey of the extent to which gaming the system [3] manifests 
across topics (what we will refer to as “workspaces”) in an 
intelligent tutoring system like MATHia provides a foundation for 
future design-loop adaptivity investigations. One important facet 
of this work (and related work on off-task behavior [4]) is its 
appreciation of the extent to which there is variability in how 
learning occurs across different (types of) content within adaptive 
instructional systems. Appreciating and surveying this variability 
is vital to ascertaining where, within large portfolios of content, to 
target design-loop adaptivity efforts and related data-driven, 
instructional improvement efforts. 


2.2 A Process for Design-Loop Adaptivity 
Huang et al. [15] describe a systematic approach to design-loop 
adaptivity or data-driven instructional redesign and improvement. 
They suggest three general goals for such redesign efforts. For a 
particular piece of content in an ITS or similar adaptive 
instructional system with a KC model, the goals are: (1) refine the 
KC model for the target content, (2) redesign the content, and (3) 
optimize individualized learning within the content. Existing 
EDM methods and novel analyses are then described to achieve 
each of these goals, targeting an “Algebraic Expressions” unit of 
content within the Mathtutor ITS [2]. For example, KC models 
can be refined using data-driven, computationally intensive 
methods like Learning Factors Analysis (LFA; [8]) or a simpler 
approximation of such an approach that uses regression 
techniques called “difficulty factor effect analysis” by Huang et 
al. [15]. Human expertise also plays an important role in such 
refinements, including in setting up data-driven analyses to 
produce meaningful results, interpreting these results for inclusion 
in potential task redesigns, and often in providing suggested 
refinements for target tasks. 


Huang et al. [15] demonstrate that redesigned content improves 
learning as measured by pre-tests and post-tests. Broadly, these 
goals align with on-going, data-driven content improvement 
efforts pursued by learning engineers working with MATHia. 
Nevertheless, the process of design-loop adaptivity generally 
requires extensive human and computational resources to be 
carried out in ways that will drive improved instructional 
effectiveness. The present work seeks to illustrate how EDM 
techniques might help improve targeting this process. 


3. MATHia 


3.1 Learning Platform 

Carnegie Learning’s MATHia [23] is an ITS used by hundreds of 
thousands of learners each year, mostly in middle and high school 
classrooms as a part of a blended math curriculum that combines 
collaborative work guided by instructors and Carnegie Learning’s 
MATHbook worktexts (60% of instructional time in recommended 
implementations) with individual work in MATHia (40% of 
instructional time). Nevertheless, usage of MATHia, contexts in 
which it is used, and other implementation details vary across a 
diverse, nationwide user-base. 


Grade levels of content in MATHia (e.g., Grade 7, Algebra I) are 
organized into a series of “modules,” each of which is comprised 
of a series of “units.” Units are composed of a series of 
“workspaces.” Workspaces represent the underlying unit of 
learner progress to mastery in MATHia. Each workspace presents 
a set of problems associated with a set of KCs; student progress 
within the system is determined by students’ achievement of 
mastery of all of the KCs associated with a particular workspace, 
estimated by MATHia using BKT (see §3.2). Learning 
experiences vary substantially between workspaces with respect 
to design patterns, content areas, types of practice and instruction 
provided, (quality of) KC models intended to practice such 
content (e.g., some the result of years of iterative refinements, 
others introduced more recently), BKT parameters, and other 
parameters that drive task selection and mastery judgment. 


Consider the problem solving task illustrated in Figures 1 and 2. 
Figure 1 illustrates the workspace “Modeling the Constant of 
Proportionality.” In this workspace, students are provided with a 
word problem and several associated questions (left pane; Figure 
1). On the right-hand side of Figure 1, tools are presented to solve 
the problem’s “steps.” There is a worksheet or table in which they 
can provide units of measurement, responses to questions, and 
fields in which to write expressions to model the problem’s 
scenario. After they have completed entries in the worksheet, 
students work with a graphing tool. Each problem-step in the ITS 
can provide context-sensitive hints upon request as well as just-in- 
time feedback that tracks errors that students often make. Most 
problem-steps are mapped to KCs, for which MATHia provides 
an evolving mastery estimate to adapt problem selection to the 
individual student’s needs (see §3.2). 


Contrast the learning experience of the problem in Figure 1 with 
that of Figure 2. “Modeling the Constant of Proportionality” 
(Figure 1) involves substantive reading, modeling the problem 
scenario via algebraic expressions, working through concrete 
instances of these expressions, and using a graphing tool. Figure 2 
illustrates problem-solving in a menu-based equation “solver” 
workspace, “Solving with the Distributive Property Over 
Multiplication.” Here the student is tasked with solving for x in 
the equation 65 = 10 (x + 6). There is little reading and no context 
provided for the equation, but hints and just-in-time feedback are 
available. Learners’ progress toward mastery is tracked for a 
different set of KCs. The menu-based solver constrains possible 
student actions at various points in the equation-solving process 
compared to the typed-in input that students provide in the 
worksheet in Figure 1. Far from an exhaustive list, we seek to 
illustrate a few from among substantial differences in types of 
content provided, design patterns, interaction modalities, 
underlying KC models, and tools available, even within the 
relatively constrained domain of math, any of which may have 
important impacts on inferences that might be drawn from data or 
the ability of different methods to predict performance and 
learning within such content. While any of the features in these 
examples might reasonably be refined as a part of the design-loop 
adaptivity or content improvement process, we leave to future 
work the data-driven targeting of specific improvements within a 
workspace. We consider how to target specific “workspaces” for 
design-loop adaptivity improvements. 


3.2 Knowledge Tracing & Mastery Learning 

BKT [9] posits a binary (i.e., “mastered” or “unmastered”) 
knowledge state for each independently modeled KC and can be 
formalized as a four-parameter hidden Markov model. One 
parameter represents the probability that a learner has already 
mastered a KC before their first opportunity to practice it. A 
second parameter represents the probability that a learner 
transitions from the unmastered to the mastered state at any 
particular KC practice opportunity. Two parameters link the 
knowledge state to observable outcomes at any KC practice 
opportunity: the probability that a student is in the unmastered 
state and responds correctly (“guessing”) and the probability that 
a student is in the mastered state and answers incorrectly 
(“slipping”). Extensive EDM literature has explored the data- 
driven fitting of BKT parameters as well as individualized (e.g., 
[30]) and more sophisticated variants of this approach (e.g., [18]). 


Figure 1. Problem-solving screenshot from a MATHia 
workspace called “Modeling the Constant of Proportionality.” 


Figure 2. Screenshot from the MATHia workspace “Solving 
with the Distributive Property Over Multiplication.” 


Based on parameter settings and performance data collected as a 
student practices each KC, the system can use BKT to infer and 
update estimates of the probability that a student is in the 
“mastered” state for any particular KC. Typically, systems set a 
threshold for mastery (often 0.95, as in MATHia); if the system’s 
estimate that the probability a student has mastered a particular 
KC is above the threshold, then the system considers the KC 
mastered for that student. 
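
The update described above can be sketched with the standard BKT equations: a Bayesian update of the mastery estimate given the observed response, followed by the learning transition. This is an illustrative sketch, not MATHia's implementation; the function names and the parameter values in the example are hypothetical, and only the 0.95 threshold comes from the text.

```python
def bkt_update(p_mastered, correct, p_learn, p_guess, p_slip):
    """Return the updated mastery estimate after one practice opportunity."""
    if correct:
        # Correct answers come from mastery without a slip, or from a guess.
        evidence = p_mastered * (1 - p_slip)
        total = evidence + (1 - p_mastered) * p_guess
    else:
        # Incorrect answers come from a slip, or from non-mastery, no guess.
        evidence = p_mastered * p_slip
        total = evidence + (1 - p_mastered) * (1 - p_guess)
    posterior = evidence / total
    # The learner may move from unmastered to mastered at this opportunity.
    return posterior + (1 - posterior) * p_learn


def trace_mastery(responses, p_init, p_learn, p_guess, p_slip, threshold=0.95):
    """Return (final estimate, 1-based opportunity at which mastery was inferred)."""
    p = p_init
    for opportunity, correct in enumerate(responses, start=1):
        p = bkt_update(p, correct, p_learn, p_guess, p_slip)
        if p > threshold:
            return p, opportunity
    return p, None


# Hypothetical parameter values chosen only for illustration.
params = dict(p_init=0.3, p_learn=0.2, p_guess=0.2, p_slip=0.1)
final_p, mastered_at = trace_mastery([True, True, True, True], **params)
```

In this sketch each KC is traced independently, matching the assumption of independently modeled KCs stated above.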


Relying on evolving estimates of learner KC mastery, 
instructional systems can use knowledge tracing frameworks like 
BKT to drive “task-loop” (or “outer loop”) adaptivity [1, 28] and 
mastery learning [7, 24]. After a student completes a problem (or 
task; like the problems illustrated in Figures 1-2), the system can 
select the next problem based on KCs that a student has yet to 
master. In this way, systems can adapt to the student’s evolving 
mastery of KCs, providing (ideally) just enough practice for 
students to master KCs and avoiding cases in which the system 
provides too little or too much practice. 


Implementing self-paced mastery learning [7, 24], MATHia 
provides practice to a student until they have either mastered all 
KCs associated with a particular workspace or they have reached 
the maximum number of problems that designers have specified 
for a particular workspace. Once the student masters all of the 
KCs in a particular workspace (or reaches the max number of 
problems), they are moved on to the next workspace in an 
assigned content sequence. Teachers are alerted when students 
reach the max number of problems in a workspace without 
reaching mastery. Setting a max number of problems ensures that 
students do not endlessly struggle unproductively within a piece 
of content [11]. 
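
The stopping rule described above can be sketched as a simple loop. The data structure (one dictionary of KC mastery estimates after each problem), the function name, the status labels, and the example values are assumptions made for illustration; they do not describe MATHia's internals.

```python
def run_workspace(estimates_per_problem, max_problems, threshold=0.95):
    """Return (status, problems completed) for one student in one workspace.

    estimates_per_problem: iterable of dicts mapping a KC name to the
    mastery estimate after each completed problem (hypothetical format).
    """
    completed = 0
    for estimates in estimates_per_problem:
        completed += 1
        if all(p > threshold for p in estimates.values()):
            return "mastered", completed  # student moves on with mastery
        if completed >= max_problems:
            # Student moves on without mastery; the teacher is alerted.
            return "max_problems_reached", completed
    return "in_progress", completed


# Hypothetical estimates for two KCs over three problems.
history = [
    {"kc_a": 0.60, "kc_b": 0.40},
    {"kc_a": 0.97, "kc_b": 0.80},
    {"kc_a": 0.99, "kc_b": 0.96},
]
status, n_problems = run_workspace(history, max_problems=25)
capped_status, capped_n = run_workspace(history, max_problems=2)
```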


3.3 Data 

We consider data from 252,036 learners who used MATHia 
during the 2018-19 academic year and completed at least one of 
308 workspaces that track KC mastery across math content for 
Grades 6-8, Algebra I, Algebra II, and Geometry. These data 
account for approximately 3.8 million workspace completions. 
Models are learned over subsets of 267,419,999 student actions 
(i.e., first-attempts, including hint requests) at problem-steps 
mapped to KCs. Over the 308 workspaces, MATHia tracks 2,152 
KCs. Table 1 provides summary statistics. 


Table 1. Summary statistics for 308 MATHia workspaces in 
2018-2019; “KCs” = # KCs tracked; “Comps.” = # student- 
workspace-completions; “Actions” = sum across all students 
completing workspace of count of first attempts (including 
hint requests) at problem-steps within workspace problems. 


           Min.      Q1    Med.       Q3     Max. 
KCs           2       5       6        9       15 
Comps.      167    4275    9414    18801    51097 
Actions    5530  197757  489159  1278325  7191034 


When working with large, complex datasets, it is essential to 
focus learning engineering efforts on the portions of the system 
for which improvements can be most impactful. Rather than 
consider such a broad dataset as a single monolithic target, 
especially for performance prediction modeling in §4.2, we learn 
models for each workspace within the dataset; input data are 
sequences of correctness labels for learner actions (e.g., binary 
correct or incorrect, where incorrect includes both errors and hint 
requests) and labels for KCs mapped to each action. 
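
As a rough sketch of this per-workspace input format, the snippet below groups time-ordered action records into one sequence of (KC, correctness) pairs per student within each workspace, counting hint requests as incorrect. The record layout, field order, and identifiers are hypothetical; the actual MATHia log schema is not described here.

```python
from collections import defaultdict


def build_sequences(actions):
    """Group actions into per-workspace, per-student (KC, label) sequences.

    actions: iterable of (workspace, student, kc, outcome) tuples in time
    order, with outcome in {"correct", "error", "hint"} (assumed format).
    """
    sequences = defaultdict(lambda: defaultdict(list))
    for workspace, student, kc, outcome in actions:
        label = 1 if outcome == "correct" else 0  # errors and hints -> 0
        sequences[workspace][student].append((kc, label))
    return sequences


# Hypothetical log records for illustration.
log = [
    ("ws1", "s1", "kc_a", "correct"),
    ("ws1", "s1", "kc_b", "hint"),
    ("ws2", "s1", "kc_c", "error"),
    ("ws1", "s2", "kc_a", "correct"),
]
sequences = build_sequences(log)
```

A separate model can then be fit to `sequences[workspace]` for each workspace instead of to the pooled data.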


4. METRICS FOR TARGETING 
IMPROVEMENTS 


As illustrated in Figures 1 and 2, workspace-to-workspace 
variability in learning experiences is substantial. Types of practice 
vary (e.g., equation solving, graphing, etc.), and developers make 
a plethora of design choices in creating content. Some workspaces 
require more reading; KC models vary in complexity, and some 
have been iteratively refined over the course of nearly two 
decades while others are newly deployed in a given year. Given 
this variation and the nature of grade-level content standards, 
there is also variability in the extent to which learners find 
particular content difficult. 


Learner difficulties manifest at the problem-step level in the form
of problem-solving errors and hint requests and at the workspace 
level in at least two ways: (1) that some learners require a greater 
number of problems to achieve mastery of all KCs, and (2) that 
some learners reach the maximum number of problems set by 
designers without having achieved mastery of all KCs. These 
latter students are moved along within their curriculum sequence 
without mastery. Teachers are alerted of this failure to reach 
mastery via reporting analytics available to them as well as in the 
LiveLab teacher companion app to MATHia. Some students fail 
to reach mastery in a workspace because of genuine difficulty 
with presented math content, but relatively frequent instances of
such failure to reach mastery often indicate that content
improvements (i.e., design-loop adaptivity) are called for to
enhance experiences for learners.


Prior research considers MATHia’s workspace level as a unit of 
analysis. Researchers have focused on associations between 
characteristics of Cognitive Tutor “lessons” (MATHia’s 
workspaces) and learners’ affective states like confusion and 
frustration [10] as well as the extent to which students go off-task 
[4] and game the system [5]. In what follows, we adopt an 
approach similar in spirit to this literature by considering a large 
corpus of MATHia data as broken down into workspaces rather 
than treating the entire dataset in a monolithic fashion. 


The first metric we consider helps identify content that is 
instructionally ineffective in ways that manifest as difficulty for 
learners to successfully complete the content. In considering the 
second metric, we explore one example where the metric may be 
providing some insights into places where content is not 
“difficult” (i.e., measures of difficulty do not “raise flags” about
improvement needs) but where design-loop adaptivity 
improvements might drastically improve student learning. 


4.1 Proportion of Failures to Reach Mastery 
The first design-loop adaptivity targeting metric we consider is 
the proportion of learners who fail to reach mastery of at least one 
of the KCs associated with a workspace before reaching the 
maximum number of problems set by content designers. Figure 3 
provides a histogram showing the overall distribution of this 
proportion across workspaces. The median workspace has 4.3% of 
students fail to reach mastery of all its KCs (minimum = 0%; Q1 
= .7%; Q3 = 12.1%; maximum = 77.7%).


Figure 3. Histogram illustrating the distribution of the
proportion of students failing to reach mastery of all KCs 
associated with 308 workspaces in the 2018-19 academic year. 
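

As a rough sketch of this first metric, the snippet below computes the per-workspace proportion of completions that ended without mastery of all KCs and then summarizes the distribution with quartiles. The (workspace, mastered_all_kcs) record layout and the toy data are assumptions made for illustration.

```python
from statistics import quantiles

def failure_proportions(completions):
    """Map workspace -> share of completions that ended without full mastery.

    `completions` is an iterable of (workspace, mastered_all_kcs) pairs,
    one per student-workspace completion (hypothetical layout).
    """
    totals, failures = {}, {}
    for workspace, mastered_all in completions:
        totals[workspace] = totals.get(workspace, 0) + 1
        failures[workspace] = failures.get(workspace, 0) + (0 if mastered_all else 1)
    return {w: failures[w] / totals[w] for w in totals}

# Made-up completions for three workspaces.
toy = [("w1", True), ("w1", True), ("w1", False), ("w1", True),
       ("w2", True), ("w2", True),
       ("w3", False), ("w3", True)]
props = failure_proportions(toy)

# Quartiles of the per-workspace proportions (Q1, median, Q3).
q1, med, q3 = quantiles(sorted(props.values()), n=4, method="inclusive")
```

Workspaces can then be ranked by this proportion to decide where limited redesign resources might go first.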


Fancsali et al. [11] argue that students’ failure to achieve mastery 
at a level of aggregation like that of a workspace is an important 
outcome for predictive modeling, mostly overlooked in the 
literature on so-called “wheel spinning” (e.g., [6]), which tends to 
develop models to predict whether students will master particular 
KCs in a tutoring system, ignoring other elements of how 
instructional content is presented. Fancsali et al. argue that, given 
the clustering of KCs within problems, the clustering of problems 
within workspaces, and the fact that workspaces are the unit at 
which learners make progress in ITSs like MATHia, reporting 
outcomes like the count and percentage of KCs that students fail to
master (à la Beck and Gong [6]) is of dubious practical value.


Since design-loop adaptivity improvements are likely to often
involve redesign of instructional content, we similarly contend 
that measures closely aligned to instructional delivery are likely to 
be helpful in targeting this process. Large proportions of students 
failing to master instructional content are likely to be important in 
determining what learning content to improve with limited 
resources. This metric serves as a foil to a second approach. 


4.2 DKT vs. BKT Prediction Performance 


Extensive recent literature (e.g., [12, 18, 22]) considers deep 
learning approaches to the problem of predicting student 
performance at fine-grained opportunities to demonstrate mastery 
of KCs in learning systems like ASSISTments [13]. DKT [22] has 
been compared (e.g., [12, 18]) to BKT and logistic regression
approaches to the same type of prediction task [21]. Early work 
demonstrated that DKT generally had superior prediction 
performance compared to BKT [22], but subsequent literature also 
suggests that variations of BKT (e.g., modeling “forgetting”) and 
logistic regression approaches can bridge some, if not most, of the 
gap in prediction performance (e.g., [18, 29]). 


Nevertheless, we seek to better understand the extent to which 
DKT outperforms BKT when considered workspace-by-workspace
across a large dataset from MATHia, which presents a
wide variety of learning experiences. We find that, for a variety of 
workspaces, classic BKT’s performance is often comparable to 
DKT even without accoutrements added in the work of Khajah et 
al. [18]. Further, in keeping with our primary concern in the 
present work, we explore the extent to which observed differences 
in performance between the two approaches, especially examples 
of DKT’s far superior prediction performance, might serve as a 
metric for targeting improvement work for MATHia workspaces, 
possibly indicating an especially flawed KC model. 


4.2.1 Modeling Approach 

We rely on the Khajah et al. [18] implementation of DKT with 
long short-term memory (LSTM) recurrent units.¹ We use
Yudelson’s hmm-scalable² implementation of classic BKT
parameter fitting using expectation maximization [30]. We learn 
DKT and BKT models for each of the 308 workspaces, splitting 
the data for each workspace into training and test sets with an 80%-20%
student-level split, and calculate the AUC (area under the
receiver-operating characteristic curve) on the test set following 
methods in Khajah et al. [18]. BKT and DKT models are trained 
and tested on the same datasets. AUC is a measure of the extent to 
which a model can “discriminate” between or predict students’ 
correct and incorrect responses in the held-out test set. An AUC 
value of 0.5 indicates “chance” ability to discriminate between 
two classes; a value of 1.0 indicates perfect discrimination. 
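
The two evaluation ingredients named above, a student-level split and AUC, can be sketched in plain Python. This is an illustrative stand-in, not the evaluation code of Khajah et al. [18]: the function names and the fixed seed are assumptions, and the pairwise AUC computation is quadratic, so it suits small examples only.

```python
import random

def student_level_split(student_ids, test_fraction=0.2, seed=0):
    """Split students (not individual actions) into train and test sets."""
    ids = sorted(set(student_ids))
    random.Random(seed).shuffle(ids)
    n_test = max(1, round(len(ids) * test_fraction))
    return set(ids[n_test:]), set(ids[:n_test])

def auc(labels, scores):
    """AUC as the probability that a randomly chosen correct response is
    scored above a randomly chosen incorrect one (ties count one half)."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("AUC needs both correct and incorrect responses")
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

train_ids, test_ids = student_level_split(range(10))
```

Splitting by student keeps every action of a held-out learner out of training, so the test AUC reflects prediction for unseen students.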


4.2.2 Results 

Table 2 provides summary statistics for AUC performance for 
DKT, BKT, and AUC differences of these methods over all 
workspaces. As expected, DKT generally provides superior 
prediction performance to classic BKT over the 308 workspaces. 
However, there is substantial variability, with classic BKT in 
some cases, albeit many (but not all) with relatively small sample 
sizes, even out-performing DKT. While there is a modest, 
statistically significant positive correlation between the AUC 
difference in DKT and BKT and sample size (i.e., the number of 


¹ https://github.com/mmkhajah/dkt
² https://github.com/myudelson/hmm-scalable


student-sequences available for training and testing) (r = .2; p < 
.001), BKT performs comparably to DKT on a number of 
workspaces with tens of thousands of students’ data, and BKT 
only underperforms DKT by approximately .07 AUC units for the 
median workspace. The Q1 value for this difference (the greatest 
difference over 77 workspaces) is approximately in line, in terms 
of AUC units, with a value (.03 AUC units) declared comparable 
by Khajah et al. [18] for BKT “variants” compared to DKT. 


The difference in AUC between DKT and BKT is uncorrelated 
with the proportion of students who fail to reach mastery (r = -.05; 
p = .4) and is thus not an indicator of the relative difficulty of
particular workspaces, regardless of the source of difficulty. 


Table 2. Summary statistics for AUC performance over 308 
workspaces of DKT and BKT models and of the difference 
between DKT and BKT performance (Δ); negative minimum
value indicates better BKT performance for some workspaces.


AUC   Min.     Q1      Med.    Q3      Max.
DKT   .5852    .7839   .8331   .8783   .9763
BKT   .5150    .7045   .7456   .7854   .9563
Δ     -.0802   .0361   .0676   .1281   .3073


4.2.3 Practical Promise

We consider two observations relating to workspace design 
patterns that emerge from considering workspaces with the largest 
differences in terms of DKT’s (generally better) prediction 
performance compared to BKT. First, we consider the design of a 
particular workspace as a prime target for design-loop adaptivity 
to student knowledge, motivation, and affect. Second, we consider 
more general design patterns in workspaces on which DKT and 
BKT performance differences are greatest, suggesting more 
“macro-level” design-loop adaptivity that may affect broader 
categories of workspaces. 


4.2.3.1 Example Workspace 

The second greatest observed difference in AUC occurred for the 
workspace “Checking Solutions to Linear Equations” (DKT AUC 
= .968; BKT AUC = .684). A mere 0.2% of students fail to master 
all KCs in this workspace, suggesting that it may not be “flagged” 
for design-loop adaptivity improvements based on difficulty. 
Nevertheless, careful inspection of the workspace yields several 
areas for improvement. 
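
One way to operationalize this observation is to look for workspaces where the DKT-BKT gap is large while the failure-to-master rate is low. The sketch below is hypothetical: the thresholds and the two contrast workspaces are invented for illustration, and only the values for “Checking Solutions to Linear Equations” come from the text.

```python
def flag_workspaces(stats, min_gap=0.2, max_failure=0.05):
    """Return workspaces with a large DKT-BKT AUC gap but few mastery failures.

    `stats` maps workspace name -> (dkt_auc, bkt_auc, failure_proportion).
    Both thresholds are hypothetical choices, not values from the study.
    """
    flagged = []
    for name, (dkt_auc, bkt_auc, fail) in stats.items():
        gap = dkt_auc - bkt_auc
        if gap >= min_gap and fail <= max_failure:
            flagged.append((name, round(gap, 3)))
    # Largest gaps first.
    return sorted(flagged, key=lambda item: item[1], reverse=True)

stats = {
    # Values reported in the text for this workspace.
    "Checking Solutions to Linear Equations": (0.968, 0.684, 0.002),
    # Made-up workspaces for contrast.
    "hypothetical_hard_workspace": (0.80, 0.74, 0.30),
    "hypothetical_typical_workspace": (0.83, 0.76, 0.04),
}
flagged = flag_workspaces(stats)
```

Only the first workspace passes both filters here, which mirrors the case made in this section: a difficulty-based metric alone would not have surfaced it.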


This workspace presents students with problems (see Figure 4)
like: “Jordan solved the equation -3u - 8 = 10. She calculated u =
-6. Use the Solver to check Jordan’s solution.” The student is then
presented with a menu-based equation solver. Work with the 
equation solver should involve the student substituting in the 
solution value from the problem presentation and checking 
whether the result is a balanced equation. After choosing 
“Substitute for variable” from the menu, the student then must 
input a value on the left-hand side of the equation (see Figure 5). 


Problems in this workspace present both correct and incorrect 
cases, but the KC model does not distinguish between correct and 
incorrect cases, making problems with a correct solution targets 
for possible gaming the system. For example, in the problem in 
Figure 5, the student might enter 10 to complete “10 = 10.” This
response may not reflect having correctly carried out the variable 
substitution to arrive at this solution. KCs in this workspace are 
also not currently mapped to work in the solver; the solver 
provides hints and just-in-time feedback on errors, but it is not 
instrumented to track KC mastery. Once the student has entered
the appropriate value, two questions appear on the left of the
screen (see Figure 6), and student responses to these questions 
trigger updates to two KCs at a time. 


Figure 4. Screenshot in the MATHia workspace “Checking 
Solutions to Linear Equations.” 


Figure 5. Screenshot after the student has selected “Substitute 
for variable” from the equation solving menu (see Figure 4). 


Figure 6. Screenshot after the student has entered the value 10 
on the left-hand side of the equation (see Figure 5). 


Following design-loop adaptivity steps laid out by Huang et al. 
[15], we have identified (1) where the KC model can be refined, 
and (2) areas for task redesign. The third step would involve 
fitting a BKT model using the hypothetical re-mapping of KCs to 
steps within problems in existing data to determine whether the 
hypothesized, refined KC model fits the data better than the 
existing KC model. Future work will experimentally test 
workspace redesigns to “close the loop” (cf. [16, 25]) between this 
data-driven approach and empirical learning outcomes. 


4.2.3.2 Prominent Design Patterns 

Patterns emerge in comparing the performance of DKT to BKT 
over 308 workspaces. In the top twenty workspaces in which 
DKT outperforms BKT, differences in AUC units range from .307 
to .212, and all provide constrained input mechanisms relative to 


broader MATHia content. Fourteen workspaces (70%) involve 
equation solving, and the others are split between those that 
involve placing values on a number line and those in which 
problem input is provided via drop-down menus. 


Gervet et al. raise questions about explanations for observed 
properties of DKT in predicting student performance. Can DKT, 
for example, “better pick up on local patterns of student behavior 
like gaming the systems” [12]? While far from conclusive, DKT’s 
performance for the workspace “Checking Solutions to Linear 
Equations” could exemplify this phenomenon. Workspaces with 
more constrained inputs may provide examples where DKT 
“picks up” on local patterns that BKT does not. Future work ought 
to investigate whether these particular types of relatively 
constrained input mechanisms are easy to “game” or whether and 
how DKT learns local performance patterns. 


Equation solver and number line workspaces are widespread in 
the top workspaces in which DKT outperforms BKT. “Checking 
Solutions to Linear Equations” has readily apparent flaws, 
suggesting that our approach may be promising in targeting 
instructional improvement work. Systematic review of these 
results remains future work. 


5. DISCUSSION 

There are numerous questions for future research. That 
differences in AUC between DKT and BKT are uncorrelated with 
an important measure of instructional ineffectiveness, combined 
with DKT’s ability to find regularities in data that are not found 
by BKT suggests that this difference may be signaling important 
workspace characteristics. Analysis of a particular workspace 
(§4.2.3.1) suggests that DKT-BKT differences may signal 
inadequacies in the KC model. These findings can be compared to 
the results of data-driven search for better KC models [8]. 
Improvements can be made to the workspace, and A/B tests can 
“close the loop” and establish more effective approaches. 


Systematic analysis of instructional content and prediction 
performance differences in DKT and BKT might follow work that 
explores a space of properties and features of particular tutor 
“lessons” to determine which predict students’ affect, gaming the 
system, and off-task behavior [4-5, 10]. Comparisons to logistic 
regression methods (e.g., [12]) are also needed. 


Naive learning engineering may focus on reducing students’ 
mastery failures. Such an approach could lead to “over- 
simplified” tasks that don’t produce failure because they don’t 
require much knowledge. Large differences between DKT and 
BKT may help identify over-simplified workspaces that provide 
opportunities for students to game the system [3, 12]. To what 
extent do gaps in modeling techniques’ performance indicate 
unproductive patterns of “local” behavior in particular 
workspaces? What else drives differences? What other behavior 
patterns indicate ways to target improvement? 


Methodologically, our “non-monolithic” analysis of a large 
educational data set treats component instructional experiences as 
units for analysis. Such analytical decomposition is vital to 
practical learning engineering to improve instructional systems 
and large portfolios of content used by learners every day.


ABSTRACT


Knowledge tracing algorithms are embedded in Intelligent 
Tutoring Systems (ITS) to keep track of students’ learn- 
ing process. While knowledge tracing models have been 
extensively studied in offline settings, very little work has 
explored their use in online settings. This is primarily be- 
cause conducting experiments to evaluate and select knowl- 
edge tracing models in classroom settings is expensive. To 
fill this gap, we introduce a novel way of using machine- 
learning models to generate simulated students. We con- 
duct experiments using agents generated by the Apprentice 
Learner Architecture to investigate the online use of differ- 
ent knowledge tracing models (Bayesian Knowledge Tracing, 
the Streak model, and Deep Knowledge Tracing). An anal- 
ysis of our simulation results revealed an error in the initial 
implementation of our Bayesian knowledge tracing model 
that was not identified in our previous work. Our simula- 
tions also revealed a more fundamental limitation of Deep 
Knowledge Tracing that prevents the model from supporting 
mastery learning on multi-step problems. Together, these 
two findings suggest that Apprentice agents provide a prac- 
tical means of evaluating knowledge tracing models prior to 
more costly classroom testing. Lastly, our analysis identi- 
fies a positive correlation between the Bayesian knowledge 
tracing parameters estimated from human data and the pa- 
rameters estimated from simulated learners. This suggests 
that model parameters might be initialized using simulated 
data when no human-student data is yet available. 


Keywords 
Computational Models of Learning, Simulated Students, 
Knowledge Tracing 


1. INTRODUCTION 

Intelligent Tutoring Systems (ITS) are used within K-12 ed- 
ucation to improve learning outcomes. In addition to provid- 
ing students with scaffolding and feedback, tutors utilize an 
approach called knowledge tracing to estimate what students
know and do not know [2]. When combined with a problem
selection policy [16], knowledge tracing enables tutors to sup-
port mastery learning and to focus students’ practice where it
is most needed (i.e., on the skills they do not yet know rather 
than the skills they already know). While many studies 
have explored knowledge tracing for offline evaluation (fit- 
ting knowledge tracing models to existing data sets), there 
is comparatively little work on evaluating these algorithms 
in online settings (evaluating how well these algorithms es- 
timate students’ mastery from just a few data points and 
decide when to stop giving them additional problems). 


We aim to understand which knowledge tracing models yield 
the greatest mastery learning efficiency in online settings. 
Additionally, we want to find out how the parameters for 
knowledge tracing models can be selected before human data 
is collected. To meet our need for multiple experiments to 
investigate our knowledge tracing questions, we introduce 
a novel way of using computational models of learning, or 
simulated student models that learn from interactions with 
a tutor just like human students do, to simulate our knowl- 
edge tracing experiments. We use the Apprentice Learner 
architecture [10], a machine-learning framework that aims 
to model how humans learn from examples and feedback to 
generate simulated students and conduct experiments. 


To explore the feasibility of this approach, we conducted experiments
to compare Bayesian Knowledge Tracing (BKT) [2]
to the Streak model [7] and Deep Knowledge Tracing (DKT) 
[15]. Our simulations show that both BKT and Streak stop 
before giving all the problems, but that BKT is slightly more 
aggressive than Streak and seems to assume students have 
mastered skills a bit earlier than expected. Upon further 
inspection, our analysis revealed a bug in our underlying 
implementation of BKT (which we fixed for this study). 
Further, we found that DKT exhibits strange behavior that 
makes it unusable in certain cases of mastery learning and 
problem selection. This limitation of DKT for mastery learn- 
ing has not been identified in prior work. These findings 
demonstrate that simulated students might serve a valuable
role in testing knowledge tracing models before more
costly classroom deployments. 


We also explore the use of simulated data from these exper- 
iments to estimate initial parameters for the BKT model. 
Prior to collecting human data, knowledge tracing param- 
eters are often set to reasonable hand-picked defaults. A 
better approach is to run a pilot study with human students
to collect data for model training, which requires additional
time and labor. Our analysis identifies a positive correlation 
between the BKT parameters estimated from the simulated 
data and those estimated from human data, suggesting sim- 
ulated data might be used to initialize parameters. 


2. BACKGROUND 
2.1 Knowledge Tracing 


The main purpose of knowledge tracing is to track students’ 
skill mastery and predict their future performance based on 
their past activity. Knowledge tracing uses labels of the 
skills needed for each step, which we call Knowledge Compo- 
nents (KCs), and students’ first attempt correctness (correct 
or incorrect) to predict whether the students will correctly 
solve a new problem containing the same KC. 


2.2 Bayesian Knowledge Tracing 

BKT [2] is a well-known knowledge tracing algorithm that 
can estimate whether students have learned a particular
skill given their past performance on opportunities to prac- 
tice that skill. It models student knowledge using a Hidden 
Markov Model, where the hidden state is estimated from 
observations of students’ correctness on each step. For each 
skill, a student can be in one of the two possible knowl- 
edge states: “unknown” or “known”. A binary response (cor- 
rect or incorrect) is generated at each opportunity a student 
practices a skill [5]. Although BKT supports the ability to
model forgetting, it is typically assumed that students never 
forget what they have mastered [16]. If a student reaches 
95% probability of being in the known state for a skill, the 
skill is marked as being mastered by the student. With these 
assumptions, the BKT model has four parameters. 

• P(L0): initial probability of mastery (“known”).
• P(T): probability of learning the skill (“learn”).
• P(G): probability of guessing the answer (“guess”).
• P(S): probability of making a mistake (“slip”).
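

The four parameters above combine in the standard BKT update: first condition the probability of the known state on the observed response, then apply the learning transition. The sketch below is a minimal illustration with no forgetting and a 95% mastery threshold, as described in the text; the function names are hypothetical, and it is not the implementation used in the study.

```python
def bkt_update(p_known, correct, p_learn, p_guess, p_slip):
    """One BKT step: condition on the observed response, then apply learning."""
    if correct:
        evidence = p_known * (1 - p_slip)
        posterior = evidence / (evidence + (1 - p_known) * p_guess)
    else:
        evidence = p_known * p_slip
        posterior = evidence / (evidence + (1 - p_known) * (1 - p_guess))
    # No forgetting: a learner in the known state stays there.
    return posterior + (1 - posterior) * p_learn

def opportunities_to_mastery(responses, p_init, p_learn, p_guess, p_slip,
                             threshold=0.95):
    """Return the 1-based opportunity at which P(known) first reaches the
    threshold, or None if it never does."""
    p_known = p_init
    for i, correct in enumerate(responses, start=1):
        p_known = bkt_update(p_known, correct, p_learn, p_guess, p_slip)
        if p_known >= threshold:
            return i
    return None

# Illustrative values: P(L0)=0.1, P(T)=0.05, P(G)=0.05, P(S)=0.02.
n_needed = opportunities_to_mastery([True, True, True], 0.1, 0.05, 0.05, 0.02)
```

With these illustrative values, two consecutive correct responses are enough to cross the 95% threshold, because a low guess probability makes each correct answer strong evidence of the known state.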


Researchers have created variants of the BKT model. Yudelson
et al. [20] introduced an individualized BKT model that
can take student differences in initial mastery and skill learn- 
ing probabilities into account. Nedungadi et al. [13] created 
PC-BKT (Personalized and Clustered), which has individual 
priors for each student and skill, and dynamically clusters 
students based on learning ability. This prior work aims to 
improve predictive performance over original BKT. Despite 
the quantity of research on BKT models, there is relatively 
little evaluation in online settings. 


2.3 Streak Model 


Another knowledge tracing approach that is popular for use 
within mastery learning is the Streak model. Also known 
as “three-in-a-row” [7], it is a relatively simple and intuitive
model since it has only one parameter: how many correct
answers in a row equate to mastery. It was first applied in
ASSISTments and the key idea was to keep giving the stu- 
dent questions until some proficiency threshold was reached. 
The default setting was “three correct in a row” but this 
could be manipulated by teachers. 
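
A minimal sketch of the streak criterion follows; the function name is hypothetical and the streak length is the single parameter.

```python
def streak_mastered(responses, streak_length=3):
    """Return the 1-based opportunity at which the learner first has
    `streak_length` correct answers in a row, or None if that never happens."""
    run = 0
    for i, correct in enumerate(responses, start=1):
        run = run + 1 if correct else 0  # any incorrect response resets the streak
        if run >= streak_length:
            return i
    return None
```

For example, a learner who answers correct, correct, incorrect, correct, correct, correct meets the default criterion at the sixth opportunity, since the error at the third opportunity resets the count.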


2.4 Deep Knowledge Tracing 
DKT [15] is a knowledge tracing model that has been receiv- 
ing increasing attention. This model leverages information 


about students’ sequence of steps and correctness on those
steps to predict performance on subsequent steps. Addition- 
ally, DKT leverages information about performance on one 
skill to improve predictive performance on other skills. DKT 
uses a long short-term memory (LSTM) architecture, which 
is a kind of recurrent neural network that allows for the 
modeling of non-Markovian processes. Recent work evalu- 
ating DKT [6, 8] suggests that DKT often outperforms BKT 
in terms of predictive performance. Despite this promising 
finding, there has been very little work exploring the use 
of DKT within online knowledge tracing (e.g., for mastery 
learning within a tutor). Beyond the original DKT work 
[15], which explores the use of DKT for next step recommendation,
we are unaware of any research programs that
currently use DKT in this way.


Finally, it is worth noting that knowledge tracing is not 
strictly necessary for tutoring systems. Many tutors either 
use a fixed problem sequence or present a fixed number 
of problems in a random order. However, we hypothesize 
that knowledge tracing in conjunction with a mastery learning
component is one of the main features that
makes tutors effective.


2.5 Computational Model of Learning 

The Apprentice Learner Architecture is a framework for 
modeling human learning from demonstrations and feedback
in educational environments [10]. We use an Apprentice
model developed in prior work; see [11] for a
complete description of the model. Most work in the field
of educational data mining focuses on building mathemati- 
cal, predictive models of learning. In contrast, the Appren- 
tice models actually perform the task (not just predict per- 
formance). They induce task-specific knowledge from the 
demonstrations and feedback they receive. Apprentice mod- 
els are ideal for the current study because they do not require 
prior human data to operate. They can predict learning and 
behavior based solely on the task structure. 


3. METHODOLOGY 


We created 30 simulated students (Apprentice agents) to 
solve problems in a fraction arithmetic tutor (tutor pre- 
sented in [14]). The tutor had three different types of prob- 
lems: Add Different (AD), add fractions with different de- 
nominators; Add Same (AS), add fractions with same de- 
nominators; Multiplication (M), multiply two fractions. 


3.1 Experiment Design 

Our study had six conditions: Random, Streak, BKT_default,
BKT_random, BKT_human and DKT_random. There were 
four types of conditions: Random, BKT, Streak, and DKT, 
which differ in the way they select the next problem to give 
to a simulated student. In the Random condition every 
problem was assigned only once in random order and the 
training ends when problems run out. Since Random gives 
the most training, it produces the highest correctness pre- 
diction by the end of practice. We use Random as a baseline 
for evaluating other models. 


Figure 1: Num. problems given in each condition (by type).


The other conditions use the respective knowledge tracing
approaches for mastery learning and problem selection. During
problem selection, each knowledge tracing model randomly
chooses a problem with at least one unmastered skill
and updates the student’s mastery level based on the result.
The training ends once the proficiency threshold is reached
(95% in all cases but Streak, where it was 3 in a row).
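
The selection loop described above can be sketched as follows. The callback interface (is_mastered, update, attempt) is hypothetical, and giving each problem at most once is an assumption made here to keep the example finite; the toy usage plugs in a three-in-a-row tracker and a learner who always answers correctly.

```python
import random

def run_mastery_session(problems, is_mastered, update, attempt, rng=None):
    """Give problems until every skill is mastered or problems run out.

    problems: dict mapping problem id -> list of skills (KCs) it exercises.
    is_mastered(skill) -> bool, update(skill, correct) -> None, and
    attempt(problem, skill) -> bool are caller-supplied callbacks.
    """
    rng = rng or random.Random(0)
    remaining = dict(problems)
    given = []
    while True:
        # Candidates still exercise at least one unmastered skill.
        candidates = sorted(
            p for p, skills in remaining.items()
            if any(not is_mastered(s) for s in skills)
        )
        if not candidates:
            return given
        problem = rng.choice(candidates)
        for skill in remaining.pop(problem):
            update(skill, attempt(problem, skill))
        given.append(problem)

# Toy usage: streak tracker and an always-correct learner.
streaks = {}

def streak_update(skill, correct):
    streaks[skill] = streaks.get(skill, 0) + 1 if correct else 0

toy_problems = {f"p{i}": ["add_same_numerator"] for i in range(10)}
given = run_mastery_session(
    toy_problems,
    is_mastered=lambda skill: streaks.get(skill, 0) >= 3,
    update=streak_update,
    attempt=lambda problem, skill: True,
)
```

Swapping the tracker callbacks for BKT or DKT estimates changes only how mastery is judged; the selection loop itself stays the same.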


The parameters in BKT_default were manually set based on 
our prior experience with BKT. We set P(L0) to 0.1, P(T) to
0.05, P(G) to 0.05 and P(S) to 0.02. These parameters are 
identical for each KC. The BKT_random and BKT_human 
parameters were estimated using the BKT module on Learn- 
Sphere [9]. The human-based parameters were obtained 
from fitting BKT to the “Fraction Addition and Multipli- 
cation” dataset accessed via DataShop (Koedinger et al., 
2010). The random-based parameters were obtained from 
fitting BKT to the log data generated from simulated stu- 
dents in the Random condition. 


To support the use of DKT within online mastery learning,
we created our own implementation using PyTorch’s LSTM
module.¹ Based on prior work [8], the model has 200 nodes
in the hidden layer, uses a dropout of 0.4 during training, 
and uses a batch size of 5 (our sequences were longer than 
those in prior work, so a smaller batch size works well). This 
implementation supports the ability to fit DKT to data pre- 
sented in standard DataShop [9] format. Trained models 
have a simple interface for use in online knowledge tracing 
settings. Similar to BKT, we fit DKT to the log data gen- 
erated from simulated students in the Random condition to 
estimate model parameters. 


3.2 Simulation Studies and Evaluation 

During the experiment process, we created 30 simulated stu- 
dents for each of the six conditions and analyzed the data 
that they generated. For these experiments, we created a 
KC model that labels each step as a combination of “Prob- 
lem Type” and “Selection”. There are 14 unique KCs in 
our analysis, 8 for Add Different, 3 for Add Same and 3 
for Multiplication. As the Additive Factors Model (AFM) 
is often used to examine learning curves from existing data 
[1], we used pyAFM [12] (a python implementation of AFM) 
to predict the probability that students will get a next step 
with the respective skill correct at the end of their practice. 
This provided an independent means for us to estimate how 
well each knowledge tracing approach did at appropriately 


¹ Open-source code for the model is available here:
https://gitlab.cci.drexel.edu/teachable-ai-lab/dkt_torch. 


recognizing when students had achieved mastery.


Figure 2: Learning curves for four conditions.


The AFM model assumes that performance monotonically 
converges to zero error in the tail. However, both humans 
and simulated students have non-zero error in the tail of 
their learning curve. This violation of the AFM model’s as- 
sumption causes the model to estimate lower learning rates 
in order to accommodate non-zero error in the tail. We 
found that because the simulated students sometimes get a 
much larger number of practice opportunities than human 
students (e.g., 80 vs. 30 practice opportunities), the bias 
in AFM’s learning rates was non-trivial. To address this 
challenge, we utilized the Additive Factors Model + Slip 
(AFM+S) approach [12], which explicitly models non-zero 
error rates in the tail using additional “slipping” parameters 
for each KC. The AFM+S model better fit the simulated 
student data from all six conditions than the AFM model 
(three-fold cross-validated RMSE = 0.240 vs. 0.257). Qualitatively,
we found that the AFM+S learning curves seemed
to better fit the data, particularly for slopes at the beginning
of difficult-to-master skills.


4. SIMULATION RESULTS 
4.1 Online Knowledge Tracing Results 


Figure 1 shows the numbers of problems administered by the
tutoring system in each condition. Random always gives
all 80 problems of each type. Streak gives around 17 problems
for AD, 10 problems for AS, and 9 problems for M.
BKT_default gives around 11 problems for AD and 6 problems
for AS and M. BKT_random has statistics similar to
BKT_default, while BKT_human gives around
14 AD problems, 9 M problems, and around 6 AS problems.
DKT_random gives around 78 AD problems, almost as many
as Random; however, it gives fewer than 3 problems in AS
and M. The number of problems given by BKT_human is
slightly higher than the number given by BKT_random. We
hypothesize that this is because the BKT_random parameters
were fit specifically to the simulated students, so when used
for knowledge tracing they provide better estimates of mastery
than the BKT_human parameters.


To get a better sense of the overall differences between Ran- 
dom, Streak, BKT (BKT_random), and DKT, we plotted 
the overall learning curves for the data from these conditions;
see Figure 2. We can see from this figure that BKT
stops giving practice earlier than Streak, which in turn
stops giving practice earlier than Random and DKT. We
observe higher variance in the tail of the learning curves for
BKT and Streak because the total number of students is
decreasing as each student reaches mastery.


Figure 3: The AFM+S predicted probability at the end of training for each KC averaged over all students.


We applied the AFM+S model to predict performance on
a hypothetical next opportunity for each KC and student.
Figure 3 shows the average predicted correctness (across students)
after the final practice opportunity for each skill. For
most KCs, the prediction is higher than 95%, which suggests
that mastery has been attained for these KCs. Unfortunately,
the KC “AD Answer Denominator” has the lowest
overall next-step correctness prediction in all six conditions.
Figure 4 displays the learning curve for this skill across all
six conditions and the number of students that have not
yet mastered the skill at each point. This graph shows that
there is a high slipping rate for this particular skill (see green
line), indicating that there is a ceiling on the best possible
AFM+S prediction that can be achieved.


Figure 4: Learning curve for “AD Answer Denominator” and
number of unmastered students at each opportunity.


Condition      AD Average   AS Average   M Average
Random         0.01         0.01         0.01
Streak         0.06         0.11         0.12
BKT_default    0.09         0.16         0.16
BKT_random     0.10         0.17         0.15
BKT_human      0.08         0.16         0.12
DKT_random     0.01         0.30         0.31


Table 1: Model efficiency scores across six conditions. 


To evaluate how well each approach handles the trade-off between
maximizing students’ performance and minimizing
the amount of practice, we computed a metric that we call
“efficiency” (see Table 1). To compute the score, we divided
the AFM+S prediction at the end of training by the number
of opportunities the student received, for each student
and KC. We then averaged over students to get a score for
each KC. Finally, we averaged over KCs within each problem
type. This produced three efficiency scores for each
of the six models. A larger value indicates a more efficient
model. The efficiency score complements accuracy and provides
more information for selecting the best model.
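

The three averaging steps described above can be sketched in a few lines of Python. This is a minimal illustration rather than the code used in the study: the function name, the record layout, and the numbers in the example are all hypothetical, and the final predictions are assumed to have been computed beforehand by a fitted AFM+S model.

```python
from collections import defaultdict

def efficiency_scores(records, kc_to_type):
    """Sketch of the efficiency metric (hypothetical data layout).

    records: iterable of (student, kc, final_prediction, n_opportunities)
    kc_to_type: dict mapping each KC name to its problem type (e.g. "AD")
    """
    # Step 1: prediction divided by opportunities, per student and KC.
    per_kc = defaultdict(list)
    for _student, kc, prediction, n_opportunities in records:
        per_kc[kc].append(prediction / n_opportunities)

    # Step 2: average over students to get one score per KC.
    kc_score = {kc: sum(vals) / len(vals) for kc, vals in per_kc.items()}

    # Step 3: average over KCs within each problem type.
    per_type = defaultdict(list)
    for kc, score in kc_score.items():
        per_type[kc_to_type[kc]].append(score)
    return {ptype: sum(vals) / len(vals) for ptype, vals in per_type.items()}

# Made-up example values, not results from the paper.
example_records = [
    ("s1", "AD Answer Numerator", 0.9, 10),
    ("s2", "AD Answer Numerator", 0.8, 20),
    ("s1", "AD Answer Denominator", 0.6, 12),
]
example_types = {"AD Answer Numerator": "AD", "AD Answer Denominator": "AD"}
print(efficiency_scores(example_records, example_types))
```

Because the prediction is bounded by 1, a condition that administers many problems per KC can only obtain a small score, which is consistent with Random scoring 0.01 in Table 1.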


Although Random gives the highest prediction in Figure 3
across all KCs, it is the least efficient condition because it gives all
the problems during training. BKT_random has lower predictions
than Streak; however, its efficiency score suggests
that it is more efficient. DKT_random yields the same efficiency
as Random on AD problems. However, for AS and M
problems, it appears to have the highest efficiency across all
six models since it requires the least practice but still achieves
a moderate correctness prediction.


4.2 Simulated vs. Human BKT parameters 

To validate the feasibility of generating BKT parameters
using simulated student data, we ran a correlation analysis
of the BKT_random and BKT_human parameters. Figure
5 shows a positive correlation of around 0.65
for the “Learn” parameter, which suggests that the simulated
students generated by Apprentice Learner have learning
rates similar to those of human students. We argue that this is one of the
harder parameters to set. The “Known” parameter (P(L0))
was near 0 for all skills in the simulated data because all
agents start off without any prior knowledge. In human
data, this parameter will vary based on the learning context.
The “Guess” and “Slip” parameters based on the simulated
data were reasonable (both greater than 0), but exhibited
no notable correlation with the human guess and slip values.
Taken together, we argue that this approach is a promising
way to identify initial parameter values for BKT, but more
research is needed to explore and generalize this idea.


Figure 5: Positive correlation in the BKT “Learn” parameter
values estimated from simulated and human data.
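

As an illustration of the kind of analysis described in this subsection, the sketch below computes a Pearson correlation between two lists of per-KC “Learn” values. It assumes that a Pearson coefficient is the statistic of interest, which the text does not state explicitly, and the parameter values in the example are invented placeholders rather than the fitted values from the study.

```python
import math

def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sum of cross-products and sums of squared deviations from the means.
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)

# Hypothetical per-KC "Learn" values (one entry per KC), for illustration only.
learn_simulated = [0.10, 0.25, 0.40, 0.15, 0.30]
learn_human = [0.12, 0.20, 0.35, 0.22, 0.18]
print(round(pearson_r(learn_simulated, learn_human), 2))
```

The same function could be applied to the “Guess” and “Slip” values to check whether they show any comparable relationship.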


4.3 Knowledge Tracing Sequence Analysis 
Next, we take a closer look at the estimates of several knowledge
tracing approaches for different sequences. Figure 6
shows the correctness sequence of a single student on the “AD
Answer Denominator” KC. This student was taken from the
BKT_random condition. We added the mastery predictions
generated by BKT, Streak, and DKT given this sequence.
For BKT and Streak, only the correctness on this skill was
used. For DKT, the model was given the entire student sequence
for all KCs, but only the predictions for this KC are
shown. Predictions were taken at the point where the student had just
finished the problem that contained the target KC.


For BKT, the estimates trend towards increased mastery
over the course of practice, but the probability sometimes
decreases when the student gets an item wrong. For Streak, each
correct response yields a 33% increase in the mastery prediction,
accumulating to 100% by the third correct response in
a row. The DKT model’s predictions tend to jump around
and generally do not seem to increase despite the student getting
the problem correct multiple times in a row. For example,
the model’s predicted probability of a correct response jumps to 100% before
returning to and staying close to 0%.
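

To make the BKT and Streak behavior described above concrete, the following sketch implements the textbook BKT posterior update and a three-in-a-row streak estimate. It is not the implementation used in the study: the function names are hypothetical, the parameter values are arbitrary placeholders, and the update assumes the guess probability is greater than 0 so that the denominator cannot vanish.

```python
def bkt_update(p_known, correct, learn, guess, slip):
    """One standard BKT step: condition on the response, then apply learning."""
    if correct:
        evidence_known = p_known * (1 - slip)
        evidence_total = evidence_known + (1 - p_known) * guess
    else:
        evidence_known = p_known * slip
        evidence_total = evidence_known + (1 - p_known) * (1 - guess)
    posterior = evidence_known / evidence_total
    # Chance of moving from unlearned to learned after this opportunity.
    return posterior + (1 - posterior) * learn

def streak_mastery(responses, n=3):
    """Mastery estimate after each response: consecutive correct count over n."""
    streak = 0
    trace = []
    for correct in responses:
        streak = streak + 1 if correct else 0
        trace.append(min(streak, n) / n)
    return trace

# Hypothetical response sequence and placeholder parameters.
responses = [False, True, False, True, True, True]
p = 0.05
for r in responses:
    p = bkt_update(p, r, learn=0.2, guess=0.2, slip=0.1)
    print(f"correct={r}  P(known)={p:.2f}")
print(streak_mastery(responses))
```

An incorrect response lowers the BKT posterior before the learning term is added, which is why the estimate can dip mid-sequence, whereas the streak estimate resets to zero on any error.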


Figure 6: Different model predictions on “AD Answer Denominator”
given a student’s correctness sequence.


To figure out why the DKT model has such erratic behavior 
and why it gives so many AD problems, we fed a complete 
sequence from one student into the DKT model (a student
from the BKT_random condition). Figure 7 shows the prediction
of mastery for each KC after the student has completed
each problem (problem type shown on the x-axis). It seems 
that the student never masters the “AD Answer Numerator” 
or “AD Answer Denominator”, which explains why the DKT 
tutor is giving almost all the AD problems to the students. 


Upon further investigation, we discovered that the DKT 
model has a fundamental issue that makes it difficult to use 
for mastery learning. The issue is caused by using the DKT 
predictions between problems when the mastery learning 
system is determining which KCs are mastered before pick- 
ing another problem. Unfortunately, for multi-step problems 
some KCs cannot be correctly applied on the first step. DKT 
correctly predicts these KCs will have near 0% correctness 
(any attempts will be incorrect). However, this has the side 
effect of confusing the mastery learning system into thinking 
that the KC is unmastered. When the DKT model actually 
reaches a step where the KC can be correctly applied, then 
its predicted probability jumps to a more realistic estimate 
of the mastery. This problem was not identified in previous 
work on mastery learning with DKT (e.g., [15]) because the 
prior work only looked at problems with a single step, so 
this issue never occurred. However, most problems within 
tutoring systems are multi-step. Future work should explore 
how to correct this issue within DKT so it can be used for
mastery learning.


Figure 7: DKT predictions for each KC after each problem.


5. SUMMARY OF KEY FINDINGS 


Our first key finding is that simulated students can successfully
be used to evaluate online knowledge tracing models. Our
simulations indicate that Streak works well for mastery learning
and has reasonable efficiency, although the model gives a bit
more practice than strictly necessary (e.g., the average number
of steps to master all AD KCs is around 17 in Streak and
10 in BKT_random). Still, Streak is very simple to operate,
implement, and modify, and it behaves reasonably well.


BKT also works well for mastery learning and generally seems
to have the best efficiency of the approaches we compared
(although DKT seems to be more efficient for AS and M
problems). However, it seems to stop a little early in some
cases, resulting in under-practice. Figure 3 shows that the
BKT_random model gets an 86% correctness prediction for the
KC “AD Answer Numerator”, 73% for “AD Answer Denominator”,
and 83% for “AS Answer Denominator”. To some
extent, this suggests that these KCs might be more difficult for
both simulated students and human students to master, and
that more practice is required to obtain mastery. Our work suggests
that it is important to look at multiple factors when
evaluating knowledge tracing approaches in online settings.
In particular, it is important to look at both student mastery
and the amount of practice that is administered. To
explore this trade-off, we proposed the efficiency metric. We
believe that this metric (or ones like it) might be useful for
evaluating knowledge tracing approaches in online settings.


Multiple studies [18, 17] suggest that the DKT model has
good predictive performance in offline settings. However,
we found that it does not seem to work properly for on- 
line knowledge tracing, particularly in cases of multi-step 
problems. Yeung & Yeung [19] identified that DKT’s pre- 
dictions for KCs are not consistent across time-steps, which 
we believe is related to the issue we encountered. Even when 
a student performs well on a KC, DKT’s predicted performance
for that skill may drop. In general, DKT’s predictions
fluctuate drastically over time, so the point at which its predictions are
sampled has a large impact on its accuracy. The core issue
is that DKT generates predictions for every KC at every 
step, but its loss function only constrains predictions that 
are likely to occur next. Yeung & Yeung suggested some 
possible modifications to the DKT objective function to mit- 
igate this problem, but implementation and testing of these 
was beyond the scope of the current study. 


In preliminary analysis, BKT estimated that students had reached
mastery even though their error rates were still high. Further
inspection revealed an error in our BKT model that was
causing it to incorrectly estimate student mastery. Multiple
researchers across multiple labs have used this open-source
implementation. Despite its wide use, we uncovered issues that
had not previously been identified. Although we corrected
these issues for the current study, we argue that this is a 
positive outcome for our simulated student approach. This 
finding reinforces the idea that simulated students can be 
used to test and improve knowledge tracing approaches be- 
fore running more costly human studies. 


Our second key finding is that researchers might use data 
generated by simulated students to initialize knowledge trac- 
ing parameters when human data is not available. To evalu- 
ate the feasibility of this idea for BKT, we conducted a corre- 
lation analysis between the random-based BKT parameters 
and human-based parameters. We found a strong correlation
between the learning rate parameters, suggesting that
initializing BKT parameters using simulated student data
might be an informed yet cost-effective approach.


6. RELATED WORKS 


The work closest to ours is the set of simulation studies conducted
by Doroudi et al. [4, 3], which investigate different knowledge
tracing approaches using simulated students. They argue
that it is important to evaluate knowledge tracing under
various assumptions about how students learn. One of the
major differences between their approach and ours is that
they use statistical models that predict correctness to simulate
students rather than computational models of learning
that actually learn and perform the task, as we do with Apprentice
agents. Apprentice agents are more complex than
the knowledge tracing approaches that are being used to
evaluate them. It would be interesting to explore the use of
Apprentice agents as another kind of student model for the
knowledge tracing evaluations proposed by Doroudi et al.


7. CONCLUSIONS AND FUTURE WORK 


We were able to successfully apply simulated students to 
test different knowledge tracing models. When we com- 
pared the three knowledge tracing models (BKT, Streak, 
and DKT) to a no-knowledge-tracing baseline (Random), 
we found that BKT gave the fewest problems, Streak gave
the second fewest, Random gave the most, and DKT gave almost
as many as Random in one problem type and the fewest
in the other two. In general, we found that BKT seemed to
be the most efficient approach, but Streak gave reasonable
results despite its simplicity. Through the use of simulated
students, we also discovered a number of issues with our
BKT implementation as well as a fundamental issue with DKT.
Despite widespread use of the BKT implementation and a
lot of recent investigation into the DKT model, these issues
had not been discovered in prior work. Together, these results
support our primary claim that simulated students
are an effective tool for investigating and evaluating online
knowledge tracing approaches.


Our analysis also found evidence to support the idea that 
simulated student data might be used to initialize BKT pa- 
rameters when no human-student data is available. In par- 
ticular, we found that BKT learning rates estimated from 
simulated data have a significant correlation to the learning 
rates estimated from human data. While these initial results 
are promising, more work is needed to further explore these 
ideas. In particular, we would like to try running human- 
subject experiments to compare BKT models initialized us- 
ing simulated student data to those with default parameters. 
One surprising finding is how well BKT_default performs; 
despite somewhat arbitrary parameters, it was more efficient 
than Streak. Future work should explore how to manually 
pick robust default values for BKT. 


We have a number of additional future directions we would 
like to explore. We intend to individualize the Apprentice 
models to make them better mimic the behaviors of dif- 
ferent kinds of learners (e.g., high vs. low performing stu- 
dents), students with different motivation in learning, and 
those who suffer from learning disabilities. We should also 
explore variations of DKT that address concerns we have 
identified and enable its use in online mastery learning. Fi- 
nally, we should move beyond simulation and explore how 
well our simulated students predict which knowledge tracing 
approaches will yield the best learning for human students.


ABSTRACT


Knowledge tracing algorithms such as Bayesian Knowledge 
Tracing (BKT) can provide students and teachers with help- 
ful information about their progress towards learning ob- 
jectives. Despite the popularity of BKT in the research 
community, the algorithm is not widely adopted in educa- 
tional practice. This may be due to skepticism from users 
and uncertainty over how to explain BKT to them to foster 
trust. We conducted a pre-registered 2x2 survey experiment 
(n=170) to investigate attitudes towards BKT and how they 
are affected by verbal and visual explanations of the algo- 
rithm. We find that ostensible learners prefer BKT over a 
simpler algorithm, rating BKT as more trustworthy, accu- 
rate, and sophisticated. Providing verbal and visual expla- 
nations of BKT improved confidence in the learning appli- 
cation, trust in BKT and its perceived accuracy. Findings 
suggest that people’s acceptance of BKT may be higher than 
anticipated, especially when explanations are provided. 


Keywords 
Bayesian Knowledge Tracing, Data Visualization, Explain- 
able AI 


1. INTRODUCTION 


Knowledge tracing can offer students and teachers a real-time
understanding of what students have already learned
and what they are still struggling with [7]. It provides actionable
insights that can lead to better educational outcomes
[16]. Among the many types of knowledge tracing algorithms,
Bayesian Knowledge Tracing (BKT) has been established
and researched most extensively, as evidenced by
the 114,000 Google Scholar results for “Bayesian Knowledge
Tracing,” 17,500 of which were published since 2020. BKT
has been tested to help students self-monitor their learning
progress [4, 23], to help teachers understand what students
have not learned yet [22], and to enable adaptive learning
technologies that let students skip over the content they have
mastered [18]. In contrast to the abundance of research on
BKT, including hundreds of articles devoted to incremental
enhancements of the original model [20], there are not many
real-world applications that use BKT in practice. Some of
the most widely used K-12 learning platforms, such as ASSISTments
and Khan Academy, decided against using BKT in
favor of simpler models such as N-Consecutive Correct Responses
(N-CCR) [13]. This raises questions about barriers
to adopting knowledge tracing algorithms in educational
practice. In particular, how much are the relative complexity
and opacity of BKT responsible for its slow adoption? Platform
providers may be concerned that educators and learners
will not trust a model that cannot easily be explained to
them [13, 12, 24, 25, 1].


The Technology Acceptance Model (TAM) posits that a 
user’s acceptance and adoption of new technology is based 
on its perceived usefulness (PU) and perceived ease of use 
(PEOU) [9]. PU and PEOU are beliefs that can be influ- 
enced by external factors, such as providing additional in- 
formation about a technology. According to TAM, learn- 
ers’ and educators’ PU and PEOU are essential factors in 
the adoption of BKT in practice. Improving their percep- 
tions could therefore increase the acceptance and adoption of 
BKT in real-world applications. Moreover, a better under- 
standing of the mechanisms behind the acceptance of BKT 
is expected to inform the presentation of other knowledge 
tracing algorithms as well. 


A large number of knowledge tracing algorithms have been 
developed over the years that could benefit from empiri- 
cal evidence on how to explain them to users. Recent ad- 
vances in artificial intelligence have inspired research into 
more complex algorithms such as deep knowledge tracing 
(DKT), which uses neural networks [17, 11]. With more 
complex algorithms that provide less insight into their inner 
workings, it becomes more important to understand how 
people’s trust in the algorithm and its perceived accuracy 
might influence perceptions of usefulness and usability of 
a learning application [1]. Besides BKT and DKT, which 
are suitable for modeling understanding and sense-making, 
there are also logistic learning models, such as Additive Fac- 
tor Models and Performance Factor Analysis [5, 19, 20], 
which model memory and fluency [20]. These two types 
of models can also be integrated into one [15]. While there 
are many types of models that can be examined, we choose
BKT as an example of a knowledge tracing algorithm that is
relatively simple and popular among researchers.


This research contributes causal evidence to address three
important research questions. First, do people prefer to 
learn with BKT or N-CCR (N-Consecutive Correct Re- 
sponses) in an ostensible high-stakes test scenario? Second, 
how is their preference related to specific attitudes, includ- 
ing their confidence in the learning system to do well on a 
test, their trust in the algorithm, and the perceived accu- 
racy of the algorithm? And third, how do verbal and/or 
visual explanations affect people’s attitudes and preferences 
over knowledge tracing algorithms? We answer these re- 
search questions with data collected from a pre-registered 
2x2 factorial survey experiment. 


2. BACKGROUND 


One of the simplest knowledge tracing algorithms is N-CCR. 
It assesses student mastery by evaluating the number of con- 
secutive correct responses for a particular skill. For example, 
the model determines that a student has learned fractions 
after correctly answering three fraction questions in a row. 
Although N-CCR is easy to understand, its simplicity can 
sometimes make it less accurate than BKT. Still, N-CCR 
has been used in popular platforms, including ASSISTments 
and Khan Academy [13], and there is mixed evidence as 
to whether BKT outperforms N-CCR at modeling student 
learning [8, 10, 13, 21]. Nevertheless, the scientific com- 
munity shows a clear preference for BKT (and other more 
complex knowledge tracing algorithms) based on the alloca- 
tion of research attention. 


BKT is a two-state Hidden Markov Model where the unob- 
served hidden state being modeled is student learning, and 
for a given knowledge component, a student has a state of 
either learned or not learned [6, 17, 11]. Although BKT is 
already more sophisticated than N-CCR, critics have suggested
that BKT is too simple an algorithm for modeling
human learning. They point to deep learning (neural network)
models to better represent all factors that go into student
learning [17, 11]. Mao and colleagues [17] found that deep 
learning models outperformed BKT on some learning tasks. 
However, they also acknowledge that these gains in perfor- 
mance might not be worth the loss in model interpretability. 
While researchers tend to consider BKT as one of the sim- 
pler and more explainable algorithms for knowledge tracing, 
practitioners and learners who are the end-users may not 
share this view. 


The explainability of an algorithm, which is partly determined
by how transparent, understandable, and interpretable it
is, can play an essential role in its adoption into applications.
Barredo Arrieta and colleagues [1] identified these
and other reasons for making algorithms more explainable;
most relevant to the work on BKT are trustworthiness, confidence,
causality, and accessibility. Prior research on algorithms
in education has echoed this finding. Kizilcec [14]
found that increasing transparency by providing users with
additional information about an algorithm made users trust
the algorithm more (though too much information can erode
trust). Other studies have more specifically examined the
interpretability of BKT in learning applications. Yeung [24]
explored the use of Item Response Theory to make BKT
and deep learning models more explainable, but did
not examine how users react to it. Zhou and colleagues [25]
examined BKT explainability by creating visualization
“explainables.” They then designed an experiment to compare
the effectiveness of a static and an interactive visualization
and found that the static explainable led to a better
understanding of the BKT algorithm. More generally, research
on Open Learning Models (OLMs) has advanced an
understanding of how to visualize and explain learning models
[3, 2]. OLMs provide users with interactive visualizations
that grant them insights into learning algorithms, along with
the ability to adjust the algorithm. This study will add to
OLM research by expanding knowledge on how to explain
and visualize information to foster positive attitudes.


The current study provides a foundational understanding of 
how individuals perceive BKT compared to N-CCR along 
several attitudinal dimensions, and how much verbal and 
visual explanations of BKT can improve those perceptions. 
Our review of prior work informed the following two hy- 
potheses: 


H1. Verbal and visual explanations of BKT lead participants 
to prefer it over N-CCR. 


H2. Verbal and visual explanations of BKT will improve
participants’ attitudes about the BKT algorithm.


3. METHODS 


The study design, materials, measures, and analysis approach
are pre-registered with the Open Science Framework:
https://osf.io/7c5zt/. To refine the study design, measures,
and analysis plan, we ran a pilot study with 26 participants 
and used both descriptive and inferential statistical analyses 
to build our analysis plan. We first used descriptive analysis 
to estimate survey completion time, ensure we had enough 
variance in responses, and check that the information provided
to participants was sufficient for them to
evaluate the algorithms. We used respondents’ answers and
an open-ended question at the end of the survey in which 
we asked participants for any feedback to improve the sur- 
vey. We took the results from this pilot study to alter the 
visualizations and information provided to participants and 
rephrase some questions to improve clarity. We removed 
the open-ended feedback question from the survey after the 
pilot. 


3.1 Participants 

Participants were recruited from Amazon Mechanical Turk 
and received $1.70 for completing a 10-minute survey. The 
study was advertised as seeking input on test preparation 
applications. To determine our target sample size of 170, 
we used G*Power to conduct a power analysis. Our analysis 
goals were to obtain 95% power to detect a medium effect 
size of 0.25 at the standard 0.05 alpha error rate with six 
repeated measures and four groups. While we had 170 par- 
ticipants who took the survey, 34 participants either failed 
to answer all of the comprehension questions correctly (29) 
or had prior experience with BKT (4) or both (1). Analyses 
were conducted on the remaining 136 respondents. Table 1
describes the sample demographics.


3.2 Procedure 

To contextualize the study, participants were provided the 
following narrative with pictures of two sample questions 
taken from the ASSISTments platform:


Figure 1: Two versions of a visualization of student performance on questions shown to participants depending on their condition assignment.


Table 1: Sociodemographics of participants included in the
study.


                                                n      %
Gender      Woman                               79     58.1
            Man                                 54     39.7
            Transgender Man                     1      0.7
            Gender Variant/Non-Conforming       2      1.5
Ethnicity   Hispanic                            10     7.4
            Not Hispanic                        126    92.6
Race        White                               100    73.5
            Black or African American           10     7.3
            American Indian or Alaska Native    2      1.5
            Asian                               16     11.8
            Not Listed                          5      3.7
            Multiracial                         3      2.2
Age         18-24                               27     19.9
            25-34                               52     38.2
            35-44                               37     27.2
            45-54                               8      5.9
            55-64                               11     8.1
            Above 65                            1      0.7


As an admissions requirement for a university program that 
you are applying for, you are preparing to take a general 
knowledge exam. The test is important to you and you need 
to do as well as possible to get accepted. 

You have decided to use a test preparation app to help you 
study for the test. 

A key feature of the test prep app is that it personalizes the 
learning experience to help you study efficiently. The app 
shows you only questions about topics that you have 
not already learned. 

The system keeps track of your answers to each question and 
automatically moves to the next topic once it deter- 
mines that you have learned the previous topic. To 
determine if you have learned a topic, the app uses an al- 
gorithm. Once the algorithm determines you learned a 
topic, it will stop giving you study questions about it. Thus, 
it also determines the speed at which you progress in your 
test prep. 

We would like to get your opinions about the two different 
algorithms to understand which one you find more accurate 
and trustworthy. 


On the next page, participants answered three multiple- 
choice comprehension questions: (1) How does the test prep 
app determine what questions to give you? (2) What de- 
termines how quickly you are going to be done with test 
prep? (3) What happens when the system determines that 
you have learned a topic? We pre-tested these questions
to ensure that an attentive reader would have no problems
answering them correctly.


Next, participants saw a short description of the N-CCR
algorithm, which we labeled as 3 Right in a Row (3RR):
“A topic will be considered learned once a student correctly
answers three questions in a row.” A simple table depicting
a sample student’s progression for four topics (table rows)
and questions for each topic (table columns) accompanied
the description. The table looked like Figure 1a. Each
cell contained an X or a ✓ depending on whether the student
answered the question correctly. True to the 3RR algorithm,
each topic was considered learned once three consecutive
questions were answered correctly. At the bottom of the
page, participants answered several questions about their
attitudes towards the 3RR algorithm (see Measures).


At this point, participants were randomly assigned to con- 
ditions based on a 2x2 factorial design. There were 33 par- 
ticipants in the No BKT Explanation/BKT Simple Visu- 
alization condition in the final sample, 34 in the No BKT 
Explanation/BKT Detailed Visualization condition, 38 in 
the BKT Explanation/BKT Simple Visualization condition, 
and 31 in the BKT Explanation/BKT Detailed Visualization 
condition. 


The following page mirrored the structure of the previous 
one but for BKT, providing a description and sample learn- 
ing progress visualization based on the experimental assign- 
ment, followed by the same set of attitudinal questions about 
the algorithm. Next, on the final page of the survey, partic- 
ipants were asked to compare the two algorithms. 


3.3 Experimental Manipulations

In the no BKT explanation condition, participants received
this one-sentence description of the BKT algorithm: “A
topic will be considered learned once the algorithm estimates 
with a high probability that a student has learned the topic 
based on their responses up to that point.” In the BKT ex- 
planation condition, participants additionally received the 
following information about the BKT algorithm: 


After every question you answer, the Bayesian Knowledge 
Tracing algorithm estimates the probability that you have 
now learned a topic using a probabilistic model that accounts 
for the following data: 

- an initial probability that you have learned the topic based
on your first answer: it is higher if you answered correctly
- a correct guess probability: e.g., 50% for a true/false
question
- a slip probability for answering incorrectly even though
you already learned the topic
- the difficulty of questions you have answered based on how
many people have answered them incorrectly
- performance data such as the number of hints that you
asked for and the time it took you to answer the question


Using all of this information, the algorithm estimates
the probability that you have learned a topic. If the probability
is above 95%, the algorithm moves you on to the next
topic.


In the BKT simple visualization condition, participants re- 
ceived a simple table depicting a sample student’s learning 
progress mirroring the one shown for the 3RR algorithm (Figure
1a). In the BKT detailed visualization condition, the
same table was enhanced to show the estimated probability 
of having learned the topic using a color scale (Figure 1b). 
To make the visualizations realistic, we ran a BKT algo- 
rithm over a sample of ASSISTments data and used a 95% 
probability to determine mastery. 


3.4 Measures 

We measured participants’ attitudes towards each algorithm
using six items rated on 5-point unipolar response scales
(‘Not at all’, ‘Somewhat’, ‘Moderately’, ‘Very’, ‘Extremely’):


Confidence: “How confident are you that the test prep app
with this algorithm will prepare you to do very well on the
test?”

Understanding: “How well do you understand how this
algorithm determines if you have learned a topic?”

Sophistication: “How complex is this algorithm for determining
if you have learned a topic?”

Accuracy: “How accurate is this algorithm at determining if
you have learned a topic?”

Trust: “How much do you trust this algorithm to determine
what you have learned?”

Speed: “How quickly do you learn the materials for the test
using this algorithm?”


At the end of the survey, participants rated their general
preference between the two algorithms in response to the following
question: “Now that you have learned about the 3 Right
in a Row (3RR) and Bayesian Knowledge Tracing (BKT)
algorithms, which one would you prefer to use for your test
prep?” Response options were on a 7-point bipolar scale:
‘Strongly prefer 3RR’, ‘Moderately prefer 3RR’, ‘Slightly
prefer 3RR’, ‘Neither prefer 3RR nor BKT’, ‘Slightly prefer
BKT’, ‘Moderately prefer BKT’, ‘Strongly prefer BKT’.
Participants were invited to provide a rationale for their
preference using an open-ended question: “Please tell us why
you prefer the algorithm that you choose above.”


3.5 Analytical Approach 

We used the pilot study data to finalize our analysis plan 
by developing our inferential analysis. For H1, we decided 
to use linear regression to understand if the conditions had 
an association effect on the participants’ overall preference. 
We used the conditions as the predictor variables and the 
preference as the outcome variable. We next decided to use 
multiple linear regression to understand the association be- 
tween the attitudinal constructs and algorithm preferences. 
This analysis used the attitudinal constructs as the 
predictor variables with preference as the outcome variable. The 
last planned analysis evaluated H2 by running a linear 
regression on each attitudinal construct with the conditions 
as the predictor variables and the attitudinal construct as 
the dependent variable. While the interpretation of linear 
regression output is clear and familiar, we acknowledge that 
our measures are ordinal and not strictly continuous. We 
confirmed that analysis by ordinal logistic regression yields 
equivalent results. 


[Figure 2 plot. X-axis condition labels as extracted: BKT 
Detailed Visualization; BKT Simple Visualization; No BKT 
Explanation/BKT Detailed Visualization; No BKT 
Explanation/BKT Simple Visualization. Y-axis: Avg. Algorithm 
Preference (1=Strongly prefer 3RR, 7=Strongly prefer BKT).] 

Figure 2: Average algorithm preference by condition. 


For the open-ended question asking participants why they 
chose their preferred algorithm, we planned to use simple 
thematic coding. While we used the pilot data to create our 
analysis plan, we did remove all pilot data from the final 
dataset. 


4. FINDINGS 


First, we examine which algorithm participants preferred 
overall. Figure 2 shows their average preference in each 
condition, which varied between 5 (i.e., Slightly prefer 
BKT) and 6 (i.e., Moderately prefer BKT). While there 
is a suggestive pattern that providing more explanation 
for BKT strengthens the preference for BKT, this pattern 
was not statistically significant (linear regression: 
F(3, 132) = 0.7455, p = 0.5268). This means the data do not 
support H1. 
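
A regression with the four conditions dummy-coded as predictors yields an omnibus F statistic that is identical to the one-way ANOVA F statistic across conditions. The sketch below computes that statistic for illustration; the function name `omnibus_f` is hypothetical, and any ratings passed to it in an example are made-up values rather than study data.

```python
def omnibus_f(groups):
    """F statistic and degrees of freedom for one categorical predictor.

    `groups` is a list of lists, one list of outcome values per condition.
    """
    all_values = [value for group in groups for value in group]
    n_total, n_groups = len(all_values), len(groups)
    grand_mean = sum(all_values) / n_total

    # Variation explained by condition membership.
    ss_between = sum(
        len(group) * (sum(group) / len(group) - grand_mean) ** 2
        for group in groups
    )
    # Residual variation within each condition.
    ss_within = sum(
        (value - sum(group) / len(group)) ** 2
        for group in groups
        for value in group
    )

    df_between, df_within = n_groups - 1, n_total - n_groups
    f_stat = (ss_between / df_between) / (ss_within / df_within)
    return f_stat, df_between, df_within
```

With four conditions and 136 participants, the degrees of freedom returned are 3 and 132, which matches the F(3, 132) reported above.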


[Figure 3 plot. Lines: confidence, soph, trust, understand, 
acc, fast. Y-axis: Avg. Response (3RR(-) to BKT(+)). X-axis: 
Algorithm Preference Bin (1=Strongly prefer 3RR, 7=Strongly 
prefer BKT), with bins Prefer 3RR/Neither [1,4] (n=31), 
Slightly/Moderately Prefer BKT [5,6] (n=51), and Strongly 
Prefer BKT [7] (n=54).] 


Figure 3: Average response on each measure at three levels 
of preference: Prefer 3RR to Neither, Slightly and 
Moderately Prefer BKT, and Strongly Prefer BKT. We chose 
these groupings because each group represents 
approximately 1/3 of the sample. 
Figure 4: Average differences (BKT score - 3RR score) in significant attitudinal constructs as a function of the randomly 
assigned conditions. Positive scores indicate a higher score for BKT. 


Next, we examine how algorithm preference is related to 
the six attitudinal measures: confidence in the learning 
system, the sophistication of the algorithm, trust, 
understanding, accuracy, and speed. We use the repeated 
measures design of our study by computing the difference score 
for each question: subtracting the participant’s 3RR 
response from the BKT response. Figure 3 shows the 
average response on each measure at three levels of 
preference: Prefer 3RR to Neither, Slightly and Moderately 
Prefer BKT, and Strongly Prefer BKT. We chose these 
groupings because each group represents approximately 1/3 of the 
sample. All measures are positively correlated with 
preference as evidenced by their positive slopes (all Pearson’s 
r > 0.325, p < 0.0001), but accuracy, confidence, and trust 
are correlated more strongly (r > 0.739). This highlights the 
importance of these three constructs in determining people’s 
preference over the algorithms. In fact, the six measures 
explain 66.7% of the variance in preferences (multiple linear 
regression: F(6, 129) = 43.07, p < 0.0001). 
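
The difference scores and their correlation with preference can be sketched as follows. This is an illustration under stated assumptions: the function names are hypothetical, and no study data is reproduced.

```python
from math import sqrt


def difference_scores(bkt_ratings, rr3_ratings):
    """Per-participant difference: BKT rating minus 3RR rating."""
    return [bkt - rr3 for bkt, rr3 in zip(bkt_ratings, rr3_ratings)]


def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length lists."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    covariance = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = sqrt(sum((x - mean_x) ** 2 for x in xs))
    spread_y = sqrt(sum((y - mean_y) ** 2 for y in ys))
    return covariance / (spread_x * spread_y)
```

A positive difference score means the participant rated BKT higher than 3RR on that measure, so a positive `pearson_r` between a measure's difference scores and the 7-point preference item corresponds to the positive slopes described above.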


Lastly, we examine how the provision of verbal and/or visual 
explanations influenced participant attitudes about BKT. 
Figure 4 shows the average response in each condition for 
the four measures that were significantly affected by the 
intervention (i.e., relative understanding and speed did not 
change significantly at p < 0.1). We find that confidence 
in the learning application with BKT (relative to 3RR) 
improved when both a detailed explanation and visualization 
were provided (F(3, 132) = 2.88, p = 0.03844). Likewise, the 
perceived accuracy of BKT improved with both types of 
explanation provided (F(3, 132) = 3.28, p = 0.02305). Trust in 
BKT improved by providing a detailed explanation, 
especially when complemented with the detailed visualization 
(F(3, 132) = 2.346, p = 0.07575). Finally, and not surprisingly, 
the more detail was provided, the more sophisticated BKT 
was perceived to be (F(3, 132) = 11.17, p < 0.001). This 
provides evidence in support of H2. 


In all of our analyses, we tested for the presence of demo- 
graphic heterogeneity in results. However, no significant de- 
mographic sources of variation were found in our sample. 


5. DISCUSSIONS 


This study investigated people’s attitudes towards BKT rel- 
ative to a more straightforward knowledge tracing algorithm 
and tested the effect of additional information via explana- 
tions and visualizations on their attitudes. Understanding 
how students might perceive the algorithms used in their 
learning applications is a crucial issue for the adoption and 
usability of these tools [9]. The results provide evidence sup- 
porting our second hypothesis that additional explanations 
improve key attitudinal measures of confidence, perceived 
accuracy, trust, and sophistication. Qualitative data from 
participants echo this result: 


For something high stake, I’d only trust the 
methods that employs a variety of learning 
modalities. The analytics for such should match 
the complexities of my learning process as well 
as the nature of the material I’m learning. The 
BKT would put me more at ease than the quick 
route of the 3RR approach. (Participant as- 
signed to BKT Explanation and BKT Detailed 
Visualization who had high confidence, sophisti- 
cation, accuracy, and trust in BKT relative to 
3RR) 


Surprisingly, we did not find a significant increase in people’s 
preference for BKT (H1), even though we found that algo- 
rithm preference is explained largely by people’s perceptions 
of accuracy, trust, and confidence. This preference for BKT 
regardless of experimental condition is further explained 
by the qualitative responses from participants: 


I don’t believe the 3RR algorithm is at all benefi- 
cial to the student attempting to learn the topic. 
If the student just happens to get 3 exception- 
ally easy questions in a row, the algorithm will 
assume that the student has learned the topic 
which is likely not entirely true. (Participant as- 
signed to No BKT Explanation and BKT Simple 
Visualization who Strongly Preferred BKT) 
Nevertheless, participants generally preferred to use the 
BKT algorithm regardless of the experimental treatment. 


Since confidence, trust, and accuracy are important to a 
user’s preference for BKT, it was notable that those three 
measures were affected by the experimental manipulations. 
Consistent with prior studies of explainability and trans- 
parency in algorithmic systems, we also found that when 
more information about an algorithm is presented to peo- 
ple, they believe the algorithms to be more trustworthy and 
accurate, leading the user to have more confidence in the 
algorithm [1, 14]. This confidence can, in turn, increase the 
use of applications shown to improve educational outcomes. 
Our results further highlight the importance of OLM to pro- 
vide more transparency to gain user trust and confidence. 
In addition, future educational applications using complex 
knowledge tracing algorithms should include detailed verbal 
and visualization explanations of the algorithm to improve 
confidence and trust in the application. 


Frankly, we did not expect to find such strong support for 
BKT going into this study. Many learning platforms that 
have the capacity to implement BKT or even more complex 
algorithms have opted not to do so. Our informal 
understanding was that this was largely due to concerns that 
users, such as math teachers who use ASSISTments to assign 
homework, would not understand BKT, would not trust it, 
and would lose confidence in the platform. This understanding 
led us to expect that participants would report a preference 
for the simpler algorithm and report that they find it more 
trustworthy and have more confidence in a system that uses 
it. Therefore, we are surprised by our positive findings for 
BKT and propose future research directions to follow up on 
these results. 


Future work in this area should explore different partici- 
pant populations and scenarios. We are planning to run this 
study with student samples to see if we can replicate the re- 
sults. While we asked our participants to put themselves in 
the shoes of a student needing to prepare for a high-stakes 
test, running this experiment on a student population might 
make the scenario more realistic to the participants. Addi- 
tionally, given that many tasks on Mechanical Turk involve 
subjects annotating machine learning datasets, this group 
of participants may have more favorable attitudes to algo- 
rithms. Another area for future work includes changing the 
scenario. We ran the study from the viewpoint of a student. 
However, we acknowledge that intelligent tutoring systems 
deployed in classrooms also need the trust and confidence 
of the teachers administering the applications. We plan to 
rewrite the scenario to allow participants to take on the role 
of a teacher. 
ABSTRACT 

We explore how different components of an Automatic Short 
Answer Grading (ASAG) model affect the model’s ability 
to generalize to questions outside of those used for train- 
ing. For supervised automatic grading models, human rat- 
ings are primarily used as ground truth labels. Producing 
such ratings can be resource heavy, as subject matter 
experts spend vast amounts of time carefully rating a 
sample of responses. Further, it is often the case that 
multiple raters must come to a consensus before a final 
ground-truth rating is established. If ASAG models were 
developed that could generalize to out-of-sample questions, 
educators may be able to quickly add new questions to an 
auto-graded assessment without a continued manual rat- 
ing process. For this project we explore various methods 
for producing vector representations of student responses 
including state-of-the-art representation methods such as 
Sentence-BERT as well as more traditional approaches in- 
cluding Word2Vec and Bag-of-words. We experiment with 
including previously untapped question-related information 
within the model input, such as the question text, 
question context text, scoring rubric information and a 
question-bundle identifier. The out-of-sample generalizability of 
the model is examined with both a leave-one-question-out and 
leave-one-bundle-out evaluation method and compared against 
a typical student-level cross validation. 


Keywords 
ASAG, Assessment, SBERT, Generalizability 


1. INTRODUCTION 


Automatic Short Answer Grading (ASAG) is an emerging 
field of research, as the education community has started 
to embrace the use of technology to assist students and ed- 
ucation professionals. It has been shown that the use of 
open-ended (OE) questions helps facilitate learning [7], but 
educators are often deterred from their use because grading 
requires much more time than that for multiple choice [12]. 
In addition, human ratings may contain bias and vary in con- 
sistency, as rating choices are often subjective [28]. ASAG 
systems may be an important tool for educators, allowing 
more frequent use of OE questions, and more objective rat- 
ings for both formative and summative assessments. 


A key challenge with supervised automatic grading models 
is gathering a large enough sample of labelled data for train- 
ing. While some labeling tasks for supervised learning may 
be straightforward such as identifying an image as either a 
dog or a cat, others such as rating student responses require 
careful consideration. In high-stakes assessment scenarios, 
two or more ratings by different experts are often necessary 
to form a reliable consensus rating. Thus, obtaining labelled 
data to train an ASAG system can be arduous. It follows 
that quickly introducing new questions to an existing system 
may not be feasible if both data collection and the 
meticulous rating of new responses are necessary. Further, a new 
model would have to be trained and tuned with the newly 
collected responses. If we create more generalizable ASAG 
models, educators may have the flexibility to add new ques- 
tions to an existing assessment with very little effort, thus 
increasing the practical use of the ASAG system. 


We hypothesize that the inclusion of extra question related 
information within the model input may improve both the 
classification performance, and the generalizability of the 
model. For the purposes of this project, we formally define 
the generalizability of an ASAG model as the capacity to 
classify responses from out-of-training-sample questions. 


This research contributes to the field of automatic grading in 
three related ways. We focus on classification performance 
and generalizability of the supervised grading model in terms 
of 1) the textual representation type, 2) the content of the 
input and 3) the classification model. We compare three dif- 
ferent representation types, including those of state-of-the- 
art models: Sentence-BERT, Word2Vec, and Bag of Words. 
In terms of input content, we experiment with including 
previously untapped resources relating to the questions in 
the model input. Such resources include a question-bundle 
identifier, the question stem text, question context text, and 
rubric information. Extra input content is vectorized (if the 
source is textual) and concatenated to the response vectors 
to be used as input to the classification model. Finally, 
we compare a non-neural model (a multinomial logistic 
regression) and a simple neural model (a three-layer 
feed-forward network). 


In order to examine the generalizability of the model for each 
experiment, we use a leave-one-question-out evaluation pro- 
cedure where we train the model on N-1 questions, and use 
the one left-out question data as our test set. Thus, during 
training, the model has not yet seen responses from the 
question that we use solely to evaluate the model. We go 
one step further in testing the generalizability of the model 
with a leave-one-bundle-out evaluation procedure where we 
train the model on M-1 question bundles (groupings of ques- 
tions that are related in context), and use the questions for 
the left-out bundle as the test set. We conceptualize the 
leave-one-bundle-out method as a more extreme test of the 
model’s ability to classify out of sample questions because 
even questions that are related in context have not been seen 
by the model during training. Additionally, we compare 
results of our experiments against a typical student-level 
cross validation. Results from a majority class classifier are 
included as well for a baseline comparison. 


2. RELATED WORK 


This section outlines notable work relating to ASAG and the 
more general use of NLP for education. For this project, 
we build on the previous literature by considering lessons 
learned in preceding research, and employing novel approaches 
that, to our knowledge, have not yet been explored. 


2.1 ASAG work 


A systematic review of trends in ASAG [3] illustrates an 
increasing interest in the field of automatic grading for ed- 
ucation. Unsupervised methods have been explored such 
as concept mapping, semantic similarity, and clustering to 
assign ratings. For example, Mohler and Mihalcea [19] com- 
pared knowledge based and corpus based semantic similarity 
measures for automatic grading, Klein et al. [14] imple- 
mented a latent semantic analysis approach, and Basu et 
al. [2] used clustering to provide rich feedback to groups 
of similar responses. In addition, many types of supervised 
classification methods have been utilized for ASAG. 
Notable examples include Hou and Tsao [13] who incorporated 
POS tags and term frequency with a Support Vector Ma- 
chine classifier, and Madnani et al. [16] who made use of 
simple features such as a count of commonly used words 
and length of response with a logistic regression classifier. 


More recent ASAG research exploits deep learning methods. 
Notable work includes Zhang et al. [31] who used a 
combination of feature engineering and deep belief networks, Liu 
et al. [15] who employed multi-way attention networks, and 
Yang et al. [30] who considered a deep autoencoder model 
specific to Chinese responses. Additionally, Qi et al. [21] 
created a hierarchical word-sentence model with a CNN and 
Bi-LSTM model and Tan et al. [25] explored the use of a 
graph convolutional network (GCN) to encode a graph of all 
student responses. 


Further, much of the newest ASAG work makes use of state- 
of-the-art transformer based models, including Gaddipati et 
al. [11] who evaluated four different types of response 
embeddings, ELMo, GPT, BERT, and GPT-2 for their 
performance on an ASAG task, Camus and Filighera [4] who 
compared the performance of transformer models for ASAG 
in terms of the size of the transformer and the ability to 
generalize to other languages, and Sung et al. [24][23] who 
examined the effectiveness of pre-training BERT, including 
further pre-training the model on relevant domain texts. 


2.2 NLP for Education 


Literature addressing the general application of natural 
language processing (NLP) for various uses in the field of 
education has grown quickly in recent years as well. For example, 
Fonseca et al. [10] used NLP to automatically classify the 
programming assignments for students within a given academic 
context, Thaker et al. [26] incorporated textual similarity 
techniques to recommend remedial readings to students, and 
Arthurs and Alvero [1] examined bias in word representa- 
tions for college admissions essays. Additionally, Xiao et al. 
[29] employed NLP and transfer learning methods for prob- 
lem detection in peer assessments, Venant and d’Aquin [27] 
utilized a concept graph to predict semantic complexity of 
short essays written by English language learners, and 
Chen et al. [6] leveraged a variety of textual analysis meth- 
ods to predict student satisfaction in the context of online 
tutorial dialogues. 


We build on the previous literature by incorporating 
state-of-the-art representation methods such as Sentence-BERT 
and a neural classification model. The novel contribution 
of this project includes both our focus on the generalizabil- 
ity of the model to out-of-training-sample questions, as well 
as the leveraging of previously untapped, question related 
information as input to the model. 


3. DATA SET 

The data we will use for this project was sourced from a 
2019 field test of a Critical Reasoning for College Readiness 
(CR4CR) assessment [17] created at the Berkeley Evaluation 
and Assessment Research (BEAR) center. The data consists 
of 5,550 student responses from 558 distinct students to 33 
different items. The field test included other items that were 
multiple choice, but these questions were filtered out of the 
data for our use. The mean number of responses per ques- 
tion is 179 with the minimum being 128 and the maximum 
being 313. Most of the items belong to an item bundle - a 
grouping of items that are related in context and/or share 
a common question context. Additionally, the items were 
administered in four different test forms, where some items 
were included in multiple forms. The items all relate to four 
constructs about student understanding of algebra. 


An example of one of the items, labelled ‘Crude Oil 4ab’ is 
included in Figure 1. For this item, students are presented 
with two images relating to oil production - one being a line 
graph and the other, a table. With the given context, 
students are asked to choose whether the graph or the table 
would better represent the historical patterns, or change 
over time, of the oil production. The 
correct answer for this question is the graph, and students 
are expected to provide reasons why this is the right choice. 


An example of a student response to the Crude Oil 4ab 
question (shown in Figure 1) rated at the highest (most correct) 
score category is shown: “the graph easily displays patterns 
over time whereas the table and equations require more 
analyzing.” In contrast, a student response to the same question 
rated at the lowest (most incorrect) category is shown: “the 
table is more clear, the information is seen in the table.” 


The following graph and table show annual crude oil production in million tonnes of oil equivalent 
(mtoe) from 1960 to 2014. 

[Graph: Crude Oil Production: 1960-2014; x-axis: Year] 

[a] For a group project, you and your classmates have to present the overall historical patterns of 
annual oil production as a poster. Due to limited space, only one of the following representations 
can be included in the poster: a table, a graph, or a set of equations. Which representation should 
you use? 

[b] Briefly explain your answer choice in [a]. 


Figure 1: An example of an item from the data set. 


Responses were rated from 0 (fully incorrect) to 4 (fully 
correct) by multiple researchers and subject-matter experts 
at the BEAR center. The quality and consistency of ratings 
were evaluated by an inter-rater reliability score, and when a 
high percentage of rating mismatches between raters existed, 
incongruous ratings were discussed until a consensus was 
reached by the raters. 


4. METHODS 


In this section, we briefly introduce the representation 
methods and model classes that we include in our experiments. 
Additionally, we describe the question-related information, 
beyond that of the responses, that is used as input to 
the model. Further, we outline our experiments in detail 
including methods for evaluation and comparison. 


4.1 Input Representations 

We chose to include three distinct, yet commonly used, rep- 
resentation types in our experiments: A count-based method 
that elicits the distinct vocabulary of our data (Bag of Words), 
a simple neural method that utilizes pre-trained word vec- 
tors (Word2Vec), and a state-of-the-art, contextual neural 
method (Sentence-BERT). A short description of each rep- 
resentation type is included. 


4.1.1 Sentence-BERT 

Sentence-BERT (SBERT) is a modification of the BERT 
network that utilizes siamese and triplet network structures 
to create semantically meaningful sentence embeddings [22]. 
SBERT fine-tunes the BERT network on a combination of 
the SNLI dataset and the Multi-Genre NLI datasets, to- 
taling about 1 million sentence pairs. Although sentence 
embeddings can be derived from the original BERT model 
using methods such as averaging the BERT output layers or 
using the [CLS] token embedding, it has been shown that 
such methods yield poor sentence embeddings [20]. In 
comparison, SBERT sentence embeddings outperformed other 
state-of-the-art methods such as InferSent [8] and Universal 
Sentence Encoder [5] on the SentEval [8] benchmark, which 
gives an idea of the quality of sentence embeddings for 
various tasks. 


4.1.2 Word2Vec 

Word2Vec (W2V) is a neural model that creates vector 
representations of words that have been shown to be 
semantically meaningful and useful in different NLP tasks 
[22]. We use an extension of the previously introduced 
Skip-gram model [18] that incorporates sub-sampling of frequent 
words during training in order to speed up training and 
improve the accuracy of representations of less frequent words. 
For this project, we use the Google News corpus of pre- 
trained word embeddings. Vectors of size 300 are created 
for each word, and in order to construct response embed- 
dings from the individual word vectors, we employ a simple 
but popular method: averaging the vectors of all words in 
the response. 
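
The averaging step can be sketched as below. This is a minimal illustration, assuming whitespace-tokenized input and a plain dictionary of word vectors; the toy 3-dimensional vectors stand in for the 300-dimensional pre-trained embeddings, and skipping out-of-vocabulary words is an assumption rather than something stated in the text.

```python
def average_embedding(tokens, word_vectors, dim):
    """Average the vectors of all in-vocabulary tokens in a response."""
    vectors = [word_vectors[token] for token in tokens if token in word_vectors]
    if not vectors:
        # No known words: fall back to a zero vector (an assumption).
        return [0.0] * dim
    return [sum(vec[i] for vec in vectors) / len(vectors) for i in range(dim)]


# Hypothetical toy embeddings; real Word2Vec vectors have 300 dimensions.
toy_vectors = {
    "graph": [1.0, 0.0, 2.0],
    "table": [3.0, 2.0, 0.0],
}
response_vector = average_embedding(["the", "graph", "table"], toy_vectors, dim=3)
```

Averaging gives every response a fixed-length vector regardless of its word count, at the cost of discarding word order.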


4.1.3 Bag-of-words 

The bag-of-words (BOW) model represents a document as 
a vector, or “bag,” of length equal to the number of unique 
words in the entire corpus and values of the vector equal to 
the frequency with which its corresponding word occurred 
in the document. In our application, the bags are student 
short answers and additional question information. 
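
As a minimal sketch of this representation, assuming lowercasing and whitespace tokenization (the preprocessing actually used is not described here), the helper names below are hypothetical:

```python
from collections import Counter


def build_vocabulary(documents):
    """Sorted list of the unique words across the whole corpus."""
    return sorted({word for doc in documents for word in doc.lower().split()})


def bag_of_words(document, vocabulary):
    """Count vector with one entry per vocabulary word."""
    counts = Counter(document.lower().split())
    return [counts[word] for word in vocabulary]


corpus = ["the graph shows the trend", "the table is clear"]
vocabulary = build_vocabulary(corpus)
first_vector = bag_of_words(corpus[0], vocabulary)
```

Each vector has the same length as the vocabulary, so the dimensionality of the input grows with the number of distinct words in the responses and any added question information.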


4.2 Input Content 

In an item design context, there are various untapped sources 
of information relating to a particular question that may be 
useful to include as input to a classification model. We ex- 
plore the use of four different sources of information, outside 
that of the response itself. A brief description of each 
source is included below. 


4.2.1 Question Text 

The question text consists of the direct question stem. As 
in the example item provided in Figure 1, the question text 
would be: “Briefly explain your answer choice in [a].” 


4.2.2 Question Context 

The question context includes any textual information be- 
yond that of the question stem that is related to the ques- 
tion, and might be useful for the respondent to produce a 
response. In the question example in Figure 1, the question 
context would include: “The following graph and table show 
annual crude oil production in million tonnes of oil equiv- 
alent (mtoe) from 1960 to 2014. For a group project, you 
and your classmates have to present the overall historical 
patterns of annual oil production as a poster. Due to lim- 
ited space, only one of the following representations can be 
included in the poster: a table, a graph, or a set of equa- 
tions. Which representation should you use?” We note that 
not all of the included items have question context text be- 
yond that of the question text itself. For such items, it was 
not possible to include question context as part of the input. 


4.2.3 Rubric Text 
Level | Description | Example Response 
4 | Student provides a fully correct positive and negative justification for ... | "The graph best illustrates the visual trend of oil production. The table or the equation won't provide an overall ..." 
3 | Student provides a fully correct positive justification for selected representation ... | "The table cannot show a trend in the oil production effectively." 
2 | Student provides a partial/general positive justification for selected ... | "The graph helped me with more questions." 
1 | Student provides incorrect justification for selected representation. | "Because the graph gives us better projections." 
0 | Student makes no attempt to provide a justification for selected representation. | "I am not sure why." 


Figure 2: An example of a scoring rubric corresponding to 
the item in Figure 1. 


As part of the assessment cycle as previously mentioned, 
an important step in a measurement process is defining the 
outcome space for each item through a scoring guide, which 
is essentially a rubric. The scoring guide includes a detailed 
description of the reasons for which a response would be 
rated at a certain score, and is used as a guide for human 
raters. In addition, the scoring guide often includes example 
responses for each rating level. An example of a scoring 
guide for the ‘Crude Oil 4ab’ item in Figure 1 is 
shown in Figure 2. When the rubric text is included in the 
input text, we include both the level descriptions as well as 
the example response(s). 


4.2.4 Bundle Identifier 


As described above in the data section, most of the ques- 
tions belong to a bundle of questions - those that are linked 
based on a similar image, or context text. As a question 
bundle identifier, we concatenate a one-hot vector to the in- 
put vectors. Although items within the same bundle will 
often share the same context text, we include the one-hot 
bundle identifier within our extra input text experiments so 
that we can infer whether the model makes use of semantics 
within the context text, or rather just a general indication 
of similar questions. 
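
The concatenation can be sketched as follows. This is an illustration only: the function names and bundle labels are hypothetical, and the ordering of the concatenated parts simply follows the order in which the inputs are listed in this paper.

```python
def one_hot(index, size):
    """Vector of zeros with a single 1.0 at the given index."""
    return [1.0 if i == index else 0.0 for i in range(size)]


def build_input(response_vec, bundle_id, bundle_ids, extra_vecs=()):
    """Concatenate bundle one-hot, optional extra vectors, and the response."""
    vector = one_hot(bundle_ids.index(bundle_id), len(bundle_ids))
    for extra in extra_vecs:
        # For example, vectorized question, context, or rubric text.
        vector.extend(extra)
    vector.extend(response_vec)
    return vector


bundles = ["crude_oil", "other_bundle"]  # hypothetical bundle labels
model_input = build_input([0.5, 0.25], "crude_oil", bundles, extra_vecs=[[0.1]])
```

Because the one-hot part carries no text semantics, comparing it against the context-text input separates the effect of knowing which questions are related from the effect of the context wording itself.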


4.3 Classification Models 


We compare a multinomial logistic regression model with a 
simple neural network classification model. We chose these 
classification methods to represent a linear transformation 
of the feature space to a label (regression) and a non-linear 
transformation (neural network). A brief overview of each 
model is included. 


Multinomial logistic regression (MLR) is a classification model 
that predicts probabilities of different outcomes for a cate- 
gorical dependent variable. In order to generalize to a K- 
class setting, the model runs K-1 independent binary lo- 
gistic regression models where one outcome is chosen as 
a “pivot” and other K-1 outcomes are separately regressed 
against the pivot outcome. We use the Limited-memory 
Broyden-Fletcher-Goldfarb-Shanno (L-BFGS) algorithm for 
optimization [9], and incorporate L2 regularization. 
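
To illustrate how class probabilities follow from the pivot formulation, the sketch below turns K-1 coefficient vectors into K probabilities. It shows prediction only, with hypothetical coefficients; fitting the coefficients (for example with L-BFGS and an L2 penalty) is not shown.

```python
from math import exp


def pivot_probabilities(features, coefficients):
    """Class probabilities from K-1 coefficient vectors.

    Each coefficient vector scores one non-pivot class against the pivot.
    The pivot class is returned last. A bias term can be included by
    prepending a constant 1.0 to `features`.
    """
    scores = [
        exp(sum(w * x for w, x in zip(weights, features)))
        for weights in coefficients
    ]
    denominator = 1.0 + sum(scores)
    return [score / denominator for score in scores] + [1.0 / denominator]
```

With five rating levels (0 to 4), `coefficients` would hold four vectors, and the returned list would hold five probabilities that sum to one.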


Additionally, we use a simple feed-forward neural network 
trained with a categorical cross-entropy loss function, with 
2 hidden layers of size 100 and rectified linear unit (ReLU) 
activation functions for both hidden layers. We include 
dropout of 0.4 and utilize Adam optimization. We train the 
model for 16 epochs and use a batch size of 36. 


4.4 Evaluation and Model Comparison 
In this section, we enumerate our experiments and the eval- 
uation methods chosen for comparison. 


4.4.1 Experiments 

As input to our model, we experiment with 8 different com- 
binations of content to vectorize and concatenate to the re- 
sponse vectors before training our classification models: 


bundle one-hot + response 

question + scoring rubric + response 

bundle one-hot + question + scoring rubric + response 

4) scoring rubric + response 

8) bundle one-hot + question + question context + scoring rubric + response 


For each of the 8 combinations listed above, we create three 
different vector representations with the aforementioned meth- 
ods: 1) Sentence-BERT, 2) Word2Vec, and 3) Bag-of-words, 
resulting in 24 distinct input types. We fit a classification 
model for each of the input types, for both of our classifica- 
tion models. Thus, we compare 48 separate versions of an 
ASAG model with three types of evaluation. 
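
The experiment count can be checked with a short enumeration. The combination names below are placeholders for the 8 input combinations rather than their actual contents.

```python
from itertools import product

# Placeholder names standing in for the 8 input combinations.
combinations = [f"combination_{i}" for i in range(1, 9)]
representations = ["Sentence-BERT", "Word2Vec", "Bag-of-words"]
classifiers = ["multinomial logistic regression", "neural network"]

# One input type per (combination, representation) pair.
input_types = list(product(combinations, representations))

# One ASAG model version per (combination, representation, classifier) triple.
experiments = list(product(combinations, representations, classifiers))
```

Here `len(input_types)` is 24 and `len(experiments)` is 48, matching the counts stated above.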


4.4.2 Leave-one-question/bundle-out Evaluation 

In order to assess the generalizability of the ASAG model 
to out-of-training-sample questions for each of the 48 exper- 
iments, we average the results of N (where N is the number 
of questions) independent models. For each of the N models, 
we train the classifier on data from N-1 questions, and test 
on data exclusive to the left-out-of-training question. In the 
case of the leave-one-question-out results, it is important to 
note that although the model has not seen data specific to 
the left-out question, it has seen questions that are part of 
the same question bundle and are therefore related. 


To expand our evaluation of generalizability further, we in- 
clude a leave-one-bundle-out metric for each experiment. 
For this metric, we average the results from M (where M is the
number of bundles) independent models where we train the 
classifier on data from M-1 bundles, and test on data exclu- 
sive to the left-out-of-training questions which belong to a 
single bundle. So, these results give us an idea of whether the 
model can successfully rate responses from questions that 
have not been used for training, and when the model has 
not seen questions related by context during training. 
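
Both holdout schemes follow the same pattern and differ only in the grouping variable (question or bundle). The sketch below illustrates the procedure on toy data; the function names and the data are hypothetical, and a majority-class predictor stands in for the actual classifiers.

```python
from collections import Counter


def leave_one_group_out(groups):
    """Yield (held_out, train_idx, test_idx) once per distinct group label."""
    for held_out in sorted(set(groups)):
        train_idx = [i for i, g in enumerate(groups) if g != held_out]
        test_idx = [i for i, g in enumerate(groups) if g == held_out]
        yield held_out, train_idx, test_idx


def majority_label(labels):
    # Stand-in "model": always predicts the most frequent training label.
    return Counter(labels).most_common(1)[0][0]


def mean_held_out_accuracy(labels, groups):
    """Average the per-group accuracies of the independently trained models."""
    scores = []
    for _, train_idx, test_idx in leave_one_group_out(groups):
        prediction = majority_label([labels[i] for i in train_idx])
        correct = sum(labels[i] == prediction for i in test_idx)
        scores.append(correct / len(test_idx))
    return sum(scores) / len(scores)


# Toy data: 8 responses, 4 questions, 2 bundles (all values are made up).
labels = [0, 1, 1, 2, 1, 0, 1, 1]
question_ids = ["q1", "q1", "q2", "q2", "q3", "q3", "q4", "q4"]
bundle_ids = ["b1", "b1", "b1", "b1", "b2", "b2", "b2", "b2"]

question_holdout = mean_held_out_accuracy(labels, question_ids)
bundle_holdout = mean_held_out_accuracy(labels, bundle_ids)
```

Passing `question_ids` yields N models (one per question), while passing `bundle_ids` yields M models (one per bundle), so that no question from the held-out bundle is seen during training.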


4.4.3 Evaluation Metrics 

We report our results in both multilabel accuracy and weighted
F1 score: multilabel accuracy is both widely used and
easy to interpret, and the weighted F1 score captures both
precision and recall and accounts for class imbalance.


Table 1: Experiment Results: Multilabel Accuracy

Response Text | Bundle ID | Question Text | Context Text | Rubric Text | Random Holdout: SBERT W2V BOW | Question Holdout: SBERT W2V BOW | Bundle Holdout: SBERT W2V BOW | Average (ACC) Average (WF1)

Majority Class 0.34715 | 0.37044 | 0.30521 | 0.34093 0.25429
LogReg x 0.58034 0.49765 0.54665 | 0.35870 0.35338 0.36517 | 0.31806 0.25316 0.26995 | 0.39367 0.38323
LogReg x x 0.59603 0.53154 0.55602 | 0.37225 0.37595 0.36097 | 0.32855 0.25784 0.28364 | 0.40698 0.39513
LogReg x x 0.63062 0.55748 0.58702 | 0.40378 0.41290 0.32506 | 0.36276 0.30162 0.25982 | 0.42678 0.40317
LogReg x x 0.62972 0.56036 0.58486 | 0.40352 0.41815 0.33053 | 0.34539 0.31713 0.27005 | 0.42886 0.40411
LogReg x x 0.61657 0.55982 0.58846 | 0.38488 0.38379 0.42198 | 0.28175 0.23273 0.31443 | 0.42049 0.39944
LogReg x x x 0.61818 0.56432 0.59152 | 0.40967 0.38968 0.37571 | 0.34737 0.26432 0.30380 | 0.42940 0.40213
LogReg x x x x 0.62484 0.56144 0.59261 | 0.41534 0.40174 0.36048 | 0.32196 0.25499 0.28627 | 0.42441 0.40195
LogReg x x x x x 0.61406 0.55963 0.59009 | 0.41494 0.39999 0.36104 | 0.31797 0.26117 0.29534 | 0.42380 0.39920
NN x 0.60249 0.56450 0.60973 29 0.36930 0.25760 0.26633 | 0.40922 0.40709
NN x x 0.60540 0.59802 0.61963 0.35526 0.23819 0.28716 | 0.42006 0.41686
NN x x 0.65800 0.61009 0.34075 328 0.27179 | 0.44411 0.42600
NN x x 0.66070 0.61388 0.39309 0.33176 0.28206 | 0.44500 0.42470
NN x x 0.64017 0.61226 0.36721 0.35595 0.40597 .2338 0.33769 | 0.43395 0.41403
NN x x x 0.62503 0.61009 0.62198 | 0.411387 0.39257 0.33095 0.25618 0.31157 | 0.43394 0.41456
NN x x x x 0.63206 0.59981 0.62938 | 0.40885 0.36204 0.39455 0.26301 0.32980 | 0.44276 0.42075
NN x x x x x 0.60557 0.59405 0.61478 | 0.42270 0.39130 0.35288 0.27360 0.31397 | 0.43002 0.40366
Average (ACC) 0.62124 0.57468 0.60452 | 0.39500 0.38157 0.36140 | 0.33223 0.26919 0.29273
Average (WF1) 0.60830 0.55063 0.59135 | 0.37680 0.35320 0.33350 | 0.31472 0.24976 0.28697


Multilabel accuracy represents the degree to which our model
classifications agree with the ground truth labels (for this
project, human ratings). It is calculated simply as the number
of correct predictions divided by the total
number of examples. The F1 score for a given class is
the harmonic mean of its precision and recall, where precision
is calculated as true positives divided by the sum of true positives
and false positives, and recall is calculated as true positives
divided by the sum of true positives and false negatives. In order to
account for class imbalance, we specifically use the weighted
F1 score. This metric calculates the F1 score for each class
independently, and the overall score for all the classes is the
average weighted by class size.
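
The two metrics can be written out directly from these definitions. The sketch below is a plain-Python illustration with made-up labels, not the code used for our experiments; class size is taken to be the number of ground-truth examples of each class, which is an assumption about the weighting.

```python
from collections import Counter


def accuracy(y_true, y_pred):
    """Number of correct predictions divided by the total number of examples."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def weighted_f1(y_true, y_pred):
    """Per-class F1 averaged with weights equal to each class's true count."""
    support = Counter(y_true)
    total = 0.0
    for cls, n in support.items():
        pairs = list(zip(y_true, y_pred))
        tp = sum(t == cls and p == cls for t, p in pairs)
        fp = sum(t != cls and p == cls for t, p in pairs)
        fn = sum(t == cls and p != cls for t, p in pairs)
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        if precision + recall:
            f1 = 2 * precision * recall / (precision + recall)
        else:
            f1 = 0.0
        total += n * f1
    return total / len(y_true)


# Made-up ground-truth ratings and predictions for six responses.
y_true = [0, 0, 1, 1, 1, 2]
y_pred = [0, 1, 1, 1, 0, 2]

acc = accuracy(y_true, y_pred)
wf1 = weighted_f1(y_true, y_pred)
```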


5. RESULTS 


Results of our experiments are detailed in Table 1, reported 
in multilabel accuracy. For column and row averages, the 
weighted F1 score is presented as well. In the left-most half 
of the table, an x is present for a given row if the informa- 
tion type, indicated by the column header, is included in the 
model input. For example, results in the first row represent 
an input of only the response text and results in the sec- 
ond row represent an input of both the Bundle ID and the 
response. Additionally, the results in the top half of the table are
those from the multinomial logistic regression classifier, and
the results in the bottom half are those for the neural
network classifier (as indicated by the leftmost column). For
each of our evaluation methods, random holdout, question 
holdout, and bundle holdout, we present results for the three 
textual representation methods: SBERT, W2V, and BOW. 


In terms of the general performance of our classification 
models, we consider the random holdout evaluation method. 
Overall, SBERT representations performed best when aver- 
aging across the classification methods and input combina- 
tions, followed by the BOW representations (accuracy of 
0.621 for SBERT compared to 0.575 and 0.605 for W2V 
and BOW, respectively). Additionally, the neural network 
achieves higher accuracy than the logistic regression in general.
Both SBERT and BOW perform notably well when the
input includes the question text or the question context.


To assess the generalizability to grading answers to questions
unseen in the training set, we focus on the question
holdout and bundle holdout results. Across the board, we
see much lower accuracy for the question and bundle holdout
experiments than for the random holdout, with the bundle
holdout being the lowest. This is in line with what one
might expect because, for the question holdout, the model
has not seen responses for the particular question in the
test set and, for the bundle holdout, the model has not seen
questions even related to the test set question.


Similar to the random holdout experiments, we see the same 
overall pattern for the question and bundle holdout experi- 
ments: SBERT is generally superior, followed by BOW and 
W2V, respectively. One notable difference for the question 
holdout experiments compared to those of the random hold- 
out is that we see increased performance when we include 
multiple extra sources of information. For example, with 
SBERT and bundle holdout, we achieve 0.365 accuracy with 
the neural network classifier when we include the rubric text, 
question text, and bundle ID. One possible explanation is that,
when the model lacks information about the
test question from training, extra input information may
provide guidance for the model.


For the question and bundle holdout experiments, the addition
of the rubric text improves performance particularly
well with the use of BOW representations, for both the logistic
regression and neural network classifiers. With SBERT,
the addition of the question text seemed to help the general- 
izability of the model as well. Interestingly, we do not see the 
same pattern between the classification models for the ques- 
tion and bundle holdout methods: where the neural net was 
clearly superior in the random holdout experiments, results 
are more similar between the logistic regression and neural 
network for the bundle and question holdout experiments. 


We see from the row averages in the rightmost columns of
the table that across all experiments and text representa- 
tion types, the response and question text, as well as the 
response and context text achieve the highest evaluation 
scores. Additionally, the column averages further confirm 
that the SBERT representations perform best. 


Further, we include results from a majority class classifier
in the top row as a baseline comparison. We emphasize
that, across all random holdout experiments, the classification
models substantially outperform the majority class classifier.
However, this is not the case for the question
holdout and bundle holdout experiments. For the question
holdout experiments, many of the SBERT experiments outperformed
the majority class classifier, but on average, the
W2V and BOW experiments performed a bit worse than
majority class. For the bundle holdout experiments, on average,
SBERT performs slightly better than majority class,
but the W2V and BOW experiments do not.


Figure 3: 2D visuals of input vector representations. Plots vary by representation type, input content, and color labeling.


Further interpretations can be made from Figure 3, which
shows two-dimensional vector representations reduced with t-distributed
Stochastic Neighbor Embedding (t-SNE). The
center image includes SBERT embeddings of only the re- 
sponse text and the colors represent the ground truth rat- 
ings. From the top left and right images, as well as the 
bottom left image we can see very distinct question clus- 
ters with the inclusion of question text. However, from the 
bottom right image, we can visualize question specific clus- 
ters, but they are not as distinct as the representations that 
include the question text. 


Thus, we conclude that in terms of question holdout, the
model can generalize to out-of-sample questions with only
slight improvement over majority class, with a state-of-the-art
representation method like SBERT. Certain extra pieces
of input information, such as question text, aid our models more than others,
and the best performing models use the neural
network classifier.


6. DISCUSSION 


Although our results are not promising for the generalizabil- 
ity of autograding models to unseen questions, we emphasize 
the importance of finding more generalizable models to decrease
time spent on the laborious task of creating ground-truth
human ratings. Our intention is that this work will
influence researchers to consider further innovative methods 
to increase the generalizability of ASAG models. Further, 
because we did find that including certain question-related 
text may improve model performance, it may be of use to 
the ASAG research community to continue to explore how 
extra sources of information about a question may be incor- 
porated into an ASAG system. 


As is evident in our literature review, there has been in- 
creased adoption of state-of-the-art textual representation 
methods such as SBERT, and transformer-based models such 
as BERT and XLNet, within the field of NLP in Education. 
Our results support that such models may achieve superior 
performance for certain tasks. 


To build on this work further, we could consider other meth- 
ods, beyond that of concatenation to the input text, to in- 
clude the extra question information in our model. We could 
further pre-train a transformer-based model such as BERT 
or XLNet with the extra textual information by either tun- 
ing the existing weights or altering the existing architec- 
ture with an extra encoder layer of weights trained on our 
text alone. Moreover, we may focus more closely on how 
the classification model itself might be altered such that it 
might better generalize to out-of-training-sample questions, 
instead of only focusing on the input content. 


We believe that beyond model performance, the practical 
utility of an ASAG system must be considered in order for 
educators to continue to adopt new technologies that employ 
advanced methods in artificial intelligence. Recent years 
have seen vast improvements in the field of machine learning 
and language processing. Embracing such technologies for 
applications in education may be pivotal to provide the as- 
sistance that both educators and learners need. However, we 
do not suggest that machine learning systems such as ASAG 
should be used to replace human judgements in education, 
especially in high-stakes testing scenarios. We emphasize
that ASAG systems should be used to support educators,
not replace them. This project represents a continued ef- 
fort to explore the ways in which we can make use of new 
technologies to improve learning. 
ABSTRACT


Sentiment Analysis is a field of Natural Language Process- 
ing which aims at classifying the author’s sentiment in text. 
This paper first describes a sentiment analysis model for stu- 
dents’ comments about professor performance. The model 
achieved impressive results for comments collected from stu- 
dent surveys conducted at a private university in 2019/20. 
Then, it applies the model to different scenarios: (i) in- 
person classes taught in 2019 (pre-COVID); (ii) the emer- 
gency shift to online, synchronous classes taught in the first 
semester of 2020 (early-COVID); and (iii) the planned online 
classes taught in the second semester of 2020 (late-COVID). 
The results show that students acknowledged the effort professors
made to keep classes running during the first semester
of 2020, and that the enthusiasm continued throughout the
second semester. Furthermore, the results show that stu- 
dents evaluated professors’ performance for online courses 
better than for in-person courses. 


Keywords 


sentiment analysis, BERT, online classes, in-person classes 


1. INTRODUCTION 


The systematic evaluation of a Higher Education Institu- 
tion (HEI) provides its administration with valuable feed- 
back about several aspects of academic life, such as the rep- 
utation of the institution and the individual performance of 
faculty. In fact, in some countries, it is mandatory that HEIs 
implement self-evaluation committees, whose members are 
elected by the various segments of the community and whose 
duties include the preparation of annual reports assessing
the performance of the institution on predefined aspects. 


In particular, student surveys are a first-hand source of in- 
formation that help assess professor performance and course 
adequacy. Such surveys are typically organized as a ques- 
tionnaire with closed-ended questions, which the student an- 
swers by choosing predefined alternatives, and open-ended 
questions, which the student answers by freely writing com- 
ments on the topic of the question. Albeit interesting and 
useful, the analysis of open-ended questions poses challenges, 
such as how to summarize the comments and how to deter- 
mine the sentiment of the comments. 


The primary goal of this paper is to introduce a sentiment 
analysis model for students’ comments in the context of 
questionnaires designed to assess professor performance, and 
to evaluate the model using data from student surveys
administered at a Brazilian University in 2019 and 2020.


Studying this particular period of time is interesting be- 
cause, in early 2020, the COVID-19 pandemic forced the 
Brazilian University to move all classes online, taught with
the help of videoconferencing software and a Learning
Management System (LMS), and they remained online throughout
2020. This change in instructional model offers the
unique opportunity to compare the in-person classes in 2019
(pre-COVID scenario) with the emergency shift to online,
synchronous classes in the first semester of 2020 (early-COVID
scenario), and with the planned online classes in the second
semester of 2020 (late-COVID scenario). Therefore, the
second goal of this paper is to apply the sentiment analy- 
sis model developed to the case study data to compare the 
overall sentiment of the students’ comments about professor 
performance in these different scenarios. 


The results reported in this paper indicate that the senti- 
ment analysis model developed achieves good performance 
in the classification of the sentiment expressed by the stu- 
dents’ comments about professor performance. This model 
was separately applied to the different scenarios covered by 
the case study data. The results show that students acknowledged
the effort professors made to keep classes running
during the first semester of 2020 (early-COVID scenario),
and that the enthusiasm continued throughout the
second semester of 2020 (late-COVID scenario). These con- 
clusions are justified by the peak in positive comments ob- 
served in the first semester of 2020, as compared with the 
other semesters. Furthermore, the results show that stu- 
dents evaluated professor performance for online classes bet- 
ter than for in-person classes. To a large extent, these re- 
marks are consistent, for example, with the findings of a 
random-sample survey, conducted in late May 2020, involv- 
ing more than 1,000 US college students whose classes moved 
from in-person to completely online in early 2020 [13]. How- 
ever, they have to be cross-checked with other surveys con- 
ducted in 2019/2020 at the Brazilian University and else- 
where. 


The remainder of this paper is organized as follows. Section 
2 summarizes related work. Section 3 presents the case study 
used in the paper. Section 4 details the model for sentiment 
analysis. Section 5 describes the results obtained with the 
case study data. Section 6 contains the conclusions and 
directions for future work. 


2. RELATED WORK 


Sentiment Analysis (SA), also known as Opinion Mining, is 
a field of natural language processing (NLP) where the main 
focus is to automatically analyze people’s opinions and sen- 
timents [11]. According to Pang and Lee [15], for most of us, 
the decision-making process takes into consideration “what 
other people think”. Based on this assertion, it is easy to 
understand why SA is very popular in several domains, such 
as tourism, restaurants, movies, music, and, more recently, 
education. 


Chaturvedi et al. [5] addressed the essential task of elimi- 
nating “real” or “neutral” comments that do not express a 
sentiment. The article reviewed hand-crafted and automatic 
models for detecting subjectivity in the literature, comparing
the advantages and limitations of each approach. Ahuja
et al. [1] addressed the analysis of comments from Twitter, one of
the most popular platforms. As the comments are
not structured, they used six techniques to pre-process the
comments. They then applied two techniques (TF-IDF and
N-grams) to classify comments, and concluded that sentiment analysis
with word-level TF-IDF features performs 3-4% better than with
N-gram features. Prusa et al. [17] also concentrated
on Twitter data. They analysed the impact of ten filter- 
based feature selection techniques on the performance of four 
classifiers. Nazare et al. [14] analyzed about 1,000 Twitter 
comments using various machine learning approaches, separately
or in combination, to classify the comments. Li and Qiu [10]
observed that traditional approaches to analyzing the sentiment
of short texts do not consider the
relationship between emotion words and modifiers, and
showed how to mitigate this problem through sentiment
structures and rules that capture the text sentiment.
The results of an experiment with microblogs validated the 
efficacy of their approach. 


Analyzing comments from sales Web sites is important to de- 
tect if users are praising or criticizing the products they con- 
sume. Bansal and Srivastava [4] used the word2vec model to 
convert comments into vector representations using CBOW 
(continuous bag of words), which were fed to a classifier.
Experimental results showed that Random Forests using
CBOW achieved the highest precision. Khoo and Johnkhan 
[9] analysed comments from the Amazon Web site, using 
a new general-purpose sentiment lexicon, called WKWSCI 
Sentiment Lexicon, and compared it with five existing lexicons.
Akhtar et al. [2] used classification algorithms, like
Conditional Random Field (CRF) and Support Vector Machine
(SVM), to classify comments from different Indian
Web sites.


Zhou and Ye [22] reviewed journal publications between
2010 and 2020 on SA applied to the education domain and, among
other future research directions, pointed out the need to: (i)
explore SA in the learning cross-domain; (ii) combine
text mining with qualitative answers
(questionnaires or interviews) to understand the psychological
motivation behind learning sentiment; and (iii) explore the
association between sentiment, motivation, cognition, and
demographic characteristics to regulate the emotions of
learners. Santos et al. [19] studied SA in online students’
reviews to identify factors that influence international students’
choice of an HEI. They also suggested aspects that
HEI managers may have to consider to attract more inter- 
national students, such as: online information about (HEI) 
offerings, students’ comments about their experiences, inter- 
national environment, courses taught in English, and sup- 
port to students’ accommodation or expenses. Sindhu et 
al. [20] proposed an aspect-oriented SA system based on 
Long Short-Term Memory (LSTM) models. They considered
two datasets with students’ comments, namely one from
Sukkur IBA University and the standard SemEval-2014 dataset. They
suggested that the evaluation of teaching performance would
have to consider six dimensions: teaching pedagogy, behavior,
knowledge, assessment, experience, and general. We
previously created a tool for the analysis of student com- 
ments [8] but it was limited to a fixed, manually created 
dictionary, which might therefore not take into account some 
relevant words. 


The choice of a university to enroll in is a difficult decision 
and, at the same time, the information available on the inter- 
net is overwhelming. To address these issues, Balachandran 
and Kirupananda [3] proposed an aspect-based sentiment 
analysis tool to evaluate the reputation of universities in Sri 
Lanka from users’ comments in Facebook and Twitter, using 
the StanfordCoreNLP library to perform sentiment analysis. 
Lytras et al. [12] built the Learning Analytics Dashboard 
for E-Learning (LADEL) tool to monitor different sources, 
such as student blogs, social networks and Massive Open On- 
line Courses (MOOC) in search of comments that express 
satisfaction, anxiety, efficiency, frustration, or abandonment.
LADEL is composed of four modules: collection, cleaning, 
word cloud and sentiment of opinion. Sivakumar and Reddy 
[21] extracted students’ comments using the Twitter API 
and tried to analyze the relations between word aspects and 
phrases of student opinion. They used a sentiment package 
available in R to find the polarity of the sentences and then 
applied k-means clustering and naive Bayes for the sentiment
analysis classification. 


de Oliveira and de Campos Merschmann [6] analyzed the 
combination of NLP pre-processing tasks (tokenization, POS 
tagging, stemming, among others) with three classifiers (Random
Forest, Support Vector Machine, and Multilayer Perceptron),
and discussed their predictive performance. They
evaluated these tasks in five Portuguese datasets related 
to sentiment analysis, encompassing comments, news and 
tweets. They analyzed some combinations of preprocessing 
tasks and classifier 


This paper focuses on identifying students’ sentiments ex- 
pressed in comments about professor performance in Higher 
Education. It uses the pre-trained model called Bidirec- 
tional Encoder Representations from Transformers (BERT) 
[7] for the sentiment analysis task. BERT-style models are 
the current state-of-the-art in several NLP tasks, including 
entity recognition and sentiment analysis. BERT’s archi- 
tecture is based on multi-layered transformers, which are 
particularly optimized to be trained on GPUs and TPUs 
with significant amounts of data. For this reason, a recipe 
for success with these models is to pre-train them with large 
datasets (in the order of millions of documents) on general 
tasks such as masked language modeling or next-sentence
prediction [7]. This pre-training allows the model to learn a
lot about language patterns (that are independent of
the task we care about) and makes it easier to train the model
specifically for other language tasks even without the need
for large amounts of annotated data. Our corresponding
code is available on GitHub¹.


3. CASE STUDY 


3.1 Course Survey Data 

In the rest of this paper, we use course to denote “a series 
of lectures in a particular subject”, and class to describe “a 
particular instance of a course”. Therefore, students enroll 
in a class of a course. We assume that classes run on a per 
semester basis, and use <year>.1 and <year>.2 to denote 
the first and second semesters of the calendar year, respec- 
tively. 


Since 2005, at the Brazilian University used in the case
study, students have been invited, at the end of each semester,
to answer a questionnaire for each class they took in the
semester. Students’ participation in the survey is not manda- 
tory. The questionnaire has a set of closed-ended questions 
about the professor that taught the class, and a separate 
set of closed-ended questions about the course the class is 
an instance of. For each closed-ended question, the student 
chooses a score from a Likert scale (1-5). The questionnaire 
also has one open-ended question which invites students to 
write as many sentences as they like to express their evalu- 
ation of the professor that conducted the class, and likewise 
for the course the class is an instance of. The comments are 
in Portuguese and the sentences are often ungrammatical. 
We are interested in the sentiment analysis of the students’ 
comments about the professor performance, which we will 
refer to as the comments for brevity. 


The purpose of the case study is to analyse comments collected
from the questionnaires administered in the first and second
semesters of 2019 and 2020. However, we also use the
comments collected from the questionnaires administered in both
semesters of 2018 for pre-training (see Section 5). The reason
for using comments from 2018 for pre-training is that
we wanted to make sure that no comment used in the analysis
step had been observed before in the pre-training step.
Using the 2018 data is possible because it has been observed
that the vocabulary students use to write comments has not
changed significantly over the years. Table 1 presents the
number of comments for the 2018, 2019 and 2020 student
surveys.


¹ https://github.com/hguillot/Sentiment-Analysis-of-Student-Surveys-with-BERT


Table 1: Number of comments about professor performance
in classes.

Semester | #Comments
2018.1 and 2018.2 | 10,077
2019.1 | 3,182
2019.2 | 1,910
2020.1 | 3,492
2020.2 | 2,219


Table 2: Structure of the professor questionnaires.

Year | Class Mode | #Closed-ended Questions | #Open-ended Questions
2018 | in-person | 10 | 1
2019 | in-person | 16 | 1
2020 | online | 20 | 1


As far as professor evaluation is concerned, the questionnaires
varied slightly from 2018 to 2019. Also, in early
2020, the COVID pandemic forced the university to move all
classes online, taught with the help of videoconferencing
software and a Learning Management System (LMS), and
they remained online throughout 2020. The questionnaire, used
for classes taught in 2020.1 and 2020.2, was then modified 
accordingly. Table 2 summarizes the structure of the various 
questionnaires that the case study is concerned with. 


Given this new reality, forced by the COVID pandemic, it is 
reasonable to ask if the professors were prepared for online 
classes and if this would affect the students’ evaluation of 
the professor performance at the end of 2020.1 (the early- 
COVID scenario). 


As a first answer to this question, consider the last
closed-ended question incorporated in the 2019/20 surveys: 
“O: Overall evaluation of the professor”. Figure 1 depicts 
the distribution of the scores of Question O per semester, 
grouped as 1 and 2, for “negative”, 3, for “neutral”, and 4 
and 5, for “positive”, considering only questionnaires with a 
non-empty comment about professor performance. Figure 
1 shows that students in fact evaluated the overall profes- 
sor performance better in 2020.1 (again, the early-COVID 
scenario) than in the other semesters. 


But the question remains if the overall sentiment of the com- 
ments about professor performance points in the same direc- 
tion. 


3.2 Use of the Course Survey Data 

This section describes how the course survey data were used
to construct models for the sentiment analysis of comments
about professor performance (recall that each questionnaire
has only one such comment).


Figure 1: Distribution of the scores of Question O per
semester (considering only questionnaires with a non-empty
comment about professor performance).


We first observe that, unlike in the Twitter sentiment analysis
reported in [16], no text pre-processing was necessary,
since the students’ comments do not significantly depart from
written Portuguese, although they often contain ungrammatical
sentences.


The models used manually annotated comments, obtained 
as follows. From the course surveys of the two semesters 
of 2019, 800 questionnaires with non-empty professor com- 
ments were randomly chosen, using the following criteria: 
5 samples were chosen for each of the Likert scale scores 
(1-5) for each of the 16 closed-ended questions (5 * 5 * 16 
= 400) in each of the semesters (400 * 2 = 800). The com- 
ments of the selected questionnaires were manually classified 
into 3 categories: positive, when the comment only praised 
the professor; negative, when the comment only criticized 
the professor; and neutral when the comment expressed no 
opinion or when the comment both praised and criticized 
the professor. Table 3 shows the number of comments in 
each of these classes. 
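
The sampling criteria can be sketched as follows for a single semester. The data layout (a dictionary mapping a questionnaire id to its 16 scores), the synthetic scores, and the rule that a questionnaire is never selected twice are assumptions made for illustration only.

```python
import random

SCORES = range(1, 6)       # Likert scale scores 1-5
QUESTIONS = range(1, 17)   # the 16 closed-ended questions


def stratified_sample(questionnaires, per_cell=5, seed=0):
    """Pick per_cell questionnaires for every (question, score) pair.

    questionnaires maps a questionnaire id to its list of 16 scores
    (a hypothetical layout). A questionnaire is never picked twice.
    """
    rng = random.Random(seed)
    chosen = []
    used = set()
    for question in QUESTIONS:
        for score in SCORES:
            pool = [
                qid
                for qid, answers in questionnaires.items()
                if answers[question - 1] == score and qid not in used
            ]
            # Raises ValueError if fewer than per_cell candidates remain.
            picked = rng.sample(pool, per_cell)
            used.update(picked)
            chosen.extend(picked)
    return chosen


# Synthetic semester: 3,000 questionnaires with random scores.
data_rng = random.Random(1)
questionnaires = {
    qid: [data_rng.randint(1, 5) for _ in QUESTIONS] for qid in range(3000)
}
sample = stratified_sample(questionnaires)
```

With 5 questionnaires for each of the 5 * 16 cells, one semester yields 400 questionnaires, and repeating the procedure for the second semester gives the 800 used for manual annotation.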


The pre-training step (see Section 5) used data from the 2018
student surveys as follows. We considered a dataset with all
questionnaires with non-empty comments from the 2018 student
surveys. But, since the questionnaire administered in 2018
had no overall professor evaluation (Question O), we used
the average score $s_{avg}[q] \in [1,5]$ of all questions of a questionnaire
$q$ to induce a label $c[q] \in \{$“negative”, “neutral”,
“positive”$\}$ for the comment as follows: if $s_{avg}[q] < 3$ then
$c[q] =$ “negative”; if $3 \le s_{avg}[q] < 4$ then $c[q] =$ “neutral”;
and if $s_{avg}[q] \ge 4$ then $c[q] =$ “positive”. Figure 2 shows
the distribution of the average scores obtained.
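
The label induction rule can be sketched as a small function. The function name is illustrative, and the treatment of the boundary values (an average of exactly 3 is “neutral” and an average of exactly 4 is “positive”) is an assumption chosen to mirror the grouping of the Question O scores in Section 3.1.

```python
def induce_label(scores):
    """Map the closed-ended scores of one 2018 questionnaire to a label.

    scores is the list of Likert scores (1-5) given in the questionnaire.
    Boundary handling at 3 and 4 is an assumption (see the text above).
    """
    s_avg = sum(scores) / len(scores)
    if s_avg < 3:
        return "negative"
    if s_avg < 4:
        return "neutral"
    return "positive"


# Hypothetical questionnaires with 10 closed-ended answers each.
examples = {
    "q_low": [2, 1, 3, 2, 2, 1, 2, 3, 2, 2],
    "q_mid": [3, 4, 3, 3, 4, 3, 4, 3, 3, 4],
    "q_high": [5, 4, 5, 5, 4, 5, 5, 4, 5, 5],
}
labels = {qid: induce_label(scores) for qid, scores in examples.items()}
```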


4. A SENTIMENT ANALYSIS MODEL


In this paper, we focus on the polarity classification task,
which aims to classify comments that express opinions
or reviews into “positive”, “negative” or “neutral”, or even
into more than these three classes. We consider neither subjectivity
classification, i.e., the task of verifying the subjectivity
or objectivity of a comment, nor irony detection, i.e.,
the task of verifying whether the comment is ironic or not.


Figure 2: Distribution of the average score of all questions of
a questionnaire from 2018.


Table 3: Distribution of the number of questionnaires per
class of comment about professor performance, using the
manual classification and the automatic classification induced
by the score of Question O (considering 800 questionnaires
with a manually classified comment about professor performance).

Year   | Classification | Positive | Negative | Neutral
2019.1 | Manual         | 107      | 220      | 73
2019.1 | Automatic      | 187      | 150      | 63
2019.2 | Manual         | 119      | 203      | 78
2019.2 | Automatic      | 201      | 138      | 61


We use BERT [7], which achieves outstanding results on 
a number of NLP tasks. The core of the architecture has 
been pre-trained on a very large amount of unlabeled data. 
The model is then fine-tuned on small supervised datasets, 
designed for the task in question. 


For our case study, BERT encodes each comment into a
768-dimensional embedding and then a dense layer transforms
the embedding into a three-dimensional vector for each
comment that indicates the probability that the comment
belongs to each of the three classes: “positive”, “negative”
or “neutral”. We adopted the BERT-Base, Multilingual Cased
version² (for 104 languages, with 12 layers, 768 hidden units,
12 heads, and 110M parameters), which is required since
the comments are written in Portuguese. To significantly
speed up training and inference with our model,
we limited the size of each input comment to 64 tokens,
which is enough to cover the vast majority of the comments.
Any comment with fewer than 64 tokens was padded with the
‘[PAD]’ symbol already allocated in BERT’s vocabulary and
any comment with more than 64 tokens was truncated.
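
The padding and truncation step can be illustrated with a short sketch. The function name `pad_or_truncate` is hypothetical, and the sketch operates on an already tokenized comment; it ignores special tokens such as ‘[CLS]’ and ‘[SEP]’, which a real BERT tokenizer would also add.

```python
def pad_or_truncate(tokens, max_len=64, pad_token="[PAD]"):
    """Return a token list of exactly max_len items."""
    truncated = list(tokens)[:max_len]           # drop tokens beyond max_len
    padding = [pad_token] * (max_len - len(truncated))
    return truncated + padding                   # pad short comments on the right
```

A fixed input length keeps every batch the same shape, which is what makes the speed-up mentioned above possible.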


Finally, the model was implemented using Keras and run
on GPUs.


5. EXPERIMENTS AND RESULTS 


We started our experimental setup by executing a pre-training
step that aims at getting the model used to the style of the
students' comments using data that was not manually annotated.
To do that, we used the set of comments from the 2018
student surveys, with the scores the students assigned to the
professor as a proxy for the labels, as explained in Section 3.2.
We started the pre-training experiment with the publicly
available multilingual BERT checkpoint and trained for
10 epochs, resulting in a newly trained checkpoint, which we
call from this point on the pre-trained checkpoint.


² Available at https://github.com/google-research/bert/blob/master/multilingual.md


Table 4: Results of the experiments.

Experiment   | Accuracy   | Precision  | Recall     | F1
Zero-shot    | 50.2 ± 2.3 | 54.2 ± 2.2 | 51.8 ± 2.8 | 53.0 ± 2.4
From scratch | 86.3 ± 1.8 | 84.5 ± 2.3 | 83.0 ± 3.1 | 83.7 ± 2.4
Fine-tuned   | 87.5 ± 2.0 | 84.6 ± 2.0 | 84.8 ± 2.0 | 84.6 ± 2.5


After the pre-training step, we proceeded to experiment with 
three setups, using a 5-fold cross-validation strategy, applied 
to the set of 800 manually classified comments. Therefore, 
each round of cross-validation used 640 comments for train- 
ing and 160 comments for testing. The three setups we used 
were as follows: 


• Zero-shot: this experiment does not perform any training
with the manually classified comments. Instead, it
performs inference on the test set directly, using the
pre-trained checkpoint that resulted from the pre-training
step. If this model's performance were good, it would
show that manually annotating comments is not necessary.


• From scratch: this experiment does not use the pre-trained
checkpoint that resulted from the pre-training
step. Instead, it starts with the multilingual BERT
checkpoint and uses the manually classified comments
to train and evaluate the model. The objective of this
experiment is to understand whether the pre-training step is
necessary to obtain top-quality results.


• Fine-tuned: this experiment uses the pre-trained checkpoint
that resulted from the pre-training step as the
starting point when training with the manually
classified comments. This experiment aims
at evaluating whether combining pre-training and manually
annotated comments helps in obtaining top-quality results.


Table 4 shows the results of the 5-fold cross-validation (each 
cell indicates the average and the standard deviation over 
the 5 rounds). Observe that the fine-tuned model obtained 
the best results, which indicates that combining pre-training 
and manually annotated comments helps in obtaining top- 
quality results. 


We also applied the Fisher-Irwin test [18] to examine the
hypothesis that the Fine-tuned model does not have a
classification performance equivalent to that of the
Zero-shot and From scratch models. For this purpose, we
applied the test twice. In the first test, the null
hypothesis (the Fine-tuned classifier has a proportion of correct
classifications equivalent to that of the Zero-shot classifier)
was tested against the alternative hypothesis (the Fine-tuned
classifier has a proportion of correct classifications superior
to that of the Zero-shot classifier), and the null hypothesis
was rejected at the usual levels of statistical significance
(5% and 10%). The same happened in the second test,
where the null hypothesis (the Fine-tuned classifier has a
proportion of correct classifications equivalent to that of the
From scratch classifier) was tested against the alternative
hypothesis (the Fine-tuned classifier has a proportion of
correct classifications superior to that of the From scratch
classifier). Since both null hypotheses were rejected at the
usual levels of statistical significance (5% and 10%), we
accept the alternative hypotheses and conclude that our
results are statistically significant.


Figure 3: Accuracy for From scratch and Fine-tuned using
train set of 40, 80, 160, 320 and 640 comments.
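
For readers unfamiliar with the Fisher-Irwin test (commonly known as Fisher's exact test), a one-sided version for a 2x2 table of correct and incorrect classifications can be sketched with the standard library alone. The function name and the table layout are assumptions made for illustration; the source does not state which implementation was used.

```python
from math import comb


def fisher_one_sided(a, b, c, d):
    """One-sided Fisher exact test for the 2x2 table [[a, b], [c, d]].

    Rows are classifiers, columns are (correct, incorrect) counts.
    Returns P(X >= a) under the null hypothesis that both rows share the
    same proportion of correct classifications (hypergeometric tail).
    """
    row1 = a + b            # predictions made by the first classifier
    col1 = a + c            # correct predictions over both classifiers
    n = a + b + c + d       # all predictions
    upper = min(row1, col1)
    favourable = sum(
        comb(col1, k) * comb(n - col1, row1 - k) for k in range(a, upper + 1)
    )
    return favourable / comb(n, row1)


# Classic small example: table [[3, 1], [1, 3]] gives p = 17/70.
p_value = fisher_one_sided(3, 1, 1, 3)
```

The null hypothesis is rejected at a given significance level when the returned p-value falls below that level.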


An important question that arises concerns the number of
comments that must be manually annotated to achieve an
acceptable level of accuracy. To address this question, we
ran the following cross-validation experiment, with a
decreasing number of manually annotated comments used for
training. We divided the 800 manually annotated comments
into 5 sets of 160 comments each. Let $G_1, \ldots, G_5$ denote these
sets and $\bar{G}_i$ denote the 640 comments not in $G_i$. For each
$i = 1, \ldots, 5$, we computed the accuracy and the F1-score of
the from-scratch and the fine-tuned models, using $G_i$ for
testing and subsets of $\bar{G}_i$, of sizes 640, 320, 160, 80, and 40,
for training. Finally, for each cardinality of the training sets,
we computed the average accuracy and the average F1-score
of each model. Figures 3 and 4 depict the results.
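
The splitting scheme of this experiment can be sketched as below. The generator name is hypothetical, and the sketch assumes that the smaller training subsets are nested prefixes of the 640 remaining comments; the source does not say how the subsets were drawn.

```python
import random


def learning_curve_splits(items, n_folds=5, sizes=(640, 320, 160, 80, 40), seed=0):
    """Yield (fold index, train size, train subset, test set) tuples."""
    shuffled = list(items)
    random.Random(seed).shuffle(shuffled)        # reproducible shuffle
    fold_size = len(shuffled) // n_folds
    folds = [shuffled[i * fold_size:(i + 1) * fold_size] for i in range(n_folds)]
    for i, test in enumerate(folds):
        # Complement of the test fold: the comments not in G_i.
        rest = [x for j, fold in enumerate(folds) if j != i for x in fold]
        for size in sizes:
            # Assumption: subsets are taken as prefixes, so they are nested.
            yield i, size, rest[:size], test
```

With 800 comments this produces 25 train/test pairs, one per fold and training set size, over which accuracy and F1 can be averaged per size.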


Figure 3 shows that, using 640 manually annotated com- 
ments for training, the fine-tuned model achieved an aver- 
age accuracy of 87.5% and the from-scratch model achieved 
86.3%, and so on for the other training set cardinalities (320, 
160, 80 and 40). Therefore, based on the level of accepted 
accuracy, one can balance the effort to manually annotate 
the comments. 


Figure 3 also shows that: (i) using just 40 manually annotated
comments for training, the fine-tuned model achieved
an average accuracy of 77.1%, while the from-scratch model
only achieved an accuracy of 70.8% when trained with 160
comments, that is, 4 times as many comments; (ii) the
fine-tuned model, again trained with just 40 comments, achieved
a much better accuracy than that of the zero-shot model,
shown in the first line of Table 4 (the zero-shot model is
equivalent to training the fine-tuned model with 0 comments);
(iii) the pre-trained checkpoint had a positive impact,
since the fine-tuned curve is always above the
from-scratch curve; (iv) the fine-tuned model achieved a standard
deviation smaller than that of the from-scratch model, which
means that this technique is more stable and less susceptible
to changes due to the samples. These observations reinforce
that, with an adequate pre-training strategy, we may achieve
good results without the need to manually annotate a large
amount of data.


Figure 4: F1 for From scratch and Fine-tuned using train
set of 40, 80, 160, 320 and 640 comments.


Finally, we used the fine-tuned model to classify the full set 
of comments from the 2020.1 and 2020.2 surveys, and the set 
of comments from 2019.1 and 2019.2 that were not manually 
classified. Then, we added the manually classified comments 
from 2019.1 and 2019.2 to obtain the final distributions for 
the four semesters, as shown in Figure 5. 


For comparison purposes, Figure 5 includes the distributions 
of the comment classifications induced by the score of Ques- 
tion O as explained in Section 3.1. Note that Question O 
induces a classification biased towards positive comments, 
when compared with the classification based on the fine- 
tuned model. This is also observed when just the manually 
classified comments are considered. 


In conclusion, the distributions of the students' comments
sentiment and of the scores of Question O indicate that
students evaluated the professor performance better in 2020.1
(the early-COVID scenario) than in the other semesters,
which seems to indicate that students acknowledged the
effort professors made to keep classes running during 2020.1,
and that the enthusiasm continued throughout 2020.2
(late-COVID scenario). Furthermore, students evaluated the
professor performance better in 2020.1 and 2020.2 (online classes),
by a margin of nearly 10%, when compared with 2019.1 and
2019.2 (in-person classes), respectively.


Figure 5: Distribution of the final classification of the
comments from all surveys, using the fine-tuned model, added
to the manually classified comments from 2019.1 and 2019.2
(shown in blue), and the classification of the comments from
all surveys, using the score of Question O (shown in orange).


6. CONCLUSIONS 


This paper first described a sentiment analysis model for stu- 
dents’ comments about professor performance. The model is 
based on BERT and has achieved good results when applied 
to a case study with students’ comments about professor 
performance, obtained in 2019/20. 


Then, the paper applied the model to compare the overall
sentiment of the students' comments about professor
performance in different scenarios: in-person classes in 2019.1 and
2019.2 (pre-COVID scenarios); the emergency shift to
online, synchronous classes in 2020.1 (early-COVID scenario);
and the planned online classes in 2020.2 (late-COVID
scenario). The results show that students acknowledged the
effort professors made to keep classes running during 2020.1,
and that the enthusiasm continued throughout 2020.2.
Furthermore, the results show that students evaluated professor
performance for online courses better than for in-person
courses, by a margin of nearly 10%, which seems to indicate
that students favor online classes.


This paper also discussed the number of comments that must 
be manually annotated to achieve good results. Future ex- 
periments can take advantage of this discussion to reduce 
the manual annotation effort, even with datasets obtained 
from other universities. 


The stability of the models was also investigated, indicating 
that the fine-tuned model achieved a lower standard devia- 
tion, which means that this technique leads to more stable 
results. The fine-tuned model also achieved a higher per- 
formance, when compared to both the zero-shot and from- 
scratch models, in terms of the proportion of correct classi- 
fications, and the difference was statistically significant. 


We plan to extend the analysis to past student surveys, 
which go back to 2005, and to the student survey to be 
applied at the end of 2021.1, when classes will still be on- 
line. We also plan to cross-check these preliminary findings 
with other surveys conducted in 2019/20 at the Brazilian 
University and elsewhere. 
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ABSTRACT 


Influenced by Covid-19, online learning has become one of the
most important forms of education in the world. In the era of
intelligent education, knowledge tracing (KT) can provide excellent
technical support for individualized teaching. For online learning,
we propose a new knowledge tracing method that integrates
mathematical exercise representation and association of
exercises (ERAKT). For exercise representation, we
represent the multi-dimensional features of the exercises, such as
formulas, text and associated concepts, by using an ontology
replacement method, a language model and embedding technology,
so that we obtain a unified internal representation of each exercise.
In addition, we use a bidirectional long short-term memory neural
network to acquire the association between exercises, so as to
predict a student's performance in future exercises. Extensive
experiments on a real dataset demonstrate the effectiveness of the
ERAKT method and verify that adding multi-dimensional features
and exercise association can indeed improve the accuracy of
prediction.


Keywords 


Knowledge tracing, context-aware, exercise representation.


1. INTRODUCTION 


As one of the key technologies of adaptive learning, knowledge
tracing has become a research hotspot in adaptive education. The
main task of knowledge tracing is to automatically track how students'
level of knowledge acquisition changes over time, according to their
historical learning trajectory, so as to accurately predict their
performance in future learning. In actual teaching, teachers can use
the predicted results to adjust the teaching plan dynamically, improve
teaching quality and efficiency, and achieve precise teaching goals.


Knowledge tracing (KT) was first proposed by Atkinson.
Bayesian Knowledge Tracing (BKT) [1] is one of the most
popular knowledge tracing methods of the early stage. BKT
assumes that students never forget a knowledge concept once
they have mastered it, which is not in line with the actual teaching
situation. Later, with the continuous development of deep learning,
more and more scholars combined knowledge tracing tasks with
deep learning methods, among which Deep Knowledge Tracing
(DKT) [2] is the most popular and commonly used one. DKT partly
relaxes the unrealistic assumption made by BKT and can represent
the concept proficiency of learners more accurately. However,
representing the concept state by a hidden layer in DKT
is inaccurate, making a student's mastery level difficult to track.
Furthermore, Jiani Zhang et al. proposed Dynamic Key-Value
Memory Networks for Knowledge Tracing (DKVMN) [3] based on
memory neural networks, and DKVMN performs significantly better
than BKT and DKT. In recent years, the
University of Science and Technology of China team proposed
some methods that integrate exercise records and exercise
materials into KT based on the existing KT methods, such as
EKT [11] and qDKT [12], which have stronger explanatory power
and gradually improved performance.


However, most traditional knowledge tracing methods only
consider the exercise records of students, using the covered
concepts to index the exercises and ignoring the influence of the
exercise formulas, text or concepts on a student's knowledge state.
In fact, besides exercise interaction records, the multi-dimensional
information of exercises has an important impact on a student's
performance. Therefore, we propose a mathematical knowledge tracing
method that integrates the representation and association of a
student's exercises, so as to address the information loss caused by
ignoring the multi-dimensional representation and association of
exercises in traditional knowledge tracing and to improve accuracy.
The contributions of ERAKT are as follows:


(1) We propose a new context-aware knowledge tracing method
that can automatically learn and predict a student's
performance in the next exercise.

(2) We propose an exercise representation method that integrates
multi-dimensional information, including text, formulas and
associated concepts.

(3) We propose a sequential question association mining method
based on a bidirectional neural network to acquire the association
between exercises.


2. RELATED WORK 


2.1 Semantic representation 

In the domain of text processing, the most important task is to
transform text into a vector form that can be understood and
processed by computers, that is, semantic representation. There are
many ways to obtain a semantic representation, such as Word2Vec,
TextCNN, FastText, and BERT.


Word2Vec [4] was proposed by Google's Mikolov team. It is an
early application of neural networks to semantic
representation. Word2Vec uses fixed-dimensional vectors to
represent words. For sentences, the Doc2Vec method [5] was derived,
which builds a model by means of a neural network structure;
paragraph vectors are obtained during the training of the model.
Both are typical unsupervised text representation methods.
Compared with the traditional bag-of-words method, they can
better integrate the internal information of the text, such as context,
semantics and word order.


Yoon Kim modified the input layer of the traditional CNN. In 2014, he
proposed a text classification method named TextCNN [6], which
has a simpler network structure, requires less computation and
trains faster. Therefore, compared with the traditional CNN method,
TextCNN performs better in the field of semantic representation.
Another method for obtaining a semantic representation is FastText [7].
FastText can train word vectors by itself without requiring
pre-trained word vectors, which speeds up training and testing while
maintaining high accuracy.


The BERT [8] method was also proposed by a Google team to solve
the semantic representation problem. To deal with the effects of
polysemous words in sentences, BERT exploits the Transformer model
with self-attention and multi-head attention mechanisms [9],
which uses the context of the sentence to determine the
specific semantics. BERT is bidirectional, allowing for more
accurate results and adaptive learning in a multi-tasking
environment.


2.2 Knowledge tracing considering multi-dimensional characteristics of exercises

With the rapid development of deep learning, more and more
scholars exploit different deep learning methods to represent
exercises in order to complete the knowledge tracing task and
attempt to comprehensively consider the impact of different
features of the exercises on knowledge tracing. The
multi-dimensional features mainly include the textual materials and
the concepts involved in the exercises.


The majority of existing KT methods utilize concepts to
index exercises to avoid over-parameterization. For example, both
DKT and DKVMN treat all the exercises covering the same
concept as a single one. Compared with the former, the key-value
matrix in DKVMN extends the hidden feature representation of the
exercises, but it still does not take advantage of the characteristics
of other dimensions. The Prerequisite-driven Deep Knowledge
Tracing (PDKT) [23] method integrates the structural information
between concepts with the help of the Q matrix from cognitive
diagnosis theory, and specifically considers the contextual
relationship between concepts. The Self-Attentive Knowledge
Tracing (SAKT) [24] method exploits concepts to index exercises
and introduces a self-attention mechanism to consider the degree of
relevance between concepts. The Context-Aware Attentive
Knowledge Tracing (AKT) [13] method utilizes a novel monotonic
attention mechanism that relates a student's future responses to
assessment exercises to their past responses; attention weights are
computed using exponential decay and a context-aware relative
distance measure, in addition to the similarity between exercises.
Moreover, AKT utilizes the Rasch model to capture individual
differences between exercises. Graph-based Knowledge
Tracing (GKT) [25] also introduces concepts to index exercises; at
the same time, it constructs a graph to represent the
association between concepts and updates the student's knowledge
status through the GRU mechanism.


The exercise materials that KT methods consider are mainly text
materials and the concepts covered. EERNN (Exercise-Enhanced
Recurrent Neural Network) [10] and qDKT (Question-centric Deep
Knowledge Tracing) [12] predict a student's performance only by
making full use of the practice records and the text of the exercises.
EKT (Exercise-aware Knowledge Tracing for Student Performance
Prediction) [11] is an improved method based on EERNN. It is the
first method to comprehensively consider the influence of a
student's practice records and exercise materials (the concepts and
text contained) on performance. However, it is worth noting that in EKT
the exercise text is represented by an LSTM; because of its internal
structure, an LSTM cannot be computed in parallel, which makes text
processing slower, so the effect is not very satisfactory. Exploring
Hierarchical Structures for Recommender Systems (EHFKT) [21]
makes full use of concepts, difficulty, and semantic features to
represent exercises. The first two are embedded using TextCNN,
and semantic features are extracted using BERT. The Introducing
Problem Schema with Hierarchical Exercise Graph for Knowledge
Tracing (HGKT) [22] method exploits a hierarchical graph neural
network to learn the graph structure of the exercises. It also
introduces two attention mechanisms to better mine the knowledge
state of learners, and utilizes the K&S diagnosis matrix to obtain
the diagnosis result.


3. PROPOSED METHOD 


A student's knowledge mastery is tracked by observing the student's
interactions with different exercises. This is a supervised
sequence prediction problem in the field of machine
learning. In this section, we first define the problem and then
describe the proposed method in detail.


3.1 Problem Definition 


Assuming that each student does exercises separately, we define the
student sequence $s = \{(e_1, r_1), (e_2, r_2), \ldots, (e_T, r_T)\}$, where $e_T$
belongs to the exercise sequence $E$ and represents the exercise
done by the student at time $T$, and $r_T$ is the corresponding response.
Usually, 0 or 1 is used to mark whether the answer is correct,
i.e., 1 means correct and 0 means wrong.
In the ERAKT method, we add the text content of each exercise
to the student's exercise sequence $E$. Because we mainly focus
on the knowledge tracing task in mathematics, where exercises generally
contain not only words but also specific mathematical elements, the text
is represented as $e = \{w, f\}$, where $w = \{w_1, w_2, \ldots, w_M\}$
represents the words in the exercise, and $f = \{f_1, f_2, \ldots, f_N\}$
represents the mathematical components. Furthermore, the
corresponding concepts, $k = \{k_1, k_2, \ldots, k_K\}$, are key information
in the KT task; they are summarized into a concept matrix for
embedding, and the student sequence after embedding becomes the
sequence $S = \{(e_1, k_1, r_1), (e_2, k_2, r_2), \ldots, (e_T, k_T, r_T)\}$, $s \in S$.


The ultimate goal of our ERAKT method is to track a student's
knowledge status through hidden layers in order to predict the
student's performance in a future exercise, that is, the response to
exercise $e_{T+1}$ at the next moment $T+1$. To do so, we take into
account the student's record of exercises, the text content of the
exercises and the concepts included. The ERAKT framework is shown
in Figure 1. The method includes three major parts: the exercise
representation module, the exercise association module and the
performance prediction module.

Figure 1. ERAKT framework. (a) The process of obtaining the exercise representation. After the replacement of the formula, we splice the
formula with the original text, and the final exercise representation vector is obtained through 3 embedding layers and 12 encoder layers.
(b) The exercise representation vector is combined with the knowledge concepts and the student response corresponding to the exercise, and
they are input together into the Bi-LSTM network to obtain the student's hidden state representation. (c) The student response prediction
part, which predicts the student's response to the exercise at time T+1.


3.2 Exercise Representation Method 

The information of the exercise includes the material itself and the 
concepts associated with it. To realize the unified representation of 
mathematics exercises, firstly, we need to construct the different 
dimension features in the exercises to represent the material itself 
and its associated concepts, and then integrate the feature 
representations of multidimensional exercises into a unified feature 
vector. 


3.2.1 Preprocessing of exercise formula 

Mathematical exercises involve text and formulas; the formulas are
written in LaTeX, and we need to convert them into unified
text expressions in advance. Since the LaTeX formulas in the
exercises follow a set of unified coding rules, we first replace the
LaTeX formulas uniformly through an ontology replacement method,
and then perform unified preprocessing together with the other text.


In the unified preprocessing of the formula text, the entities and
attributes in the exercises are identified and replaced, from entities
to attributes. During the replacement process, the replaced entities
or attributes and their replacement forms are saved in a dictionary. That
is, the formula texts $f = \{f_1, f_2, \ldots, f_N\}$ are replaced to obtain
$f' = \{f'_1, f'_2, \ldots, f'_N\}$. Part of the replacement relationship is
shown in Table 1.


Table 1. Formula replacement table. 


f f 
complement thee 
sqrt ARI 
‘ sa 
- ms 
cm BX 
pi x 


After the replacement, we splice $f'$ with the text $w =
\{w_1, w_2, \ldots, w_M\}$ according to the original positions, and obtain the
text representation of the exercise, $W$:


$$W = w \oplus f' \tag{1}$$


For the exercise representation $W$, we apply Jieba, a Chinese word
segmentation package for Python. Jieba has three
segmentation modes: precise mode, full mode and search engine
mode. Here we use precise mode to accurately segment a
complete sentence into independent words according to the
segmentation algorithm. We then use a self-built stop word list to
delete words that do not express specific meanings. This
process is shown in Figure 2.
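
The preprocessing pipeline (formula replacement, segmentation, stop word removal) can be sketched in simplified form. Everything named here is hypothetical: the replacement dictionary and stop word list are illustrative placeholders rather than the entries of Table 1, and a regular expression over English text stands in for Jieba's precise-mode segmentation of Chinese text.

```python
import re

# Hypothetical replacement dictionary: LaTeX command -> plain-text form.
REPLACEMENTS = {
    r"\sqrt": " square root ",
    r"\pi": " pi ",
    r"\complement": " complement ",
}
# Hypothetical stop word list.
STOP_WORDS = {"the", "of", "a"}


def replace_formula_tokens(text, table=REPLACEMENTS):
    """Replace every LaTeX command found in the table by its text form."""
    keys = sorted(table, key=len, reverse=True)  # longest command first
    pattern = re.compile("|".join(re.escape(k) for k in keys))
    return pattern.sub(lambda m: table[m.group(0)], text)


def preprocess(text):
    """Replace formulas, split into words, then drop stop words."""
    replaced = replace_formula_tokens(text)
    tokens = re.findall(r"[A-Za-z0-9]+", replaced)  # stand-in for Jieba
    return [t for t in tokens if t.lower() not in STOP_WORDS]
```

Calling `preprocess(r"Find the value of \sqrt{16}")` would keep the words of the sentence and of the replaced formula while discarding the stop words and the LaTeX braces.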
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Figure 2. Example diagram of exercise formula preprocessing. 


3.2.2 Vector representation of multi-dimensional exercise

In view of the superiority of BERT in the field of natural language
processing, we exploit BERT to conduct self-supervised learning and
training on the text features of exercises. We employ three
embedding layers and a 12-layer encoder to pre-train the exercise
representation. As shown in the exercise representation part of
Figure 1, the embedding part includes three layers. The token
embedding layer converts each segmented word into a
768-dimensional vector, the segment embedding layer
distinguishes between the vectors of two sentences, and the
position embedding layer helps capture the order of words.
When all the embedding processes are done, we add the results of
the three layers element by element to get the input of the BERT encoder
layer. After encoding by the 12-layer encoder, we get the final
text vector representation $b_T$.


$$b_T = e_{token} + e_{segment} + e_{position} \tag{2}$$


In this way, semantic information of exercises can be obtained 
automatically without any extra expert manual coding. 


3.2.3. Representation of concepts 

In the dataset, each exercise is marked with the concepts involved,
$k = \{k_1, k_2, \ldots, k_K\}$. Following EKT, we select the first
concept $k_1$, which is also the most relevant one, to represent the
concept involved in the exercise. The concepts of all exercises are then
encoded into vectors of length $|E|$ through one-hot
encoding, where $E$ is the set of exercises and $|E|$ is the number of
exercises. The encoded vector is passed through a sigmoid
activation layer to obtain the concept representation $c_T$.


After getting the text representation and concepts representation, 
we concatenate them together as the final exercise embedded 
matrix. 


$$x_T = b_T \oplus c_T \tag{3}$$


Here $\oplus$ denotes the concatenation of two vectors along a given dimension,
and the length of the vector obtained after concatenation is $|E| + 768$.
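
A small sketch of the concatenation may help fix the dimensions. The function names are hypothetical, plain Python lists stand in for tensors, and the sigmoid is applied elementwise without learned weights, which is a simplifying assumption about the activation layer.

```python
import math


def concept_vector(concept_index, size):
    """One-hot encode a concept, then apply an elementwise sigmoid."""
    one_hot = [0.0] * size
    one_hot[concept_index] = 1.0
    return [1.0 / (1.0 + math.exp(-v)) for v in one_hot]


def exercise_embedding(text_vec, concept_vec):
    """Concatenate the text vector b_T and the concept vector c_T."""
    return list(text_vec) + list(concept_vec)


# A 768-dimensional text vector and a concept vector of length 10
# give an exercise embedding of length 10 + 768 = 778.
x = exercise_embedding([0.0] * 768, concept_vector(2, 10))
```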


3.3 Exercises association modeling 

After obtaining the final merged exercise embedding matrix, we
need to model the entire exercising process of each student and
obtain the hidden state at each step, which is affected by both the
history of exercises and the student's responses.


3.3.1 Student response embedding 

First, we combine the student's response with each exercise
representation. Specifically, at each step $T$, we combine the exercise
embedding $x_T$ with the corresponding score $r_T$ as the input to the
recurrent neural network.


We first extend the score $r_T$ to a feature vector $\mathbf{0} = (0, 0, \ldots, 0)$
with the same dimensions as the exercise embedding $x_T$ and then
learn the combined input vector $\tilde{x}_T$ as:


$$\tilde{x}_T = \begin{cases} [x_T \oplus \mathbf{0}] & \text{if } r_T = 1, \\ [\mathbf{0} \oplus x_T] & \text{if } r_T = 0. \end{cases} \tag{4}$$


After the concatenation, the student sequence becomes $s =
\{\tilde{x}_1, \tilde{x}_2, \ldots, \tilde{x}_T\}$.
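
The response embedding can be sketched as below, assuming the convention used by EERNN and EKT, in which the exercise embedding occupies the first half of the combined vector for a correct answer and the second half for a wrong one. The function name is hypothetical and lists stand in for tensors.

```python
def combine_with_response(x, r):
    """Combine an exercise embedding x with a binary response r.

    Assumption: r == 1 gives [x, 0] and r == 0 gives [0, x], so the
    position of the non-zero half encodes the correctness of the answer.
    """
    zeros = [0.0] * len(x)
    return list(x) + zeros if r == 1 else zeros + list(x)
```

The combined vector has twice the length of the exercise embedding, so the recurrent network can tell correct and wrong attempts at the same exercise apart.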


3.3.2 One-way time series knowledge tracing 
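Like the original DKT method, a hidden layer is applied to track
changes in the student's knowledge status. The formulas are as follows:


$$i_T = \sigma(W_i \cdot [h_{T-1}, \tilde{x}_T] + b_i) \tag{5}$$
$$f_T = \sigma(W_f \cdot [h_{T-1}, \tilde{x}_T] + b_f) \tag{6}$$
$$o_T = \sigma(W_o \cdot [h_{T-1}, \tilde{x}_T] + b_o) \tag{7}$$
$$\tilde{c}_T = \tanh(W_c \cdot [h_{T-1}, \tilde{x}_T] + b_c) \tag{8}$$
$$c_T = f_T \cdot c_{T-1} + i_T \cdot \tilde{c}_T \tag{9}$$
$$h_T = o_T \cdot \tanh(c_T) \tag{10}$$


Here $c_T$ is the long-term state at time $T$, and $i_T$, $f_T$, and $o_T$ are the
input gate, forget gate, and output gate of the LSTM, respectively. $\tanh$
denotes the tanh activation function, $\tanh(z_i) = (e^{z_i} - e^{-z_i}) /
(e^{z_i} + e^{-z_i})$, and $\sigma$ denotes the sigmoid activation function,
$\sigma(z_i) = 1 / (1 + e^{-z_i})$.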
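
A single step of a standard LSTM cell, corresponding to the gate, candidate, cell state and hidden state updates described above, can be written with plain lists. This is a generic sketch, not the authors' implementation; the parameter dictionary layout and helper names are assumptions.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def affine(W, b, v):
    """Compute W . v + b for a weight matrix given as a list of rows."""
    return [sum(w * x for w, x in zip(row, v)) + b_k for row, b_k in zip(W, b)]


def lstm_step(x, h_prev, c_prev, params):
    """One LSTM step: returns the new hidden state and cell state."""
    v = list(h_prev) + list(x)  # concatenation [h_{T-1}, x_T]
    i = [sigmoid(z) for z in affine(params["W_i"], params["b_i"], v)]  # input gate
    f = [sigmoid(z) for z in affine(params["W_f"], params["b_f"], v)]  # forget gate
    o = [sigmoid(z) for z in affine(params["W_o"], params["b_o"], v)]  # output gate
    cand = [math.tanh(z) for z in affine(params["W_c"], params["b_c"], v)]
    c = [f_k * c_p + i_k * g for f_k, c_p, i_k, g in zip(f, c_prev, i, cand)]
    h = [o_k * math.tanh(c_k) for o_k, c_k in zip(o, c)]
    return h, c
```

With all weights and biases set to zero, every gate equals 0.5 and the candidate state is 0, so the cell state is simply halved at each step; this makes a convenient sanity check.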


3.3.3 Bidirectional time series knowledge tracing 

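To better capture the association between the exercises, we
introduce a bidirectional long short-term memory neural
network (Bi-LSTM) [14] to obtain the hidden state representation of the
students. Because a Bi-LSTM can make full use of the exercise
representations in both the forward and backward directions [15], it can
capture the association between the exercises. Specifically, after
getting $s = \{\tilde{x}_1, \tilde{x}_2, \ldots, \tilde{x}_T\}$, we set the input of the first layer of
the LSTM to $\overrightarrow{h}^{(0)} = \overleftarrow{h}^{(0)} = \{\tilde{x}_1, \tilde{x}_2, \ldots, \tilde{x}_T\}$. At each time $T$, the
forward hidden state of each layer, $(\overrightarrow{h}_T^{(l)}, \overrightarrow{c}_T^{(l)})$, and the backward
hidden state, $(\overleftarrow{h}_T^{(l)}, \overleftarrow{c}_T^{(l)})$, are updated with the input from the previous layer
in each direction. The formulas are as follows:


$$\overrightarrow{h}_T^{(l)}, \overrightarrow{c}_T^{(l)} = \mathrm{LSTM}(h_T^{(l-1)}, \overrightarrow{h}_{T-1}^{(l)}, \overrightarrow{c}_{T-1}^{(l)}; \theta_{\overrightarrow{LSTM}}) \tag{11}$$


$$\overleftarrow{h}_T^{(l)}, \overleftarrow{c}_T^{(l)} = \mathrm{LSTM}(h_T^{(l-1)}, \overleftarrow{h}_{T+1}^{(l)}, \overleftarrow{c}_{T+1}^{(l)}; \theta_{\overleftarrow{LSTM}}) \tag{12}$$


The association between exercises can be captured by the Bi-LSTM.
Since the hidden state of each direction only contains the
association in one direction, it is beneficial to combine the hidden
states of both directions to obtain the final student hidden
state representation:


$$H_T = \mathrm{concatenate}(\overrightarrow{h}_T^{(L)}, \overleftarrow{h}_T^{(L)}) \tag{13}$$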
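
The bidirectional scheme can be illustrated independently of the LSTM internals: one recurrent pass reads the sequence from left to right, another from right to left, and the two states at each position are concatenated. In this sketch the step functions are arbitrary callables and a toy running sum stands in for an LSTM cell; all names are hypothetical.

```python
def run_direction(step, inputs, h0):
    """Apply a recurrent step over the inputs and collect every state."""
    states, h = [], h0
    for x in inputs:
        h = step(x, h)
        states.append(h)
    return states


def bidirectional(step_fwd, step_bwd, inputs, h0):
    """Concatenate forward and backward states position by position."""
    fwd = run_direction(step_fwd, inputs, h0)
    # Run on the reversed sequence, then restore the original order.
    bwd = run_direction(step_bwd, inputs[::-1], h0)[::-1]
    return [f + b for f, b in zip(fwd, bwd)]  # list concatenation


# Toy step: the state is a one-element list holding a running sum.
def running_sum(x, h):
    return [h[0] + x]


states = bidirectional(running_sum, running_sum, [1, 2, 3], [0])
```

Each combined state then summarizes both the exercises before a position and those after it, which is the property the bidirectional network is used for here.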


3.4 Student performance prediction 

After the above steps, we get the student's hidden learning state
sequence $\{H_1, H_2, \ldots, H_T\}$ and the exercise sequence
$\{x_1, x_2, \ldots, x_T\}$, both of which will affect the student's final answer.
We utilize two layers of bidirectional neural networks to obtain the 
predicted student performance, as shown in the formula: 


$y_{T+1} = \mathrm{Tanh}(W_1 \cdot [H_T \oplus x_{T+1}] + b_1)$ (14)
$\tilde{r}_{T+1} = \sigma(W_2 \cdot y_{T+1} + b_2)$ (15)


Figure 4. Performance on the four indicators after adding multi-dimensional features.


The first layer uses the Tanh activation function, and the second
layer uses the sigmoid activation function. After the two layers, the
final prediction result $\tilde{r}_{T+1}$ is obtained. It is a scalar that
represents the probability of answering exercise $e_{T+1}$ correctly.
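A minimal sketch of such a two-layer prediction head is given below, assuming a Tanh hidden layer followed by a single sigmoid output unit. The function name and argument layout are hypothetical and chosen only for illustration.

```python
import math


def predict_correct_probability(H_T, x_next, W1, b1, W2, b2):
    """Tanh layer on [H_T, x_next], then a sigmoid unit giving a probability."""
    z = H_T + x_next  # concatenate the hidden state and the next exercise vector
    y = [math.tanh(sum(w * v for w, v in zip(row, z)) + b)
         for row, b in zip(W1, b1)]
    logit = sum(w * v for w, v in zip(W2, y)) + b2
    return 1.0 / (1.0 + math.exp(-logit))
```

The output always lies strictly between 0 and 1, so it can be read directly as the predicted probability of a correct response.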


3.5 Training and optimization 

The model is trained by minimizing the binary cross-entropy loss
between the true response $r_T$ and the predicted probability of a
correct answer $\tilde{r}_T$. Model parameters, such as the exercise
embedding and student response embedding parameters, are adjusted
by backpropagation until the loss converges. The loss function is
defined as:


$\mathcal{L} = -\sum_{T} \left( r_T \log \tilde{r}_T + (1 - r_T) \log(1 - \tilde{r}_T) \right)$ (16)
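For concreteness, the binary cross-entropy over a sequence of responses can be computed as in the sketch below. The clipping constant `eps` is an assumption added here to avoid taking the logarithm of zero; it is not part of the loss definition.

```python
import math


def binary_cross_entropy(responses, probabilities, eps=1e-12):
    """Summed binary cross-entropy between 0/1 responses and predicted probabilities."""
    total = 0.0
    for r, p in zip(responses, probabilities):
        p = min(max(p, eps), 1.0 - eps)  # keep log() finite
        total -= r * math.log(p) + (1 - r) * math.log(1.0 - p)
    return total
```

A prediction of 0.5 for every response costs log 2 per item, and the loss shrinks toward zero as the predicted probabilities approach the true responses.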


4. EXPERIMENTS 


In order to ensure the reliability of the experimental results, we
carry out several baseline comparison experiments on a real
dataset. This section focuses on the selection of the dataset and
the comparison of benchmark models, as well as a discussion of the
final experimental results.


4.1 Dataset 


The proposed method was validated on a dataset called
EAnalyst-math. The data come from a widely used offline-to-online
evaluation system in China [16], with elementary school math
exercises as the experimental subject. The data collected
by EAnalyst-math mainly include homework, unit tests, and term
tests. Each assignment or evaluation is regarded as a collection of
exercises, which is more in line with the actual education situation
in China. EAnalyst-math recorded a total of 525,638 interactions
from 1,763 students, with an average of 298.1 responses per student.


[Figure 3 plot: "The Dataset of EAnalyst"; x-axis: Number of Exercising Records (0 to 2500); y-axis: Number of Students]


Figure 3. Student-interaction distribution diagram of the
EAnalyst-math dataset.


4.2 Settings 

To evaluate the prediction of students' future responses, we treat
the task as both classification and regression [17]. From the
classification perspective, the area under the Receiver Operating
Characteristic (ROC) curve (AUC) and the prediction accuracy (ACC)
are used to measure prediction performance. From the regression
perspective, we choose Mean Absolute Error (MAE) and Root
Mean Square Error (RMSE) to quantify the distance between the
predicted result and the actual response.
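The four indicators can be computed from the true 0/1 responses and the predicted probabilities as sketched below. The 0.5 decision threshold for ACC and the pairwise formulation of AUC (ties counted as one half) are assumptions of this example, not settings reported in the paper.

```python
import math


def accuracy(y_true, y_prob, threshold=0.5):
    """Share of responses whose thresholded prediction matches the label."""
    return sum((p >= threshold) == bool(y) for y, p in zip(y_true, y_prob)) / len(y_true)


def mae(y_true, y_prob):
    """Mean absolute error between labels and predicted probabilities."""
    return sum(abs(y - p) for y, p in zip(y_true, y_prob)) / len(y_true)


def rmse(y_true, y_prob):
    """Root mean square error between labels and predicted probabilities."""
    return math.sqrt(sum((y - p) ** 2 for y, p in zip(y_true, y_prob)) / len(y_true))


def auc(y_true, y_prob):
    """Probability that a random positive is scored above a random negative."""
    pos = [p for y, p in zip(y_true, y_prob) if y == 1]
    neg = [p for y, p in zip(y_true, y_prob) if y == 0]
    wins = sum(1.0 if a > b else 0.5 if a == b else 0.0 for a in pos for b in neg)
    return wins / (len(pos) * len(neg))
```

The pairwise AUC is quadratic in the number of responses, which is fine for a small check but would normally be replaced by a rank-based computation on a dataset of this size.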


The dataset is split 7:3 by student: 70% of the students are used
for training and validation, and 30% for testing. To reduce the
effect of chance on the evaluation results, we apply standard
five-fold cross-validation to the training and validation subset
for all models, that is, 80% for training and 20% for validation in
each fold, and report the average value as the final comparison result.
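One way to realize this protocol is sketched below: students are shuffled once, 30% are held out for testing, and the remainder is cut into five folds. The seed, the function name, and the interleaved fold assignment are assumptions of this example.

```python
import random


def split_students(student_ids, test_ratio=0.3, n_folds=5, seed=0):
    """Hold out a test set by student, then build (train, validation) folds."""
    ids = sorted(student_ids)
    random.Random(seed).shuffle(ids)  # fixed seed for a reproducible split
    n_test = round(len(ids) * test_ratio)
    test, train_val = ids[:n_test], ids[n_test:]
    folds = []
    for k in range(n_folds):
        val = train_val[k::n_folds]  # every n_folds-th student, offset k
        val_set = set(val)
        train = [s for s in train_val if s not in val_set]
        folds.append((train, val))
    return test, folds
```

Splitting by student rather than by interaction keeps all responses of one student on the same side of the split, so no student appears in both training and test data.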


There are many hyperparameters in the model, among which the
number of hidden units (h), the batch size (b), and the learning
rate (l) have a greater impact on the results. We ran experiments
varying these hyperparameters and found that performance was
best when l=0.09, h=16, and b=16.


4.3 Results 


4.3.1 Accuracy comparison 

We compare ERAKT with three baselines on the dataset.
The experimental results are shown in Table 2. Overall, ERAKT
achieves higher AUC and ACC and lower MAE and RMSE than the
baselines, which shows that it performs better than the other
models. In particular, ERAKT performs better than EERNN, a
state-of-the-art model that includes exercise content information.
In addition, the exercise-aware methods (i.e. EERNN and ERAKT)
outperform the models that ignore exercise content (i.e. DKT and
DKVMN). This result is consistent with the conclusion of
EKT [11].


Table 2. Accuracy comparison


Method | AUC    | ACC    | MAE    | RMSE
DKT    | 0.79   | 0.7301 | 0.2827 | 0.2233
DKVMN  | 0.8783 | 0.8072 | 0.2678 | 0.1346
EERNN  | 0.8836 | 0.8213 | 0.2495 | 0.131
ERAKT  | 0.9025 | 0.8407 | 0.2203 | 0.1278


4.3.2 The influence of multi-dimensional features of exercises on the prediction results

Because multi-dimensional features affect the performance of
knowledge tracing, we explore the impact of different features by
adding them to the model one at a time. The results are shown in
Table 3. ERAKT, which integrates multi-dimensional features,
performs better than the models that add only semantic features
or only concept features. It is worth mentioning that all of these
models perform better than the original DKT model.

Table 3. The influence of multi-dimensional features of
exercises on the prediction results


Features                   | Model            | Result
None                       | DKT              | 0.79
Semantics                  | DKT+Doc2Vec      | 0.8266
Semantics                  | DKT+Bert         | 0.8325
Concept                    | DKT+Concept      | 0.83
Multi-dimensional features | DKT+Bert+Concept | 0.8463


From a semantic point of view, no matter which method (Doc2Vec
or Bert) is adopted to obtain the exercise representation, the results
improve considerably once the representation is embedded, which
shows that the text content of the exercises does have a
non-negligible impact on the prediction result. From the perspective
of concepts, the addition of conceptual features, while a smaller
improvement, was still about 1% higher than the original knowledge
tracing method.


To reduce the influence of the dataset division ratio on the
results, 60%, 70%, 80% and 90% of the dataset are used in turn as
the training set, with the rest as the test set. As shown in
Figure 4, the result is consistent with the performance of the 70%
division, with both AUC and ACC increasing and RMSE and MAE
decreasing as the training set grows, which indicates that a larger
training set improves prediction performance.


4.3.3 The impact of association on the prediction results

After the experiments showed that integrating exercise
representation can improve the accuracy of prediction, we
designed a comparative experiment to explore the impact of
exercise association on prediction.


Table 4. The influence of exercise association on prediction
results


RNN    | LSTM   | GRU    | Bi-LSTM
0.8463 | 0.8543 | 0.8637 | 0.9065


After integrating the exercise representation, we chose different time
series modeling methods to model and predict students' responses.
A comparison of the commonly applied RNN [18], LSTM [19]
and GRU [20] with the Bi-LSTM used in our method is shown in
Table 4. RNN performs worst, LSTM and GRU are in the middle,
and Bi-LSTM performs best. This suggests that exercise
association has a strong influence on prediction performance.


Figure 5. AUC and loss convergence curves after
adding exercise association


We plotted the convergence of the four time series modeling
methods. The Bi-LSTM curve fluctuates slightly, while the
convergence curves of the other three methods are relatively
smooth.


5. CONCLUSION AND DISCUSSION 


In this paper, we propose a new context-aware KT method that
integrates mathematical exercise representation and the
association between exercises. With it, we can predict the
performance of a student on the exercises, thus helping teachers to
adjust their teaching plans dynamically.


Experiments have verified the effectiveness and reliability of our 
method. It can be seen from the experimental results that the 
prediction results are significantly improved after integrating multi- 
dimensional features and exercise association. 


As for the exercise semantic representation, Bert captures more
exercise information and performs better than Doc2Vec after
integration. This is because Bert processes sequential data
through the attention mechanism and supports parallel
computing, as validated in Bert [8]. Given sufficient
resources, Bert computes much faster than LSTM, and the
residual connections inside Bert can prevent the network
structure from being too complicated, which makes the model
perform better.


Regarding exercise association, Section 4.3.3 compares four
time series modeling methods. The Bi-LSTM used in ERAKT
performs best, and GRU performs best among the remaining three.
These differences stem from the internal structure of the models.
LSTM and GRU can retain long-term memory and mitigate the
vanishing gradient problem of RNN. Compared with LSTM, GRU can
reduce the risk of over-fitting. Therefore, among the three
unidirectional models, GRU has the best prediction performance
and RNN the worst, a conclusion consistent with LSTM [19] and
GRU [20]. However, none of the three can capture the association
between exercises.


The bidirectional structure of Bi-LSTM not only preserves the past 
information, but also the future one. Therefore, all the content can 
be effectively used to obtain the association between the exercises, 
which greatly improves the accuracy of prediction. 


6. FUTURE WORK 


At present, our research has achieved initial results, which can be
applied in the actual teaching environment to assist teachers in
teaching activities. Our future work will focus on two aspects:


(1) Explore a knowledge tracing model that integrates multiple
knowledge concepts and, at the same time, the sequence
association between them.


(2) Show the students' mastery of each knowledge concept
systematically, so as to improve the accuracy of prediction and
to facilitate the teaching work of teachers.


ABSTRACT


Gaining insight into course choices holds significant value for 
universities, especially those who aim for flexibility in their 
programs and wish to adapt quickly to changing demands 
of the job market. However, little emphasis has been put 
on utilizing the large amount of educational data to under- 
stand these course choices. Here, we use network analysis of 
the course selection of all students who enrolled in an un- 
dergraduate program in engineering, business or computer 
science at a Nordic university over a five year period. With 
these methods, we have explored student choices to iden- 
tify their distinct fields of interest. This was done by ap- 
plying community detection (CD) to a network of courses, 
where two courses were connected if a student had taken 
both. We compared our CD results to actual major special- 
izations within the computer science department and found 
strong similarities. Analysis with our proposed methodol- 
ogy can be used to offer more tailored education, which in 
turn allows students to follow their interests and adapt to 
the ever-changing career market. 


Keywords 
Community detection, higher education, Louvain method, 
bipartite networks, student network, course selection 


1. INTRODUCTION 


University students enter higher education with a plethora of
courses to choose from on their path to graduation. Gaining
insight into student choices holds significant value for
universities, especially those that aim for flexibility in their
programs and those that wish to adapt quickly to changing
demands of the job market. For example, the fast rise in
popularity of machine learning over the past years could impel
universities to make machine learning and related courses
readily available to their students. In contrast, subtler
trends could be identified directly from the students' choices
rather than from an obvious shift in the job market.


Numerous studies based on questionnaires and surveys have 
found that there are various components that contribute to a 
student’s course selection [2, 19, 20]. These are factors such 
as learning value, workload, age and academic performance 
[2]. Of these, the learning value of the course (which refers 
to factors such as intellectual level and interest in the topic) 
has been found to be the most influential factor in course 
selection. Course selection has also been a target in studies 
aiming to understand the gap between student mindsets and 
career demands [20]. Maringe [19] found that although in- 
trinsic interest was important, course choices depend mainly 
on future career goals. According to the author, universities 
may need to adapt their strategies to the idea that students’ 
course choices now seem to reflect their expectations of fu- 
ture employment rather than simply interests. Thus, uni- 
versities would benefit greatly from a deeper understanding 
of the path their students choose towards their degree. 


Educational data mining (EDM) has risen as a new field 
to answer these and other questions about students and 
their learning environment. It utilizes a variety of analytical 
methods and applies them to the vast amounts of data that 
has become available with increased digitization of adminis- 
trative educational information. For example, EDM meth- 
ods have already been applied to try to accurately predict 
college success using common classification algorithms with 
different feature sets [31]. They have also been used to ana- 
lyze student clicking behavior in online courses to determine 
students’ learning strategies and how those strategies can 
have an impact on their learning outcomes [1], as well as to 
predict student dropout [10]. One area of educational studies
that has not received much attention is student course
selection, despite its importance in understanding student
interests and preparing them for a future career [28].


In this paper, we aim to reveal patterns in course selection 
through EDM, providing a new data-driven technique based 
on institutional analytics to gain insight into students’ inter- 
ests that would otherwise be difficult to discern. This knowl- 
edge can then be used for monitoring student interests and 
ensuring that courses reflecting those interests are available. 
We examine whether network analysis applied to students’ 
course data, with a focus on community detection (CD), can 
effectively be used to identify university students’ fields of 
interest. To accomplish this, we use a weighted projection 
network in combination with CD to explore student course 
selection. We focus on communities of elective courses for 
different majors and compare them to some of the official 
specializations the university already has to offer. Deeper 
understanding of students’ choices is a stepping stone into 
allowing students to take more control over their studies, 
improve flexibility in the curricula, and facilitate students’ 
pursuit of their interests. 


2. RELATED WORK 


A promising method for EDM is to represent educational 
data as networks. In general, networks consist of nodes and 
edges, where the nodes can for example represent people, 
countries or cells, and edges represent connections between 
nodes based on factors such as spatial and temporal prox- 
imity or social connections such as friendships [12, 8]. Net- 
work analysis is used to look at internal characteristics and 
the connections and patterns of nodes and edges, providing 
the ability to better understand the fundamental structure 
of networks and the real-life phenomena they model [29]. 
Different methods can be used to analyze networks, for ex- 
ample by looking at structural characteristics such as cen- 
trality, which indicates the importance of any given node in 
the network by assuming nodes that are more central have 
higher control over information passed through the network 
[8]. Community detection is another common way of ana- 
lyzing networks which allows for the aggregation of different 
nodes into communities based on shared characteristics by 
identifying groups of nodes that have a high number of edges 
within themselves but fewer edges to other groups [12]. 


A common application of network analysis in educational 
settings is to understand social connections between stu- 
dents. This has helped reveal the negative effects of student 
interdependence in music education programs and its rela- 
tionship to the program’s friendship networks [26], as well as 
identifying how positive and negative friendship ties emerge 
[27]. Network analysis has also helped clarify the relation- 
ship between students’ social networks and the development 
of their academic success [6, 14]. Furthermore, looking at 
students’ social networks over time, close coequal commu- 
nities are typically formed early on [30], although in some 
cases, students enhance their performance due to social re- 
lations outside their assigned group [24]. 


Although students’ social networks have been studied, the 
exploration of students’ course choices through network anal- 
ysis has few precedents. Within the EDM field, Kardan et al. 
[16] used neural networks to predict course enrollment based 


on various factors such as course and instructor character- 
istics, and course difficulty. Further, Turnbull and O’Neale 
[28] used network analysis with CD and entropy measures to 
explore enrollment in STEM courses at the high school level. 
Among other results, they revealed that indigenous popu- 
lations showed higher levels of entropy in their enrollment 
patterns, which was moderated by adolescent socioeconomic 
status. Neither of these studies focused on detecting student 
interests from course selection patterns. 


3. METHODS 


3.1 Data Source 

Here, we use student and course data from Reykjavik Uni- 
versity (RU). The university offers many different areas of 
study, including preliminary studies, undergraduate and grad- 
uate degrees. Most RU students are undergraduate stu- 
dents, and the RU undergraduate programs also offer the 
most variety of courses. Generally, the majority of RU un- 
dergraduate programs’ courses are mandatory. These are 
the core courses each department deems essential to their
study program. The rest of the courses are either free choice 
electives, which can be any course in the university that the 
student qualifies for, or restricted elective courses from a 
selection tailored to the specific major. 


We sample data from all graduated RU students that en- 
rolled in the year 2014 or later and completed undergradu- 
ate programs in engineering, business, or computer science 
(CS) before 2021 (the total number of students was 1481). 
The university offers other programs as well, but we left 
them out since they have fewer students. The variables we 
look at include the student’s registration ID and registration 
semester, the name and semester of each course a student 
has completed, and whether they passed or failed the course. 
We also include each student’s department, major, and type 
of study (undergraduate, graduate, etc.). 


To anonymize the data, we remove anything that could iden- 
tify students, specifically their social security number and 
a numerical registration ID and give them a unique ran- 
dom sequence of numbers to replace both original numbers. 
For each student, we also remove any courses that they had 
de-registered from early in the semester. Further, for each 
major, courses taken by fewer than 5% of students are con- 
sidered outliers and removed. 
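The last filtering step can be sketched as follows, assuming the enrollments of one major are held in a dictionary that maps each student to the set of courses taken. The data layout and function name are hypothetical.

```python
from collections import Counter


def drop_rare_courses(enrollments, min_share=0.05):
    """Remove courses taken by fewer than `min_share` of a major's students."""
    n_students = len(enrollments)
    counts = Counter(c for courses in enrollments.values() for c in courses)
    keep = {c for c, n in counts.items() if n / n_students >= min_share}
    # Intersect each student's course set with the courses that survive.
    return {s: courses & keep for s, courses in enrollments.items()}
```

Applying the threshold per major, rather than over the whole university, keeps small but genuine elective clusters inside a major from being discarded.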


3.2 Network Analysis 


3.2.1 Bipartite networks 

We apply network analysis to the data to explore the fields 
of interests of RU students from a data driven perspective. 
Many real-world networks have a bipartite structure, where 
nodes belong to one of two groups or divisions and edges con- 
nect nodes of opposite groups without within-group edges 
[3]. In our bipartite network, the students make up one di- 
vision of the nodes, and courses the other. If a student has 
taken a course, an edge is created between the respective 
nodes. Since edges represent that a student has taken a 
course, there is no edge between two students nor between 
two courses (see Figure 1, left). 


Although bipartite networks give a more realistic and detailed
representation of the system, analyzing them can be complex.
Therefore we project the bipartite network onto its
unipartite counterpart (see Figure 1, right) [3]. This leaves
a network with one type of nodes that can be analyzed with
typical network methods. The resulting projected network
consists of nodes representing the courses and edges between
two nodes indicating that a student has taken both courses.
We assign weights to the edges to represent the number of
students who have taken both courses (see Figure 1).


Figure 1: From bipartite network to weighted projected network.
Left: a bipartite network, where the blue nodes represent
courses and the green nodes, students. Right: a unipartite
network has been obtained from the bipartite network,
where the nodes are courses and the edges have weights that
determine how many students have taken both courses.


A basic problem with projection of bipartite networks is that
a lot of important information in the original bipartite net- 
work is lost. Thus, we may end up connecting all courses 
in the network to each other —and form a clique— as long as 
they have at some point been taken by the same student, 
without taking into account how many students connected 
the two courses in the original bipartite network. Here, we 
address this by assigning weights to the edges in the pro- 
jected network [3], where the weights represent the number 
of students who have taken both courses (see Figure 1). 
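The weighted projection can be illustrated without any graph library. The sketch below assumes a dictionary mapping each student to a set of course names (a hypothetical layout) and counts, for every pair of courses, how many students took both.

```python
from collections import Counter
from itertools import combinations


def project_courses(enrollments):
    """Return {(course_a, course_b): number of students who took both}."""
    weights = Counter()
    for courses in enrollments.values():
        # Sorting gives each unordered pair a single canonical key.
        for a, b in combinations(sorted(courses), 2):
            weights[(a, b)] += 1
    return weights


example = {"s1": {"A", "B", "C"}, "s2": {"A", "B"}, "s3": {"C"}}
edge_weights = project_courses(example)
```

NetworkX provides `bipartite.weighted_projected_graph` for the same kind of projection on a graph object; the plain-Python version is shown here only to make the counting explicit.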


3.2.2 Community detection

Building on the weighted projected networks, we use CD 
with the objective of inferring fields of interests in students’ 
course selection. To identify fields of interest, we want to 
emphasise electives. However, in our data set, the informa- 
tion on which courses are mandatory and which are electives 
is incomplete. Mandatory courses along with very popular 
electives appear in the network as hubs, which usually occur 
in real-world networks as nodes with much higher degrees 
and edge weights than the other nodes [4]. We therefore de- 
fine hubs in a data driven way, where a node is a hub if its 
total edge weight is at least one standard deviation above 
the mean edge weight of all nodes. We remove hubs from 
the network based on this definition. 
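A possible reading of this hub rule is sketched below. It interprets a node's "total edge weight" as the sum of the weights of its incident edges and compares it with the mean plus one standard deviation of that quantity over all nodes. The use of the population standard deviation and the edge-dictionary layout are assumptions of this example.

```python
from collections import defaultdict
from statistics import mean, pstdev


def remove_hubs(weights):
    """Drop nodes whose total edge weight is >= mean + 1 std; return (edges, hubs)."""
    strength = defaultdict(float)
    for (a, b), w in weights.items():
        strength[a] += w
        strength[b] += w
    values = list(strength.values())
    cutoff = mean(values) + pstdev(values)  # population std is an assumption
    hubs = {node for node, s in strength.items() if s >= cutoff}
    kept = {edge: w for edge, w in weights.items() if not (set(edge) & hubs)}
    return kept, hubs
```

In a small network where one core course co-occurs heavily with every other course, only that course exceeds the cutoff, and the remaining edges describe elective co-enrollment.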


Next, we apply the Louvain algorithm for CD [7]. This 
is an established, computationally efficient, fast converging 
method that produces accurate communities with high net- 
work modularity, especially in smaller networks [7, 17, 12, 
23]. It has been successfully applied to identify communities 
of intrinsic brain systems [9], and to help create friend lists 
for Facebook users [18]. Modularity is a measure of edge
density within a partition (or proposed community) as op- 
posed to edge density between partitions, whereby a higher 
modularity suggests a more cohesive community, separate 
from the others in the network. Importantly for our analysis 
using weighted projected networks, the Louvain algorithm 


can be used both with weighted and unweighted edges. The 
method starts by assigning each node to its own community 
[7], as seen in Figure 2. It then iterates over all nodes of 
the network and assesses the modularity gain obtained by 
assigning the node to the same community as each of its 
neighboring nodes. Next, the node is assigned to the com- 
munity that yields the largest positive modularity gain, or 
maintains its current community if no positive modularity 
gain can be achieved by switching communities. This way, 
each new community assignment brings us closer to optimal 
modularity. The nodes are usually considered multiple times 
and the final iteration is determined when no switch leads 
to a gain of modularity, resulting in optimal partitioning
of the network. This optimal partitioning is a local maximum,
as the result is influenced by which node is considered
first and the order in which nodes are visited. For some 
communities, we re-apply the Louvain algorithm for more 
detailed results, while using the inter/intra weight density 
ratio described below to ensure our communities maintain 
high quality. 
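To make the optimized quantity concrete, the sketch below computes the standard (Newman) modularity of a given partition of a weighted, undirected network. It evaluates a partition only and does not implement the Louvain search itself; the edge-dictionary layout is an assumption of this example.

```python
from collections import defaultdict


def modularity(weights, community_of):
    """Modularity of a partition; weights maps (u, v) -> w, community_of maps node -> label."""
    m = sum(weights.values())  # total edge weight of the network
    internal = defaultdict(float)  # edge weight inside each community
    degree = defaultdict(float)  # summed node strength of each community
    for (u, v), w in weights.items():
        degree[community_of[u]] += w
        degree[community_of[v]] += w
        if community_of[u] == community_of[v]:
            internal[community_of[u]] += w
    return sum(internal[c] / m - (degree[c] / (2 * m)) ** 2 for c in degree)
```

For two disconnected triangles, assigning each triangle to its own community gives a modularity of 0.5, whereas placing all six nodes in one community gives 0, which is why a modularity-driven move would keep them apart.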


Figure 2: The Louvain algorithm. The first step of the algorithm
is to assign each node to its own community. In step
2, a random node is selected to start the community aggregation
process. All nodes are visited and allocated to the
community of one of their neighbors or maintain their current
community, depending on which choice gives the highest
gain in modularity for the network. When no more modularity
gain is possible in the network, step 3 is to aggregate
the nodes of each community into new super-nodes. Here,
the numbers given show the sum of node edges within and
between super-nodes. Steps 2 and 3 are then repeated until
modularity has been optimized, as seen in step 4.


3.2.3 Community validation

Although the objective of CD is to split nodes into groups 
based on their connections within versus outside the group, 
there are many more aspects to consider [12]. One impor- 
tant factor is intra-cluster density, which refers to how many 
edges there are within the community as a ratio of how many 
possible edges there could be if all nodes of the community 
were connected to each other. This is contrasted by inter-cluster
density, which shows how many edges go from the
community to the rest of the network as a proportion of the 
maximum possible connections. High intra-cluster density 
may suggest a strong and cohesive community, however if 
it coincides with equally high inter-cluster density, it may 
simply suggest a strong and cohesive overall network. 


To assess the quality of our communities, we use intra and 
inter weight density [13]. This is the same as intra and inter 
edge density previously described, but now accounting for 
weighted edges. The two are defined as follows: 


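$WD_{inter} = \frac{w_c^{ext}}{\bar{w}\, n_c (n - n_c)}, \qquad WD_{intra} = \frac{w_c^{int}}{\bar{w}\, n_c (n_c - 1)/2}$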


where $w_c^{ext}$ is the sum of edge weights connecting the community
to the rest of the network, or external community
edges. We divide this by the estimated total edge weight
of the network, which shows the edge weight going from
the community to the rest of the network as a proportion
of the maximum possible edge weight (assuming that the
average edge weight of the fully connected network were
unchanged). Here, $\bar{w}$ is the average edge weight of the network,
n is the total number of nodes in the network and $n_c$ is
the total number of nodes within the community. Similarly,
$w_c^{int}$ refers to the sum of edge weights inside the community,
which is divided by the expected total edge weight within
the community. We then use a ratio of these two measures
(WDinter / WDintra) to obtain the community strength on
a scale where 0 is the strongest value, indicating a commu- 
nity that is disconnected from the rest of the network, and 
a value of 1 indicates a community equally connected within 
itself as to the rest of the network. We call this measure 
density ratio and use it not only to determine the commu- 
nity strength, but also to ensure that as we create smaller 
and more focused communities, community strength is not 
compromised. 
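The density ratio can be sketched as below. The normalizations are assumptions inferred from the verbal description: external weight is divided by the average edge weight times the number of possible community-to-outside pairs, and internal weight by the average edge weight times the number of possible within-community pairs.

```python
def density_ratio(weights, community, n_nodes):
    """Inter/intra weight density ratio of one community (0 = fully separated)."""
    w_bar = sum(weights.values()) / len(weights)  # average edge weight
    n_c = len(community)
    w_int = sum(w for (u, v), w in weights.items()
                if u in community and v in community)
    w_ext = sum(w for (u, v), w in weights.items()
                if (u in community) != (v in community))
    # Assumed denominators: possible internal pairs and possible external pairs.
    wd_intra = w_int / (w_bar * n_c * (n_c - 1) / 2)
    wd_inter = w_ext / (w_bar * n_c * (n_nodes - n_c))
    return wd_inter / wd_intra
```

Under these assumptions the average edge weight cancels in the ratio, so the measure depends only on how the community's edge weight is split between internal and external pairs.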


3.2.4 Comparing communities and specializations 
To further assess the real-world application of the commu- 
nities we detect, we compare them to specializations within 
RU’s Computer Science (CS) department, described in Table 
4 in the Appendix. Any student who pursues an undergrad- 
uate degree in CS at RU has the option to graduate with 
a specialization in a certain field. The specializations do 
not need to be declared at enrollment but any student who 
fulfills the requirements can choose to add this to their grad- 
uation certificate. The specializations offered are Artificial 
Intelligence, Law, Web- and User Experience (UX) Design, 
Sports Science, Game Development and FinTech. Each spe- 
cialization has 2-4 core courses that students need to com- 
plete, along with 1-3 courses from a pool of specialization- 
specific electives. Our approach to defining fields of interest 
is purely through data driven CD. Comparing the detected 
communities with these specializations helps validate the re- 
sults and perhaps provide a reference for the creation of 
new specializations. We compare both the courses in each 
community and specialization, and the number of students 
belonging to a specialization versus those belonging to the 
corresponding community. We define a student as belong- 
ing to a community if they have taken at least 50% of the 
community’s courses, with a special case of two course com- 
munities where both courses have to be completed. 
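The membership rule translates directly into a small predicate. The function name and the use of Python sets are choices made for this illustration.

```python
def belongs_to_community(student_courses, community_courses):
    """True if the student took at least 50% of the community's courses.

    Two-course communities are a special case: both courses are required.
    """
    community = set(community_courses)
    taken = len(set(student_courses) & community)
    if len(community) == 2:
        return taken == 2
    return taken >= 0.5 * len(community)
```

Without the special case, a single course would already satisfy the 50% rule for a two-course community, which would make membership nearly meaningless for such small groups.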


3.3 Tools

Aside from the initial retrieval and anonymization of data, 
which we do using C# and SQL, all code for the data anal- 
ysis was written in Python 3.9. We use multiple Python 
libraries to help with the data analysis. For our network 
analysis, we mainly utilize the NetworkX library [21]. For 
more general data manipulation, we use the pandas library 
[22]. We used Gephi for the majority of our network visual- 
ization [5], along with the Matplotlib library [15]. 


4. RESULTS 


4.1 Communities that Reflect Interest Fields 


We conducted CD with the Louvain algorithm on three un- 
dergraduate majors: engineering, business, and computer 
science. These majors have quite different program struc- 
tures and emphases on electives, with the business major 
having the lowest number of elective courses allowed in their 
study plan (four electives). This is followed by the CS major 
with 11 electives and finally engineering, which offers only 
four free electives but nine guided electives (that is, nine
electives must be specific to engineering), depending on the 
chosen engineering specialization. 


We first look at the communities for the engineering de- 
partment, see Figure 4 and Table 2 in the Appendix, which 
after hub removal consisted of 81 courses taken by 496 un- 
dergraduate students. Reykjavik University offers various 
undergraduate engineering programs such as biomedical en- 
gineering, financial engineering, and mechatronics engineer- 
ing. These engineering majors all fulfill the same core courses 
in addition to some additional major-specific requirements. 
These majors are quite structured and offer few free elec- 
tive courses. Due to the similarity in the core courses of 
these programs, we group them together into a more gen- 
eral engineering major. This means that the hub removal 
method removed general core engineering courses but left
most specialty-specific courses in the network. The result- 
ing engineering network has 81 course nodes and 2614 edges. 
The weighted average inter/intra weight density ratio is 0.24. 
This suggests that hub removal was effective and the average
community is relatively strong. We detected eight communities
in total, as seen in Table 2. Note
that communities are named after common characteristics 
between the majority of the courses, even though rarely 
all courses of a community fall within that definition. As 
expected, these communities mainly correspond to the of- 
ficial engineering majors such as financial, biomedical, and 
electrical engineering, with electrical engineering being our 
strongest community (WD_inter / WD_intra = 0.05). However,
we also observe unrelated communities that supersede
the official majors, such as a community of applied design 
and another for business related courses not mandatory in 
the financial engineering major. Courses in these commu- 
nities are commonly taken together by engineering under- 
graduates, suggesting a common interest not credited to the 
specialized majors. 
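
To make the strength measure concrete, the following is a minimal sketch of an inter/intra weight density ratio for a single community. The normalization used here (total edge weight divided by the number of possible node pairs) is an assumed definition for illustration, and the function name, edge format, and toy network are hypothetical rather than taken from our implementation.

```python
def density_ratio(edges, community, n_nodes):
    """Return WD_inter / WD_intra for one community (lower means stronger).

    edges: iterable of (u, v, weight) for an undirected course network.
    community: set of course nodes; n_nodes: number of nodes in the network.
    The normalization below is an assumed definition, not taken from the paper.
    """
    intra_weight = inter_weight = 0.0
    for u, v, weight in edges:
        members = (u in community) + (v in community)
        if members == 2:
            intra_weight += weight      # both endpoints inside the community
        elif members == 1:
            inter_weight += weight      # edge leaves the community
    size = len(community)
    intra_pairs = size * (size - 1) / 2     # possible pairs inside
    inter_pairs = size * (n_nodes - size)   # possible pairs across the boundary
    if intra_pairs == 0 or inter_pairs == 0 or intra_weight == 0:
        raise ValueError("ratio undefined for this community")
    return (inter_weight / inter_pairs) / (intra_weight / intra_pairs)


# Hypothetical toy network with five courses and two visible groups.
toy_edges = [("A", "B", 10), ("A", "C", 8), ("B", "C", 6),
             ("C", "D", 1), ("D", "E", 9)]
ratio_abc = density_ratio(toy_edges, {"A", "B", "C"}, n_nodes=5)
print(round(ratio_abc, 3))
```

A ratio close to zero indicates that almost all of the weight stays inside the community, which is the sense in which the electrical engineering community is described as strong.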


There are 334 undergraduate students in our data set who 
majored in business. For this major, the network consists 
of 36 course nodes and 504 edges, with a weighted average 
WD_inter / WD_intra of 0.25, again suggesting strong communities
(see Figure 5 in the Appendix). This is not unexpected,
as the business major only allows electives in the final year, 




giving business students less room to pursue distinct inter- 
ests outside their core subjects. Table 3 in the Appendix 
shows the five communities identified within the business 
major. The strongest community is that of popular courses, 


Figure 3: The network with communities for the BSc program 
in CS. 


Table 1: Community detection results for BSc in CS. 


Community                 No. courses   Density ratio
UX and Business           15            0.25
Engineering               13            0.17
Web and Software          10            0.20
Artificial Intelligence   7             0.39
Deprecated Courses I      6             0.08
Game Development          4             0.10
Deprecated Courses II     4             0.23
Weighted average                        0.21


which includes the most common electives in the business 
majors along with a handful of newer core courses (WD_inter
/ WD_intra = 0.07). These core courses were recently added
to the study plan, meaning that they were only mandatory 
for a minority of the students in our data set. This is why 
these core courses were not identified as hubs and removed 
during hub removal. The business major also contains the 
weakest community of all the majors, management (WD_inter
/ WD_intra = 0.71). As the name suggests, this community
includes various courses on management, such as service
management and project management. The high inter/intra
weight density ratio is interesting, as intuitively these
courses would seem very connected. This is why measuring 
community strength is vital in determining the importance 
of the detected communities. The other business communi- 
ties are both strong and reflect more specific interests, sug- 
gesting that there are students of the business major who 
actively seek distinct interests despite the program having 
no official specializations.


The last major we explore is CS,
with 377 students. Computer science has the least struc- 
tured study plan of the three majors, as it puts a higher 
emphasis on unstructured flexibility and free electives. The 
CS course network consists of 59 nodes and 1492 edges. The 
communities (see Figure 3) are the strongest we found, with
a weighted average WD_inter / WD_intra of 0.21. Most, but


not all, detected communities seem to reflect an interest in 
a CS sub-field. However, the strongest community we have 
discovered was Deprecated Courses I (see Table 1), which 
represents older courses that may have been core courses 
at some point but are no longer being offered (WD_inter /
WD_intra = 0.08). We conjecture that this community exists
as some older students re-register to complete their under- 
graduate degree, for example after previously completing a 
CS diploma or taking a longer study break. It is therefore 
very intuitive that this specific sub-field is combined into 
our strongest community. Aside from communities based 
on deprecated courses, the other communities suggest that 
there is in fact an underlying pattern of interest fields present 
in the CS major, as observed for the other majors explored 
here. 
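
As a worked check, the weighted average reported for the CS major can be reproduced from the per-community values in Table 1. The sketch below assumes that each community's ratio is weighted by its number of courses; that weighting is an assumption, although it also reproduces the averages reported for engineering (0.24) and business (0.25).

```python
# (community, number of courses, density ratio) as reported in Table 1
table_1 = [
    ("UX and Business", 15, 0.25),
    ("Engineering", 13, 0.17),
    ("Web and Software", 10, 0.20),
    ("Artificial Intelligence", 7, 0.39),
    ("Deprecated Courses I", 6, 0.08),
    ("Game Development", 4, 0.10),
    ("Deprecated Courses II", 4, 0.23),
]


def weighted_average_ratio(rows):
    """Average the density ratios, weighting each community by its size."""
    total_courses = sum(n_courses for _, n_courses, _ in rows)
    weighted_sum = sum(n_courses * ratio for _, n_courses, ratio in rows)
    return weighted_sum / total_courses


cs_average = weighted_average_ratio(table_1)
print(round(cs_average, 2))  # 0.21
```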


4.2 RU Communities and Specializations 

As a final validation of the communities we have detected 
for the CS undergraduate major, we now cross-reference our 
results with the actual specializations available for CS stu- 
dents. Unlike the other majors, CS offers a number of spe- 
cializations meant to aid students in pursuing a specific sub- 
field (see Table 4 in the Appendix for a short description of 
each specialization). However, only a subset of students
choose to do this. Of the students who graduated between 
2014 and 2020, inclusive, only 9.5% fulfilled the requirements 
for a specialization. A further 13% partially fulfilled a spe- 
cialization’s requirements, by completing at least 60% of the 
specialization’s core courses and 60% of the restricted elec- 
tives needed. 
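
The full and partial fulfillment rules can be sketched as a small check per student. This is an illustrative reading of the 60% rule, not the exact procedure we ran: the function name, the data layout, and the elective names are hypothetical, while the two AI core courses and the three required electives follow the description of the AI specialization.

```python
def fulfillment(completed, core, electives, electives_required, partial_pct=60):
    """Classify a student as 'full', 'partial', or 'none' for one specialization.

    completed, core, electives: sets of course names.
    electives_required: number of restricted electives the specialization needs.
    partial_pct: share (in percent) of core courses and of required electives
                 that counts as partial fulfillment.
    """
    core_done = len(core & completed)
    # Electives beyond the required number do not add anything.
    electives_done = min(len(electives & completed), electives_required)
    if core_done == len(core) and electives_done == electives_required:
        return "full"
    core_ok = 100 * core_done >= partial_pct * len(core)
    electives_ok = 100 * electives_done >= partial_pct * electives_required
    return "partial" if core_ok and electives_ok else "none"


ai_core = {"Artificial Intelligence", "Machine Learning"}
# Hypothetical elective names, used only for illustration.
ai_electives = {"Elective A", "Elective B", "Elective C", "Elective D"}

student_full = ai_core | {"Elective A", "Elective B", "Elective C"}
student_partial = ai_core | {"Elective A", "Elective B"}
student_none = {"Machine Learning", "Elective A", "Elective B", "Elective C"}

for student in (student_full, student_partial, student_none):
    print(fulfillment(student, ai_core, ai_electives, electives_required=3))
```

Under these assumptions, a student with both core courses and two of the three required electives counts as partial (two of three is above 60%), whereas a student missing one of the two core courses does not.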


Comparing the specializations and the communities we de- 
tected (shown in Table 1), we find interesting similarities. 
Our CD reveals that some communities are consistent with 
the specializations, but there is no absolute match. For the 
AI specialization (taken by 11 students, or 29% of those 
who graduated with a specialization), there is a partially 
corresponding community that includes both of the AI core 
courses (Artificial Intelligence and Machine Learning). There 
are 28 students who belong to this community, making it 
more popular than the official AI specialization. Although 
this community does not include any of the other courses 
from the specialization, it does include more theoretical and 
academically demanding courses than most other commu- 
nities, suggesting a reflection of interest in theoretical com- 
puter science in general rather than specifically AI. 


To fulfill the official AI specialization requirement, students 
must complete two core courses and three or more courses 
from a list of specialization-specific electives. However, in 
our data set most of these other electives were removed dur- 
ing either data cleaning (where we removed courses taken 
by fewer than 5% of students) or during hub removal and 
are therefore not part of any community. Interestingly, two 
of the remaining electives overlap between the AI special- 
ization and that of Game Development. Both these courses 
have been sorted by our algorithm into a community that re- 
flects Game Development much more strongly than AI, with 
67 students. This is intriguing, as we know that students are 
much more likely to specialize in Artificial Intelligence than 
Game Development (only one student in our data set fulfills 
the requirements for Game Development), but this indicates 
that the gaming sub-field of Artificial Intelligence may be the 
biggest area of interest for these students.


The final specialization for which we discovered a similar 
community is Web and UX design, which was by far the most 
popular specialization taken by students (with 23 students, 
or 64% of all students who had a specialization). While this 
specialization encompasses both web programming and user 
experience, the corresponding community of Web and Soft- 
ware Development (with 84 students) is much more web than 
UX specific. Most of the UX related courses belong to a sep- 
arate community of 21 students that unites UX and business 
rather than UX and web design. This suggests that divid- 
ing the Web and UX design specialization into two distinct 
specializations (Web design and UX design) might be more 
appealing to students. Interestingly, the remaining four offi- 
cial specializations have no corresponding community in our 
results. This was to be expected, as these remaining spe- 
cializations are very rarely pursued by students. That is, 
the communities we have detected are able to represent the 
specializations that students are actually choosing, but do
not reflect other specializations. This is exactly what we 
expect of CD, with the added bonus of identifying fields of 
interest that may not have been previously considered.


5. DISCUSSION AND CONCLUSION 

With this project, we aimed to find whether CD could be 
used to effectively identify students’ fields of interest at RU. 
To maintain the scope of the results, we have presented only 
the findings for undergraduate majors in engineering, busi- 
ness, and CS. Our resulting communities vary slightly in 
strength and size, yet almost all of them contain courses 
of a general theme that seem to indicate that they do in 
fact reflect fields of interest. This builds on the results 
found by Turnbull and O’Neale [28], who performed CD on 
a similar school course network, but without hub removal. 
This resulted in much more general course communities that 
demonstrated important but slight differences in the overall
majors. In focusing on fields of interest, removing the
hubs has allowed us to increase the granularity of the result- 
ing communities while still maintaining community strength 
and cohesion. However, one of the commonalities between 
these majors is that the largest community detected usually 
included the major’s most popular courses, be that electives 
or new mandatory courses our hub removal does not con- 
sider. As Fortunato [13] suggested, using the inter/intra 
weight density, we were able to evaluate the quality of the 
communities that were detected with the Louvain algorithm. 


The communities we have discovered encapsulate various 
distinct areas of interest for the different undergraduate
majors RU has to offer. Additionally, for the CS depart- 
ment, we have verified that the detected communities also 
reflect the main areas students choose to specialize in, which 
further validates our findings. To our knowledge, applying 
CD in this way and for this purpose has not been done be- 
fore. This provides an exciting new tool for universities to 
better understand their students’ aspirations. 


In improving knowledge of student course selection, we pro- 
vide academic institutions with more tools to increase study 
flexibility for their students. This knowledge can then be 
used to decide which courses the university wants to of- 
fer. This knowledge is also useful for academic counselors 


when helping students to discover their own field of inter- 
est. Based on previous studies, we assume that interest is 
the main motivation behind course choices [2, 19]. However, 
these communities may be based on other factors. Exam- 
ining the characteristics of courses that make up different 
communities might reveal other factors that contribute to 
course selection, such as course difficulty, grading, teacher 
characteristics, and more [25, 2, 19]. 


Although we were able to successfully apply network analy- 
sis to our student and course data, there were a few setbacks. 
One drawback in our analysis is the fact that although RU’s 
administrative data has largely been digitized, this has not 
always been done in the most structured and data-mining 
friendly way. For example, all information on specializa- 
tions was retrieved directly from RU’s website and format- 
ted manually, as this information is not stored in the univer- 
sity’s data warehouse. Reliable information on the manda- 
tory courses of each major was also not available, which was 
why we decided to use data-driven hub removal. Improving
data availability, centrality and consistency is currently a 
priority at RU, but should also be considered by other uni- 
versities wanting to take full advantage of EDM methods. 


Our findings show that network analysis with CD is a useful 
tool in understanding students’ course selection. The course 
choice patterns found here can still be explored further. For 
example, the current results are based on data from stu- 
dents who enrolled in the same program at different times. 
Thus any small changes in the program structure between 
years can introduce noise in the data. Looking at individ- 
ual registration years, perhaps including a larger university 
with more students, could give clearer results. Further, it 
would be interesting to repeat the same analysis over sepa- 
rate periods to discover changes in interest fields over time. 
Finally, it was out of the scope of the current paper to an- 
alyze trends based on more detailed characteristics such as 
gender, age or grades. Augmenting the communities with 
these factors could for instance provide a tool to identify 
differences in choices made by students who graduate suc- 
cessfully and those who struggle more with their studies, 
perhaps yielding an opportunity for early intervention. 


Educational data mining is an exciting new field with the 
potential to greatly influence educational institutions and 
their students going forward [11]. This project aimed to re- 
veal how network analysis could be used to enhance student 
course selection by improved understanding of students’ aca- 
demic interests. Our analysis has successfully led to mean- 
ingful results that could easily be replicated by most inter- 
ested universities with digitized information. Coupling this 
increased understanding of student interests with added aca- 
demic support gives universities the tools to raise flexibility 
within majors while maintaining educational quality. Hope- 
fully, this and other research in the field can be used to offer 
more tailored and student-led education, which in turn allows
students to follow their interests and easily adapt to
the ever-changing demands of the job market. 
Figure 4: The network with communities for the BSc program
in engineering.


Figure 5: The network with communities for the BSc program
in business.
Table 2: Community detection results for BSc in engineering.


Community                   No. courses   Density ratio
Comp Sci and Mechatronics   25            0.37
Engineering Management      15            0.16
Finances and Management     10            0.25
Biomedical Engineering      10            0.10
Financial Engineering       9             0.21
Electrical Engineering      5             0.05
Applied Design              4             0.29
Business                    3             0.32
Weighted average                          0.24


Table 3: Community detection results for BSc in business. 


Community          No. courses   WD_inter / WD_intra
Popular Courses    15            0.07
Management         6             0.71
Finance            6             0.29
Operations         5             0.10
Asset Management   4             0.36
Weighted average                 0.25


Table 4: Official specializations in the CS program.


Name                     Description

Artificial intelligence  Core courses reflecting an interest in AI and
                         machine learning, with electives focused on game
                         development and analytical skills.

Game design              Core courses encompass game development in general,
                         computer graphics and game engine architecture.
                         Electives reflect more general programming skills
                         and AI.

FinTech                  Both core courses and electives focus on the
                         financial part of the Financial Technology
                         discipline, as all students taking these courses
                         gain software development skills from the core
                         courses of the CS major.

Web and UX design        As the name suggests, most courses for this
                         specialization directly relate to either web
                         programming (such as the courses Web Programming II
                         and Web Services) or user experience (User-Focused
                         Software Development, Human-Computer Interaction).

Psychology               Core courses in psychology that emphasize cognitive
                         processing and research methodology. Any other
                         psychology courses can then be chosen as electives.

Law                      General law courses with some emphasis on
                         intellectual property rights and negotiations.

Sports science           General sports science courses.
ABSTRACT


The current study explores the ability to predict argumentative 
claims in structurally-annotated student essays to gain insights into 
the role of argumentation structure in the quality of persuasive 
writing. Our annotation scheme specified six types of 
argumentative components based on the well-established
Toulmin model of argumentation. We developed feature sets
consisting of word count, frequency data of key n-grams,
positionality data, and other lexical, syntactic, and semantic features
at both the sentential and suprasentential levels. The
suprasentential Random Forest model based on frequency and 
positionality features yielded the best results, reporting an accuracy 
of 0.87 and kappa of 0.73. This model will be included in an online 
writing assessment tool to generate feedback for student writers. 


Keywords 


Argumentation, Claim identification, Argumentative writing 


1. INTRODUCTION 


Written argumentation has been an important area of study for 
many years [43, 45]. Recent developments in natural language 
processing (NLP) have introduced new approaches to 
automatically detect the discourse structure of argumentative 
essays [7, 8, 9, 10, 26, 33, 34, 38, 44, 45]. These studies have shown 
that content (i.e., lexical, syntactic, and semantic) and structural 
features (i.e., the positionality of tokens, sentences, and paragraphs) 
are effective in detecting discourse elements. 


Researchers have used fixed discourse markers at the word and 
phrase levels [5, 12, 18, 42] as indicators of different argumentative 
structures. This approach has been applied in discourse [17, 19, 22] 
and NLP analyses [7, 8, 9, 47]. These studies generally identify 
relations between discourse markers and their functions according 
to the conceptual framework of conjunctive relations [36]. For 
instance, phrases such as in summary and in conclusion are 
associated with the discourse function of ‘summarizing’ an 
argument. Such discourse markers have been used to identify the 
attributes of the structural elements in argumentative essays [8, 9,
36, 37]. For example, Burstein et al. [7] annotated structural 
information of argumentative essays collected from TOEFL, GRE, 
and GMAT. Discourse markers indicating each of the 
argumentative functions were extracted automatically from the 
essays. A word list that contained the discourse markers and their 
corresponding argumentative functions was formed and used to 
automatically predict instances of argumentation. Similarly, Palau 
and Moens [37] implemented a context-free rule-based approach
for argumentation mining in legal texts. They focused on and 
developed rules based on common expressions encountered in the 
legal documents such as for these reasons, in light of all the 
material, and discourse markers, such as however or furthermore. 
Using this approach, they obtained an accuracy of approximately 0.6
in detecting the argumentation structures, while maintaining an
F1-measure of around 0.7 for recognizing premises and conclusions in
legal texts. 


In more recent work, Stab and Gurevych [44, 45] provided publicly 
available corpora comprising students’ argumentative essays and 
annotation guidelines for parsing argumentations. In these corpora, 
the essays were annotated based on three major argumentative 
categories: major claim, claim, and premise. They then used lexical, 
structural, syntactic, discourse markers, and other features to 
identify argument components. The lexical features consisted of 
binary lemmatized unigrams and the 2,000 most frequent bigrams 
extracted from a training corpus. The structural features captured 
the position of components in the text and the number of tokens in 
those components. Discourse markers included logical connectives 
such as therefore, thus, or consequently and the use of first-person 
pronouns (which indicated major claims). The syntactic features 
included part of speech (POS) distributions, number of sub-clauses, 
and the tense of the main verb. Using support vector machine 
models, Stab and Gurevych [45] found that a combination of all 
these features yielded an F1 score of 0.77. Khatib et al. [3]
employed a classifier for argumentativeness based on the research 
in [37, 44, 45], and evaluated its performance on student essays 
from [44]. Khatib et al. used n-grams, syntax, discourse markers, and
part of speech (POS) features in an argument. Their results
indicated that a combination of n-grams, POS tags, and syntax
features yielded accuracy of 0.64, 0.62, and 0.59 on classifying
arguments in students' essays, while the full feature set model
yielded an accuracy of 0.67. However, only unigrams through
trigrams were included in the POS feature.


Though the use of discourse markers, n-grams and POS as 
indicators has been common in the detection of argumentative 
elements, few studies have examined whether using longer 
sequences of n-grams (beyond tri-grams) and their POS tags would
contribute to identifying argumentative features. We also note that 
other types of linguistic features related to lexical, structural, 
cohesion, and affective features were not tested in previous studies 
[e.g., 37, 45]. Therefore, this study explores a wider range of NLP 
features, and examines their contribution to model accuracy. We do 
so specifically on a corpus of student essays annotated on 
theoretically-aligned classifications of argumentative elements 
expected in academic settings. This is in contrast to most of the 
existing corpora in English that are annotated for argumentative 
structures and are from the domains of law [e.g., 4, 37], biology and 
medicine [e.g., 20], and user-generated content, e.g., Wikipedia 
articles or debate data, see [1, 2, 27, 41]. Few corpora [44, 45] have 
been developed for argumentation mining in educational
settings. In this study, we build on Stab and Gurevych’s work [44, 
45] by developing a structurally annotated corpus based on the 
Toulmin model [46] of argumentation that better reflects the 
structure of student essays. Our objective is for the corpus — and the 
models of argumentation developed from the corpus — to contribute 
to the development of writing assessment tools that can deliver 
useful feedback to student writers. 


Thus, in this study, we introduce a new corpus of essays annotated 
for argumentative features. We then develop NLP approaches to 
automatically identify claims in structurally annotated 
argumentative essays using length, frequency data of significant n- 
grams and POS tags, positionality data, and a wide range of lexical, 
syntactic, cohesion, and cognitive features extracted from a number 
of NLP tools [14, 15, 25, 24]. We compared the identification 
accuracy of multiple machine learning classifiers using different 
types of derived features at different levels (based on sentences or 
argumentative elements that are suprasentential). Our goal is to 
better understand whether and how the selection of the linguistic 
features, the level of units for identification (both sentential and 
suprasentential), and the choice of classifiers influence the 
accuracy of claim identification. Finally, we conduct an error 
analysis of the best model and discuss the distribution of the 
misclassification instances and related features. This study is 
guided by the following research question: 


To what extent do length, frequency of significant n-grams (and 
POS tags of n-grams), lexical, syntactic, and semantic features, and 
positionality predict argumentative claims in essays? 


2. METHOD 
2.1 Corpus 


For the analysis, we annotated 314 persuasive essays. The essays 
were written by undergraduate students (N = 314) at a public
university in the United States who were native speakers of English. 
Two prompts from retired test banks of the Scholastic Assessment 
Test (SAT) were used. The prompts were counterbalanced such that 
half of the students wrote about ‘originality and uniqueness’ while 
the other half wrote about ‘heroes versus celebrities.’ All essays 
had been scored previously by expert raters for holistic writing 
quality. For each essay, we extracted the average number of letters 
per word, the number of words, number of types, type-token ratio, 
average number of words per sentence, the number of sentences 


¹ https://www.tagtog.net


and paragraphs. Descriptive statistics for these items of the 314 
essays are reported in Table 1. 


Table 1. Descriptive statistics of the persuasive essays 


                       Mean     SD       Median   Range
Letters per word       4.52     0.24     4.51     1.50
Number of words        354.46   118.20   344.00   680.00
Number of types        178.17   50.01    173.00   279.00
Type-token ratio       0.52     0.07     0.52     0.41
Words per sentence     17.74    4.30     17.06    35.08
Number of sentences    20.65    7.42     20.00    48.00
Number of paragraphs   3.86     1.38     4.00     7.00


2.2 Annotation of argumentative elements 

The essays were structurally annotated by normed raters for 
argumentative elements. We used the modified Toulmin models 
[46] presented in [35] and [30] as the basis for the annotation rubric. 
The rubric adopted six elements (i.e., micro-categories) as the 
building blocks of the argumentation framework: Final Claim, 
Primary Claim, Counterclaim, Rebuttal, Data, and Concluding 
Summary. The definitions of each of these elements are presented 
in Appendix A. 


The essays were coded by two annotators on the web-based text 
annotation platform ‘Tagtog’¹. The two annotators were both native
speakers of English and were undergraduate students majoring in 
applied linguistics at a public university in the United States. Before 
independent annotation, a norming process was conducted to help 
ensure consistency in annotations. Once normed, the two 
annotators worked independently and coded the 314 essays in the 
opposite order to avoid recency effects. 


The two annotators made decisions on both the boundary of an 
argumentative element and the category of the element. An 
argumentative element was inherently suprasentential (i.e., 
according to the annotation scheme derived from the norming 
session, it could contain one or more sentences, and the content 
could be over the span of paragraphs). Inter-rater reliability 
calculated using Fleiss’s Kappa for all the annotations was 0.584 (p 
< 0.001), indicating fair to good agreement [16]. Disagreements of 
either boundary or category of the argumentative elements between 
the two annotators were adjudicated by an expert adjudicator who 
had years of experience teaching and conducting writing research. 
In the case of disagreement, the expert adjudicator compared the 
annotations from both annotators and made the final decision for 
both the boundary and the category of the argumentative element. 


The current study focuses on the identification of claims versus 
non-claims, mainly because of the small sample size of the corpus 
and the distribution of micro-categories. Thus, we combined the 
categories of Final Claim, Primary Claim, Counterclaim, and 
Rebuttal into a single category of claims. The remaining categories 
of Data and Concluding Summary were classified as non-claims, as
was any non-annotated text. 
2.3 Training and test sets

Annotation of the data led to the classification of 2264 
argumentative elements. As mentioned in Section 2.2, the 
argumentative elements were inherently suprasentential. We 
further split the elements into sentences to determine whether this 
influenced accuracy. All sentences from the same argumentative 
element were given the same annotation as the original category 
(i.e., claims or non-claims). We thus had two data sets: 1) a 
sentence-tokenized data set (N = 6326) and 2) a suprasentential
data set (N = 2264). We randomly selected 70% of the 
argumentative elements as the training set, and the remaining 30% 
of the elements as the test set for both datasets. We report the 
number of argumentative elements, and number of claims and non- 
claims for the datasets in Table 2. 
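
The split can be sketched as follows, assuming it is drawn once at the level of argumentative elements and that sentences inherit both the label and the partition of their parent element. The record layout, function names, and random seed are hypothetical and serve only to illustrate the idea.

```python
import random


def split_by_element(elements, train_share=0.7, seed=0):
    """Randomly assign whole argumentative elements to training or test."""
    ids = [element["id"] for element in elements]
    rng = random.Random(seed)  # fixed seed only so the sketch is repeatable
    rng.shuffle(ids)
    cut = round(train_share * len(ids))
    train_ids = set(ids[:cut])
    train = [e for e in elements if e["id"] in train_ids]
    test = [e for e in elements if e["id"] not in train_ids]
    return train, test


def to_sentences(elements):
    """Expand elements into (sentence, label, element id) rows.

    Every sentence receives the label of the element it came from.
    """
    return [(sentence, e["label"], e["id"])
            for e in elements for sentence in e["sentences"]]


# Hypothetical toy corpus: ten elements with one to three sentences each.
toy_elements = [
    {"id": i,
     "label": "claim" if i % 2 == 0 else "non-claim",
     "sentences": [f"element {i}, sentence {k}" for k in range(i % 3 + 1)]}
    for i in range(10)
]
train_elements, test_elements = split_by_element(toy_elements)
train_sentences = to_sentences(train_elements)
test_sentences = to_sentences(test_elements)
print(len(train_elements), len(test_elements))
```

Splitting before sentence tokenization keeps sentences from the same element out of both sets at once, which avoids near-duplicate context leaking from training into test.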


Table 2. Numbers of elements, claims and non-claims for the 
training and test sets 


Data set                       Number of   Number of   Number of
                               elements    claims      non-claims
Suprasentential training set   1594        639         955
Suprasentential test set       670         267         403
Sentential training set        4401        935         3466
Sentential test set            1925        409         1516


2.4 Features 
2.4.1 Word count 


We extracted the number of words for each claim and non-claim at 
the sentential and suprasentential level. 


2.4.2 N-gram frequency 

We extracted n-grams and the POS combinations of these n-grams 
for both claims and non-claims. We assume that some n-grams (or 
POS n-grams) are more likely to identify claims versus non-claims 
(and vice versa), and the frequency of these key n-grams (or POS 
n-grams) could serve as a good indicator of the type of an
argumentative element or sentence. We used keyness values [21] 
as the measurement of importance of the n-grams or POS n-grams 
in claims and non-claims. Keyness values can provide evidence of 
whether n-grams and POS n-grams are more common in one corpus 
as compared with the other corpus. In the current study, we treated 
the claims and non-claims as two separate corpora. 


Raw and normalized frequency (i.e., normalized by the total 
number of words in all claims and non-claims, respectively) for 
each n-gram (or POS n-gram) that occurred both in claims and non- 
claims were calculated. The keyness value of each n-gram was also 
calculated based on the frequency data following Rayson and 
Garside’s guidelines [40]. Specifically, if an n-gram or POS n-gram 
had a keyness value greater than 3.84 (equivalent to p < 0.05), and 
if it had a higher normalized frequency in claims, it was considered 
more likely to occur in claims over non-claims, and vice versa. The 
range of the n-grams and POS n-grams was from unigrams to seven-
grams. NLTK [6] was used to tokenize the texts into n-grams and
label the POS for the n-grams. For example, the following phrases 
should be, would be, can be, and will be were converted to the same 
POS n-gram combination: MD (modal) + VB (verb base). We did 
not remove stopwords before n-gram tokenization. For each 
suprasentential and sentential argumentative element in the training 
and test sets, we calculated the frequency of each type of the 


significant n-grams or POS n-grams (e.g., bigrams that were 
significant in claims), and normalized the frequency by the length 
(word counts). 
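
The keyness decision described above can be sketched with the standard log-likelihood statistic for comparing two corpora. The function names and the example counts are hypothetical; the 3.84 threshold and the comparison of normalized frequencies follow the description in this section.

```python
import math


def keyness(freq_a, freq_b, size_a, size_b):
    """Log-likelihood keyness of one n-gram across two corpora.

    freq_a, freq_b: raw counts of the n-gram in corpus A and corpus B.
    size_a, size_b: total number of words in corpus A and corpus B.
    """
    total = freq_a + freq_b
    expected_a = size_a * total / (size_a + size_b)
    expected_b = size_b * total / (size_a + size_b)
    log_likelihood = 0.0
    if freq_a > 0:
        log_likelihood += freq_a * math.log(freq_a / expected_a)
    if freq_b > 0:
        log_likelihood += freq_b * math.log(freq_b / expected_b)
    return 2 * log_likelihood


def favoured_category(freq_claims, freq_non, words_claims, words_non,
                      threshold=3.84):
    """Return 'claims', 'non-claims', or None if the n-gram is not significant."""
    if keyness(freq_claims, freq_non, words_claims, words_non) <= threshold:
        return None
    rate_claims = freq_claims / words_claims
    rate_non = freq_non / words_non
    return "claims" if rate_claims > rate_non else "non-claims"


# Hypothetical counts: 60 hits in 10,000 claim words, 40 in 30,000 other words.
print(favoured_category(60, 40, 10_000, 30_000))
```

An n-gram that occurs at the same rate in both corpora receives a keyness of zero and is therefore not counted as significant for either category.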


2.4.3 Positionality of the elements 

Beyond n-gram frequency, studies have shown that the position of
argumentative elements is an indicator of their structural function 
[e.g., 7, 8, 10]. In this study, two types of normalized positional 
variables for each argumentative element or sentence were 
calculated as positionality features. 


Normalized element or sentence position in an essay was computed 
as the ratio of the element/sentence position in an essay to the 
number of elements/sentences in the essay (e.g., if an
argumentative element or a sentence was the 5th element or
sentence in an essay of 10 elements/sentences in total, the value of
this variable would be 5 divided by 10, or 50%). The normalized
position of the element or sentence in a paragraph was computed as
the ratio of the element/sentence position in a paragraph to the total
number of elements/sentences in that paragraph (e.g., if an
argumentative element or a sentence was the 2nd element or sentence
in a paragraph containing 5 elements or sentences in total,
the value would be 2 divided by 5, or 40%).
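
Both positional variables can be computed in one pass over an essay. The sketch below assumes that each unit (element or sentence) belongs to exactly one paragraph; the function name and the toy essay are hypothetical.

```python
def normalized_positions(paragraphs):
    """Return (unit, position in essay, position in paragraph) for every unit.

    paragraphs: list of paragraphs in essay order, each a list of units
    (argumentative elements or sentences). Positions are 1-based ranks
    divided by the number of units in the essay or in the paragraph.
    """
    total_units = sum(len(paragraph) for paragraph in paragraphs)
    rows = []
    essay_index = 0
    for paragraph in paragraphs:
        for paragraph_index, unit in enumerate(paragraph, start=1):
            essay_index += 1
            rows.append((unit,
                         essay_index / total_units,
                         paragraph_index / len(paragraph)))
    return rows


# Hypothetical essay: 10 units in three paragraphs of 3, 5, and 2 units.
toy_essay = [["s1", "s2", "s3"],
             ["s4", "s5", "s6", "s7", "s8"],
             ["s9", "s10"]]
positions = normalized_positions(toy_essay)
print(positions[4])  # ('s5', 0.5, 0.4)
```

The unit "s5" matches both worked examples in the text: it is the 5th of 10 units in the essay (0.5) and the 2nd of 5 units in its paragraph (0.4).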


2.4.4 Other lexical, syntactic, and semantic features 
To explore whether additional lexical, syntactic, cohesion, and 
cognitive text features increased the accuracy in identifying claims 
and non-claims, we extracted 925 features for each of the 
argumentative elements. These features were extracted using the 
Suite of Automatic Linguistic Analysis Tools (SALAT) [14, 15, 25, 
24]. SALAT comprises multiple NLP tools, including TAACO (Tool 
for the Automatic Analysis of Cohesion), TAALES (Tool for the 
Automatic Analysis of Lexical Sophistication), TAASSC (Tool for 
the Automatic Analysis of Syntactic Sophistication and 
Complexity), and SEANCE (Sentiment Analysis and Cognition 
Engine). Two-sample t-tests or Wilcoxon’s tests were conducted 
using the variables after removing SALAT variables that were not 
normally distributed. We then removed those variables where the 
results of t-test or Wilcoxon’s test were not significant between the 
group of claims and non-claims. Finally, by visual inspection, 20 
out of 131 variables that were relevant to argumentative elements 
were selected. Hand selection of variables was done to avoid 
problems of overfitting. The selected NLP features and their 
descriptions are presented in Appendix B. 


2.4.5 Feature reduction 

To avoid multicollinearity, we conducted correlation analyses 
among all the derived features (one versus all) for the two training 
sets, respectively. If two or more variables correlated with r > 
0.699, the variable(s) with the lower correlation with the category 
of the argumentative element/sentence were removed, and the 
variable with the higher correlation was retained. The feature 
reduction process was done on the two training sets first and then 
applied to the test sets. After feature reduction, the frequency 
features that were retained included word count (of the 
argumentative element or sentence), the frequency of the 
significant unigram in claims and in non-claims, bigrams and quad- 
grams in claims, and the frequency of significant POS unigrams, 
trigrams, four-grams, five-grams in claims and in non-claims, and 
frequency of significant six-grams in claims. The two positionality 
features and the selected 20 SALAT features were also retained. 
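
The correlation-based reduction can be sketched in Python as follows. This is a simplified greedy variant under stated assumptions, not the procedure as implemented in the study: features are visited from most to least correlated with the label, and a feature is kept only if its absolute correlation with every already-kept feature does not exceed the 0.699 cutoff. The use of absolute values, the function names, and the toy data are all assumptions, and the sketch assumes no feature is constant.

```python
from statistics import mean


def pearson(x, y):
    """Pearson correlation of two equal-length numeric sequences."""
    mx, my = mean(x), mean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    var_x = sum((a - mx) ** 2 for a in x)
    var_y = sum((b - my) ** 2 for b in y)
    return cov / (var_x * var_y) ** 0.5


def reduce_features(features, labels, threshold=0.699):
    """Keep, from each highly correlated group, the most label-relevant feature.

    `features` maps a feature name to its list of values; `labels` is a
    list of 0/1 category codes (e.g., non-claim = 0, claim = 1).
    """
    relevance = {name: abs(pearson(col, labels)) for name, col in features.items()}
    kept = []
    # Visit features from most to least label-relevant; keep a feature only
    # if it is not too strongly correlated with one that was already kept.
    for name in sorted(features, key=relevance.get, reverse=True):
        if all(abs(pearson(features[name], features[k])) <= threshold for k in kept):
            kept.append(name)
    return kept


# Toy data: char_count is nearly redundant with word_count.
labels = [0, 0, 0, 1, 1, 1]
features = {
    "word_count": [5, 6, 7, 20, 22, 25],
    "char_count": [25, 31, 33, 90, 100, 130],
    "position": [0.9, 0.2, 0.6, 0.1, 0.8, 0.4],
}
kept = reduce_features(features, labels)
```

As in the text, the selection would be derived on a training set only, and the resulting list of retained feature names would then be applied unchanged to the corresponding test set.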


Table 3. Model accuracy results 


Classifier              Model                                           Accuracy  Kappa  Label      Precision  Recall  F1
Logistic Regression     Suprasentential - Frequency and positionality   0.852     0.691  Non-Claim  0.874      0.881   0.878
                                                                                         Claim      0.818      0.809   0.814
                        Suprasentential - Full features                 0.845     0.675  Non-Claim  0.867      0.876   0.872
                                                                                         Claim      0.810      0.798   0.804
                        Sentential - Frequency and positionality        0.802     0.216  Non-Claim  0.817      0.965   0.885
                                                                                         Claim      0.604      0.198   0.298
                        Sentential - Full features                      0.800     0.244  Non-Claim  0.823      0.951   0.882
                                                                                         Claim      0.569      0.242   0.340
Naive Bayes             Suprasentential - Frequency and positionality   0.769     0.485  Non-Claim  0.747      0.931   0.829
                                                                                         Claim      0.833      0.524   0.644
                        Suprasentential - Full features                 0.819     0.618  Non-Claim  0.831      0.878   0.854
                                                                                         Claim      0.799      0.730   0.763
                        Sentential - Frequency and positionality        0.791     0.267  Non-Claim  0.834      0.925   0.878
                                                                                         Claim      0.515      0.301   0.380
                        Sentential - Full features                      0.789     0.271  Non-Claim  0.833      0.916   0.872
                                                                                         Claim      0.506      0.318   0.390
K-Nearest Neighbors     Suprasentential - Frequency and positionality   0.836     0.650  Non-Claim  0.835      0.906   0.869
                                                                                         Claim      0.837      0.730   0.780
                        Suprasentential - Full features                 0.787     0.526  Non-Claim  0.760      0.943   0.842
                                                                                         Claim      0.865      0.551   0.673
                        Sentential - Frequency and positionality        0.818     0.286  Non-Claim  0.827      0.973   0.894
                                                                                         Claim      0.709      0.245   0.364
                        Sentential - Full features                      0.804     0.196  Non-Claim  0.813      0.976   0.887
                                                                                         Claim      0.654      0.166   0.265
Support Vector Machines Suprasentential - Frequency and positionality   0.863     0.714  Non-Claim  0.886      0.886   0.886
                                                                                         Claim      0.828      0.828   0.828
                        Suprasentential - Full features                 0.833     0.652  Non-Claim  0.865      0.856   0.860
                                                                                         Claim      0.786      0.798   0.792
                        Sentential - Frequency and positionality        [?]       [?]    Non-Claim  0.839      0.951   0.891
                                                                                         Claim      0.639      0.325   0.431
                        Sentential - Full features                      0.822     0.320  Non-Claim  0.833      0.968   0.896
                                                                                         Claim      0.706      0.281   0.402
Random Forest           Suprasentential - Frequency and positionality   0.873     0.734  Non-Claim  0.886      0.906   0.896
                                                                                         Claim      0.853      0.824   0.838
                        Suprasentential - Full features                 0.866     0.720  Non-Claim  0.890      0.886   0.888
                                                                                         Claim      0.829      0.835   0.832
                        Sentential - Frequency and positionality        0.832     0.419  Non-Claim  0.858      0.943   0.898
                                                                                         Claim      0.664      0.421   0.515
                        Sentential - Full features                      0.829     0.390  Non-Claim  0.850      0.951   0.897
                                                                                         Claim      0.672      0.377   0.483


To examine whether adding the SALAT features improved the 
accuracy of claim identification, we created two versions of the 
feature sets. The first version comprised the n-gram frequency 
(including word count) features and positionality features, and the 
second version comprised all the features (including the SALAT 
NLP features). Combined with the different levels of discourse 
units (sentential and suprasentential), four pairs of datasets 
(training and test sets) were prepared for modeling: the frequency 
and positionality versions along with the full feature versions at 
both the sentential and suprasentential levels. 


2.4.6 Classifiers 

We used the ‘caret’ [23], ‘randomForest’ [28], ‘e1071’ [32], and 
‘tidyverse’ packages [48] in R [13] to apply Logistic Regression, 
Naive Bayes, K-Nearest Neighbors, Support Vector Machines, and 
Random Forest models. 10-fold cross validation with five repeats 
was used. We trained and tested the four versions of data separately. 
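
The resampling scheme (10 folds, repeated five times) can be illustrated independently of any modeling library. The Python sketch below only generates the train/validation index splits; it is not the 'caret' implementation used in the study, and the function name, the seed, and the absence of class stratification are assumptions.

```python
import random


def repeated_kfold(n_samples, n_folds=10, n_repeats=5, seed=0):
    """Yield (train_indices, validation_indices) for repeated k-fold CV."""
    rng = random.Random(seed)
    for _ in range(n_repeats):
        indices = list(range(n_samples))
        rng.shuffle(indices)  # a fresh random partition for every repeat
        # Fold j takes every n_folds-th shuffled index, starting at offset j.
        folds = [indices[j::n_folds] for j in range(n_folds)]
        for j, validation in enumerate(folds):
            train = [i for k, fold in enumerate(folds) if k != j for i in fold]
            yield train, validation


# 10 folds x 5 repeats gives 50 train/validation splits.
splits = list(repeated_kfold(n_samples=100))
```

Each model would be fit on the training indices of a split and scored on its validation indices, with the 50 scores averaged to compare candidate models before the final evaluation on the held-out test set.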


For the SVM classifier, linear, polynomial, and radial kernels were 
applied. The model with the best performance was selected to make 
predictions on the test set. 


3. RESULTS 


3.1 Model evaluation 

The classification performances (precision, recall, F1 scores, 
accuracy, and Cohen’s kappa) of the multiple models on the test 
sets are reported in Table 3. 
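
For readers less familiar with these measures, the following Python sketch shows how each one is obtained from a binary confusion matrix. The counts in the example are hypothetical and are not taken from the study; "positive" stands for whichever label (Claim or Non-Claim) is being scored.

```python
def binary_metrics(tp, fp, fn, tn):
    """Accuracy, Cohen's kappa, and precision/recall/F1 for the positive class."""
    total = tp + fp + fn + tn
    accuracy = (tp + tn) / total
    # Chance agreement: product of predicted and actual marginal proportions,
    # summed over the positive and negative classes.
    p_pos = ((tp + fp) / total) * ((tp + fn) / total)
    p_neg = ((fn + tn) / total) * ((fp + tn) / total)
    expected = p_pos + p_neg
    kappa = (accuracy - expected) / (1 - expected)
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    return {
        "accuracy": accuracy,
        "kappa": kappa,
        "precision": precision,
        "recall": recall,
        "f1": f1,
    }


# Hypothetical counts: 100 actual positives, 200 actual negatives.
metrics = binary_metrics(tp=80, fp=20, fn=20, tn=180)
```

Because kappa corrects accuracy for chance agreement, a model on imbalanced data can combine a high accuracy with a low kappa, which is the pattern the sentential models show in Table 3.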


Overall, the models developed on frequency and positionality 
features slightly outperformed the models developed using all the 
features. This indicates that adding lexical, syntactic, cohesion, and 
cognitive NLP features does not improve the accuracy of the 
classification of claims and non-claims. In terms of the selection of 
the unit of classification, the suprasentential models outperformed 
the sentential models. Finally, the suprasentential Random Forest 
model based on frequency and positionality features yielded the 
best accuracy (0.873) and Kappa (0.734), followed by the 
suprasentential Random Forest model based on the full feature set, 
with an accuracy of 0.866 and Kappa of 0.720. Both represent good 
performance based on the scale of Cohen’s Kappa values [11]. 


3.2 Important variables 

Variable importance for the best model (the suprasentential 
Random Forest model based on word count, n-gram frequency and 
positionality features) was reported by the ‘caret’ package. Table 4 
shows the top 10 important variables and their importance values 
for this model. 


The variable importance values showed that the length (word 
count) of an argumentative element, the normalized position of the 
argumentative element in the essay, and the frequency of 
significant bigrams in claims in the argumentative element are the 
three most important variables. 


Table 4. Variable importance values 


Variable                                                 Importance Value 

Word Count                                               289.988 
Normalized element position in the essay                 162.083 
Frequency of significant bigrams in claims               47.992 
Frequency of significant unigrams in claims              31.147 
Normalized element position in the paragraph             29.791 
Frequency of significant POS five grams in claims        28.465 
Frequency of significant POS four grams in claims        27.389 
Frequency of significant unigrams in non-claims          25.399 
Frequency of significant POS unigrams in claims          25.272 
Frequency of significant POS unigrams in non-claims      93.812 
Frequency of significant POS trigrams in claims          20.364 
Frequency of significant POS trigrams in non-claims      18.375 
Frequency of significant POS four grams in non-claims    13.745 
Frequency of significant four grams in claims            8.676 
Frequency of significant POS six grams in claims         8.490 
Frequency of significant POS five grams in non-claims    4.210 


4. ERROR ANALYSES AND DISCUSSION 


We conducted error analyses for the two Random Forest 
suprasentential models (i.e., the models based on the frequency and 
positionality feature set and the full feature set). Our goal was to 
examine the misclassifications of the models to better understand 
elements that may contribute to model accuracy. 


We first examined classification rates. Among all incorrectly 
classified instances, claims were misclassified as non-claims more 
often than non-claims were misclassified as claims. For both 
models, around 17% of claims were misclassified as non-claims, 
and around 10% of non-claims were misclassified as claims. These 
results indicate that the models are better at identifying non-claims 
than claims, potentially due to the imbalance between claims and 
non-claims in the data. Nevertheless, future studies should examine 
whether there are more representative features of claims that can 
be integrated into our current feature set. 


We next examined if essay quality and length influenced the model 
accuracy. Specifically, for each argumentative element in the two 
suprasentential test sets, we extracted the following information: 
holistic score, number of words, number of sentences, and number 
of paragraphs in the essay where the argumentative element 
occurred. We examined differences between the argumentative 
elements that were correctly and incorrectly predicted for these 
features using t-tests. No differences were reported for essay 
quality and length in either model. Thus, the classification of 
argumentative elements was not related to the quality or the length 
of essays. 


We also examined if differences in model accuracy were related to 
more specific argumentation categories (i.e., micro-categories). As 
mentioned in Section 2.2, we merged the argumentation categories 
of Primary Claim, Final Claim, Counterclaim, and Rebuttal from 
the original annotated corpus into a larger classification of claims 
(i.e., a macro-classification). We also classified the remaining 
categories of Data and Concluding Summary along with Non- 
annotated texts into non-claims. To assess whether the micro- 
categories influenced classification of the macro-classification, we 
compared the prediction accuracies among the seven micro- 
categories. 


The results showed that Counterclaims were not misclassified in 
either model (likely because of their rarity). Concluding Summaries 
were not misclassified in the frequency- and positionality-based 
model but were misclassified 3.9% of the time in the full feature 
model. Data was misclassified around 9% of the time in both 
models. Meanwhile, the sub-categories that were more frequently 
misclassified included Primary Claims (around 14% misclassified), 
Final Claims (around 21% misclassified), Non-annotated texts 
(around 22% misclassified), and Rebuttals (2 out of 3 instances, or 
66.7%, misclassified in both models). These results are in line with 
the finding that claims were more frequently misclassified as non- 
claims. 


To further explore what factors affected the misclassifications 
among the micro-categories of argumentative types, Welch’s t-tests 
were conducted on all NLP features (see Appendix B) used in the 
full analysis, comparing correctly and incorrectly classified 
instances. However, the analysis was not done for the sub-category 
of Counterclaim since all instances under this category were 
correctly predicted by the two models. Also, we did not conduct 
t-tests for the micro-category of Rebuttal due to the small sample 
size (N = 3). 


Table 5 presents the features for which significant differences were 
found between the correct and incorrect classification instances in 
at least two categories of argumentative types. In general, the 
classification of Primary Claim, Data, Concluding Summary, and 
Non-annotated texts seemed to be more strongly influenced by 
linguistic features. Word count was the strongest indicator of 
misclassification, with differences found for each micro- 
category. The standard deviation of dependents per object of 
prepositions, which reflects the development of syntactic 
complexity [25], was another strong predictor of misclassification. 


Table 5. Features with significant differences between correct and incorrect classification instances 


1 2 3 4 5 


Primary claim Yes Yes Yes 
Final claim Yes 
Data Yes Yes Yes Yes Yes 
Concluding summary Yes Yes 
Nonannotated Yes Yes Yes 


6 7 8 9 10 11 12 13 14 

Yes Yes Yes Yes Yes 
Yes Yes 

Yes Yes Yes Yes Yes Yes 
Yes Yes Yes Yes Yes 

Yes Yes Yes Yes 


Note. Shaded gray cells with ‘Yes’ indicate significant difference (p < .05) were found between the correct and incorrect instances. 1 = 
Number of named entities, 2 = Word count, 3 = Normalized element position in the paragraph, 4 = Normalized element position in the essay, 
5 = Frequency of significant unigrams in claims, 6 = Frequency of significant POS trigrams in claims, 7 = Frequency of significant quad- 
grams in claims, 8 = Hu Liu proportion score, 9 = Objects component score, 10 = Brown frequency score, 11 = Bigram lemma type-token 
ratio, 12 = Nouns as modifiers score, 13 = Dependents per object of the preposition (SD), 14 = T-units per sentence. 


The number of named entities was a strong indicator for the non- 
claims: incorrectly classified non-claims contained fewer named 
entities than correctly classified ones. The nouns as modifiers 
score, which measures the use of nouns as nominal modifiers in 
general and the variation in the number of modifiers per nominal 
[25], was also predictive of misclassification. Other linguistic 
features that influenced the classification accuracy included the 
normalized position of the element in the paragraph and in the 
essay, the bigram lemma type-token ratio, the frequency of key 
unigrams, quad-grams, and POS trigrams in claims, the number of 
T-units per sentence, the number of terms that reference objects, 
the proportion of words with positive sentiments to words with 
negative sentiments, and the mean frequency score based on the 
London-Lund Corpus of Conversation. 


5. CONCLUSION 


In this study, we proposed an approach that combined the 
frequency, positionality, and other lexical, syntactic, cohesion, and 
cognitive NLP features to predict claims and non-claims in 
argumentative essays. Our model performed well in the 
classification of these argumentative elements. Our exploration of 
the features, the comparison between sentential and 
suprasentential models, and investigation of the factors that 
influenced classification accuracy in the error analyses should 
contribute to the field of automated identification and evaluation of 
discourse elements in argumentative writing. 


It is important to note that the corpus used for this study was 
relatively small, comprising 314 student essays. Thus, to gain 
higher accuracies and reliabilities in classifying argumentative 
elements, we plan on annotating more essays and expanding the 
current corpus. That also means we will use essays written to more 
prompts allowing us to extract key n-grams and POS n-grams that 
are more generic and less restricted to the specific prompts used 
here. In addition, due to the small sample size, our classification of 
argumentative elements was simplified to focus on claims versus 
non-claims. We are interested in exploring the classification of the 
micro-categories (Primary Claim, Final Claim, Counter Claim, 
Rebuttal, Data, and Concluding Summary) in a larger corpus. We 
also plan to include the prediction of the quality of these 
argumentative elements in students’ writing. 


The models developed in this study will be included in an online 
Writing Assessment Tool (WAT). With the classification algorithm 
implemented, WAT’s automatic writing evaluation (AWE) system 
will have the capacity to predict the number of claims in an essay 
and whether the claims mention the key n-grams that reflect the 
argument topic. This will make it possible to provide feedback to 
students on argumentation quality within their essays. The study 
also provides insight into the length, position, content (e.g., the 
key n-grams), and other NLP features in claims versus non-claims 
in students’ writing, which will contribute to finer-grained 
feedback components in our AWE system. 


This study also provides important information for others who are 
developing AWE algorithms to drive feedback on argumentative 
essays, or more broadly to better understand the use of claims in 
essays. Specifically, the results of this study inform features related 
to feedback that can be provided to students about the number of 
claims, mentioning the argument topic, how to better position 
argumentative elements within their essays, and how to pay 
attention to specific linguistic features (such as the use of named 
entities when giving evidence) in their writing. This is an important 
achievement in the realm of writing feedback given the crucial need 
to automate feedback to students on their use of claims and 
evidence in argumentative essays. 


Another important contribution of this study is that we also 
introduce a new corpus of essays annotated for argumentative 
elements, which is made publicly available at 
linguisticanalysistools.org. This corpus includes theoretically 
aligned argumentative elements that complement existing corpora 
[44, 45] and adds new components including prompts, holistic 
scores, additional categories of argumentation, and different 
educational settings. As such, this study provides the opportunity 
for other scientists to build upon our work such that we can better 
understand writing, and the features related to successful 
composition. 


APPENDIX 


A. Definitions of argumentative elements 


Element: Final Claim 
Definition: An opinion or conclusion on the main question. 
Example: In my opinion, every individual has an obligation to think seriously about important matters, although this might be difficult. 

Element: Primary Claim 
Definition: A claim that supports the final claim. 
Example: The next reason why I agree that every individual has an obligation to think seriously about important matters is that this simple task can help each person get ahead in life and be successful. 

Element: Counterclaim 
Definition: A claim that refutes another claim or gives an opposing reason to the final claim. 
Example: Some may argue that obligating every individual to think seriously is not necessary and even annoying as some people may choose to just follow the great thinkers of the nation. 

Element: Rebuttal 
Definition: A claim that refutes a counterclaim. 
Example: Even though people can follow others’ steps without thinking seriously in some situations, the ability to think critically for themselves is a very important survival skill. 

Element: Data 
Definition: Ideas or examples that support primary claims, counterclaims, or rebuttals. 
Example: For instance, the presidential debate is currently going on. In order to choose the right candidate, voters need to research all sides of both candidates and think seriously to make a wise decision for the good of the whole nation. 

Element: Concluding Summary 
Definition: A concluding statement that restates the claims. 
Example: [...] thinking seriously is important in making decisions because each decision has an outcome that affects lives. It is also important because if you think seriously it can help you succeed. 

Element: Non-annotated 
Definition: Any text that doesn’t fall into any of the above categories. 
Example: People always strive to be unique or different. This idea clashes with creativeness all through our lives. 


B. Descriptions of the SALAT NLP features 


NLP features from SALAT: Descriptions 

Bigram lemma type-token ratio: Number of unique bigram lemmas (types) divided by the number of total bigram lemmas (tokens) 
Brown frequency score: Mean word frequency score based on London-Lund Corpus of Conversation 
Brysbaert concreteness score: Sum of concreteness scores based on all words divided by number of words with concreteness scores 
COCA academic bigram association strength: Sum of approximate collexeme strength score divided by the number of bigrams in text with collexeme scores 
Dependents per clause (SD): The standard deviation of the total number of dependents per clause 
Dependents per object of the preposition (SD): This score captures the variation (standard deviation) in the prepositional objects 
Direct objects per clause: The number of direct objects per clause 
Free association tokens response score: Number of response tokens elicited by word as stimuli in discrete word association experiment (based on function words) 
Hu Liu proportion score: Proportion of the number of words with positive sentiments to the words with negative sentiments 
LDA age of exposure score: Based on Incremental Age of Exposure for words across 13 grade levels; calculated based on 1/slope of linear regression 
Lexical decision time: Standardized lexical decision reaction time across all participants for this word (z-score, based on function words) 
Nouns as modifiers score: This score captures the use of nouns as modifiers and modifier variation 
Number of named entities: The number of named entities 
Number of prepositions per clause: This score captures noun phrase elaboration and clause complexity 
Objects component score: This component score represents the number of terms that reference objects 
Possessives component score: This component score captures the use of possessives in general, and specifically captures the use of possessives in nominal subjects, direct objects, and prepositional objects 
Sentiment score of dominance: This score captures the sentiment of dominance, measured by the number of words of dominance 
Sentiment score of overstating: This score captures the sentiment of overstating, calculated based on words indicating emphasis in realms of frequency, causality, accuracy, validity... 
T-units per sentence: Number of T-units in text divided by number of sentences in text 
Verb argument constructions association strength: Average approximate collostructional strength score based on the COCA academic corpus 

Note. For more information about the SALAT NLP features, please see https://www.linguisticanalysistools.org/ 


ABSTRACT 


We present an empirical study on the use of keystroke analytics 
to capture and understand how writers manage their time and to 
make inferences about how they allocate their cognitive resources 
during essay writing. The results suggest three distinct 
longitudinal patterns of writing process that describe how writers 
approach an essay task in a writing assessment. A discussion of the 
potential applications of keystroke analytics for improving 
teaching, learning, and assessing writing is also provided. 


Keywords 
keystroke logging, writing, cognitive process, time manage- 
ment, writing pattern 


1. INTRODUCTION 


The study of writing process has long been of interest to the 
writing research community (e.g., [4], [18], [12], [79], [22]). 
With the advances in technology, keystroke logging has become a 
practical and popular tool to capture and study the process of 
composition in a wide range of contexts [10]. In this study, we 
present research findings on the use of keystroke analytics to 
understand writers’ time management during their writing 
process. The results have practical implications for the teaching 
and learning of writing in classrooms. 


Previous research on writing cognition suggested several 
subprocesses of writing [9], including task analysis, text planning, 
idea generation, translating ideas into natural language, 
transcribing language onto paper (handwriting) or a screen 
(keyboard-based writing), text revision, copy editing and 
reviewing. Figure 1 illustrates a simplified version of Hayes’ 
cognitive writing model, which specifies four main subprocesses 
of writing. Specifically, idea generation and task preparation (i.e., 
proposer) often manifest as pauses at the start of writing and at 
sentence boundaries; fluency of putting ideas into language (i.e., 
translator) primarily relates to the size of long sequences of text 
production without major interruption (also known as “bursts”); 
orthographic proficiency and motor skill (i.e., transcriber) 
typically relate to pauses inside a word and to edits designed to 
make immediate corrections to spelling errors or typos; and 
editing and reviewing (i.e., evaluator) usually show up as jumps to 
different locations in the text to make changes or replace large 
chunks of existing text with new content. 


[Figure 1 diagram: Proposer -> Translator -> Transcriber, with the Evaluator shown above the chain] 


Figure 1: A Cognitive Model of Writing Process 


One important implication from the cognitive model is that 
writing is not a linear process and successful writing calls for 
effective management and coordination of the subprocesses. 
Drawing from the cognitive theories of writing, an overview of 
the types of activities occurring during text composition can be 
found in [5]. The cognitive resources, as stated in [?], required to 
carry out each activity are not distributed randomly over the 
text-production process. Writers often need to decide which goals 
to prioritize at which time point because they simply do not have 
unlimited working memory to accomplish everything at once [?]. 
With the availability of keystroke logs, how writers distribute 
their time and cognitive resources to various subprocesses of 
writing can be quantified and analyzed, which is described in the 
next section. In this study, we aim to tackle the specific research 
question of whether there are distinct writing-process patterns 
that may be detected with regard to how writers allocate their 
cognitive resources to various subprocesses of writing. The 
identification of meaningful writing profiles will have practical 
implications for instructors in designing curricula suitable to their 
classes and personalizing their instruction for learners with 
different needs and characteristics. 


1.1 Keystroke Logging 

We consider keystroke logging as a recording of every keypress 
that the writer makes on the keyboard. The gap time between two 
consecutive keypresses is often called an interkey interval (IKI), 
which is also recorded in keystroke logging. A single keystroke 
record in JSON format may look like this: 
{“p”: 1, “o”: “”, “n”: “I”, “t”: “0.57”}, where 


Table 1: An Example Keystroke Log Segment 


Index  PosInText  Content  ContentLen  TimeStamp  ActionType  Context       WordIntended  CursorJump  TextToDate 
0      1          I        1           0.57       Insert      WordStart     It            N           I 
1      2          t        1           0.62       Insert      InWord        It            N           It 
2      3          (space)  1           0.99       Insert      BetweenWords                N           It 
3      4          i        1           1.30       Insert      WordStart     is            N           It i 
4      5          s        1           1.42       Insert      InWord        is            N           It is 
5      6          (space)  1           1.61       Insert      BetweenWords                N           It is 
6      7          a        1           2.12       Insert      WordStart     a             N           It is a 
7      8          (space)  1           2.22       Insert      BetweenWords                N           It is a 
8      9          g        1           2.35       Insert      WordStart     fun           N           It is a g 
9      10         o        1           2.43       Insert      InWord        fun           N           It is a go 
10     11         o        1           2.55       Insert      InWord        fun           N           It is a goo 
11     12         d        1           2.68       Insert      InWord        fun           N           It is a good 
12     12         d        1           3.01       Delete      InWord        fun           N           It is a goo 
13     11         o        1           3.16       Delete      InWord        fun           N           It is a go 
14     10         o        1           3.30       Delete      InWord        fun           N           It is a g 
15     9          g        1           3.53       Delete      InWord        fun           N           It is a 
16     10         f        1           3.71       Insert      InWord        fun           N           It is a f 
17     11         u        1           3.93       Insert      InWord        fun           N           It is a fu 
18     12         n        1           4.11       Insert      InWord        fun           N           It is a fun 
19     13         (space)  1           4.30       Insert      BetweenWords                N           It is a fun 
20     14         d        1           4.49       Insert      WordStart     day           N           It is a fun d 
21     15         a        1           4.62       Insert      InWord        day           N           It is a fun da 
22     16         y        1           4.90       Insert      InWord        day           N           It is a fun day 
23     17         .        1           5.13       Insert      PuncMark                    N           It is a fun day. 


Note: ContentLen=Content Length. ContentLen usually takes the value of 1 unless the writer cuts or pastes in a chunk of text with more than 
one character. PuncMark=Punctuation Mark. CursorJump can take a binary value of “Y” or “N”. 


“p” is the position in the text box, “o” is the current text 
at that position, “n” is the change made to that position, 
and “t” is the time elapsed since the start of writing. In this 
example, the writer inserted a character “I” at position 1 in 
the text box at a timestamp of 0.57 seconds, computed relative 
to when the writing started (i.e., at 0 elapsed seconds). 
The overall behavioral process of text production can then be 
represented by a sequence of keystroke records. More 
importantly, qualitative labels may be attached to characterize a 
keystroke record in terms of the type (e.g., insertion, deletion) 
and location (e.g., inside of a word, end of a sentence) 
of an action, along with the content and associated time 
stamp. For the hypothetical example given in Table 1 (for 
illustration purposes only), the writer spends 5.13 seconds to 
write a full sentence: “It is a fun day.” During the process, 
the writer changed the choice of a word from “good” to “fun”, 
as evidenced by a sequence of “delete” actions. Cursor 
location is tracked so that if the cursor moves suddenly to a 
different location away from the current location, the jump 
behavior can be detected. 
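
To make the notion of an interkey interval concrete, the following Python sketch parses a short log and computes the IKIs between consecutive records. It assumes records with the four fields “p”, “o”, “n”, and “t” described above and a timestamp stored as a string; the field values are invented for illustration (the timestamps mirror the first rows of Table 1), and the variable names are hypothetical.

```python
import json

# Hypothetical log: four insertions at the end of the text.
raw_log = """[
  {"p": 1, "o": "", "n": "I", "t": "0.57"},
  {"p": 2, "o": "", "n": "t", "t": "0.62"},
  {"p": 3, "o": "", "n": " ", "t": "0.99"},
  {"p": 4, "o": "", "n": "i", "t": "1.30"}
]"""

records = json.loads(raw_log)

# Elapsed time (seconds) of each keypress since the start of writing.
timestamps = [float(record["t"]) for record in records]

# IKI = gap between two consecutive keypresses, rounded to centiseconds.
ikis = [round(later - earlier, 2) for earlier, later in zip(timestamps, timestamps[1:])]
```

A log of N keystrokes yields N - 1 intervals; whether the pause before the first keystroke is also counted as an interval is a design choice that this sketch leaves out.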


As Table 1 indicates, keystroke logs allow the visible aspects 
of the text-production process to be precisely reconstructed 
and retrospectively replayed. Figures 2 and 3 demonstrate 
one approach to visualizing the dynamics of the text-production 
process by plotting the time elapsed (horizontal axis) against 
text length and cursor position (vertical axis). When the 
writer is appending or deleting text at the end of the text, 
the dashed purple line (text length) and the solid green line 
(cursor position) converge; when the writer is making 
changes elsewhere in the text, the green cursor-line 
diverges from the purple length-line. The gaps between the 
two lines indicate the degree of the “jump” action. The 
length-line can go up or down, indicating the adding or removing 
of content. The small-scale zig-zag pattern in both figures 
suggests that both writers made a fair number of quick 
fixes or local edits, mostly at the word level (e.g., typo 
correction, word-choice revision, removing/adding punctuation 
marks), at the end of the text as they wrote. The writer in 
Figure 3 showed evidence of global-level editing behavior 
towards the end of the writing session, when the writer moved 
the cursor to different parts of the text to make changes, as 
can be seen from the relatively large gaps between the purple 
and green lines. This type of jump-edit behavior is largely 
absent in Figure 2, in which the writer showed a much more 
linear writing process. 


Figure 2: Writing Progression Example a 


Figure 3: Writing Progression Example b 


1.2 Inferences & Relations to Writing Quality 
The nature and location of the changes that writers make 
to their text directly can directly support inferences about 
where the writer is cognitively in the composition process. 
For example, a long pause followed by insertion of a written 
outline is suggestive of task analysis and idea generation; a 
long pause at the phrasal or sentence boundaries followed 
by a burst of text production is a sign of sentence planning; 
alternating between insert and delete actions on a character- 
level inside of a word is likely an indication of spelling cor- 
rection or word finding; if a writer types long sequences of 
words thereby adding new content, it is reasonably safe to 
assume that the writer is primarily engaged in content gen- 
eration; and, if a writer jumps to various locations (tracked 
by cursor position) in the text to make changes, the writer 
is more likely in the state of text reviewing and revision 
(e.g., [4], [16]). Previous research has reported that features
of the writing process, such as the total time spent on writing,
between-word pause tempo, initial pause length before
typing a word, length of long burst (i.e., stretches of long 
sequence of text production), and extent of text editing and 
revision, relate to the quality of the final written product 
(e.g., [3], [17], [27]). In this study, we largely followed
the practice described in [8] by classifying each interkey interval
(i.e., the gap time between two keystrokes) into one of
four cognitive states in writing. These states
are intended to operationalize the theoretical subprocesses proposed
in Hayes' model, although there will unavoidably be
gaps between theory and practice. The four states are the Long Pause state,
representing text planning, idea generation and deliberation, or
hesitation and struggle with text production; Text Produc- 
tion state, representing relatively fluent content generation 
without major interruption where interruption is signaled 
by an extended pause; Local Editing state, representing lo- 
calized (mostly on the word-level) minor text editing; and 
Global Editing state, representing reviewing, revision and 
copy editing on the whole passage/text level. 


2. METHOD 
2.1 Data Set 


The data set was collected from a high-school equivalency 
testing program, which contains five subject tests: English 
language arts — reading, English language arts — writing, 
math, science, and social science. The focus of this study 
was the essay writing task in the writing subtest. In re- 
sponding to the essay task, the examinees are expected to 
read two sources with different perspectives on a common 
issue (e.g., whether success is more the result of talent or of 
hard work), and then express and explain their opinions in 
writing while appropriately incorporating evidence from the 
sources. Each submitted essay was scored holistically on a 
0-6 scale by two trained human raters according to a stan- 
dardized grading rubric. Essays receiving a human score of 
0 were excluded from analysis as those essays tend to have 
aberrant characteristics such as being empty, not in English, 
or consisting of random keystrokes. In this study, we se- 
lected two writing forms administered between September 
2017 and August 2018 for investigation. Each form con- 
tained one essay writing task, or prompt. The sample size 
used for analysis was approximately 500 in each prompt. 
The analyses were conducted on the first (base) prompt, 
and then replicated on the second prompt to validate the 
consistency of the findings. 


2.2 Propensity Score Matching 

To ensure comparability, propensity score matching (PSM) 
was used to minimize the influence of irrelevant factors such as performance
level and the participants’ demographics and to balance
covariates between the participants who responded to
either of the two prompts. Also, the two different prompts 
were not administered at random, thus necessitating this 
step. A logistic regression model was developed to gener- 
ate the propensity scores, and a one-to-one greedy matching 
without replacement algorithm with a caliper value of 0.05 
was applied to find the matches in the prompt 2 sample to 
make it most comparable to the prompt 1 sample [15]. The 
caliper value refers to the maximum distance in propensity 
scores; hence the smaller the caliper value, the closer the 
match. In performing the PSM, the covariates were chosen
based on our understanding of the examination and the examinee
population, and on findings from previous reports
on subgroup differences in writing process (e.g., [7], [25]).
The covariates included for propensity score matching were 
gender (Male or Female), ethnicity (White, Black, Hispanic, 
or others/unreported), employment status (Full-time, Part- 
time, Unemployed, or others/unreported), highest education 
level (Below Grade 9, Some high school, others/unreported), 
English as best communicative language (Yes or No), as well 
as scores on the subject tests other than writing. All the de- 
mographic background variables were self-reported by the 
participants on a voluntary basis. 
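
The matching step described above can be illustrated with a minimal sketch. It assumes the propensity scores have already been estimated (for example, by a logistic regression) and that the caliper is applied directly to the difference in scores. The function name, the processing order, and the toy scores are hypothetical and are not taken from the study.

```python
def greedy_caliper_match(base_scores, pool_scores, caliper=0.05):
    """One-to-one greedy nearest-neighbour matching without replacement.

    base_scores / pool_scores: propensity scores in [0, 1], assumed to have
    been estimated beforehand. Returns a list of (base_index, pool_index).
    """
    available = set(range(len(pool_scores)))
    pairs = []
    for i, score in enumerate(base_scores):
        if not available:
            break
        # Closest still-unmatched candidate; ties broken by the lower index.
        j = min(available, key=lambda c: (abs(pool_scores[c] - score), c))
        if abs(pool_scores[j] - score) <= caliper:
            pairs.append((i, j))
            available.remove(j)  # "without replacement"
    return pairs


# Toy propensity scores (hypothetical values).
prompt1 = [0.30, 0.52, 0.90]
prompt2 = [0.10, 0.31, 0.55, 0.56]
matches = greedy_caliper_match(prompt1, prompt2)
print(matches)
```

In this toy example the third base case stays unmatched because no remaining candidate lies within the caliper, which is how a smaller caliper trades sample size for closer matches.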


2.3 Feature Extraction 

Keystroke logs were recorded automatically as writers com- 
posed their essays. A two-stage procedure was applied for 
feature extraction. In Stage 1, we classified each interkey interval
(IKI) into one of four heuristically defined and mutually
exclusive writing states by following the practice in [8] with
some modifications. In Stage 2, we split each log into ten 
time periods by evenly dividing the total writing duration 
into ten segments, as one way to align and compare the 
logs of different length in duration. The choice of ten time- 
periods was made to balance the duration of each segment, 
which should be long enough to detect any patterns related 
to time distribution, and the number of segments, which 
should be sufficient for detecting longitudinal patterns. 


Stage 1: Classification of writing states. The general pro- 
gramming logic is as follows. 


• Step 1. Define Long Pause (LP) state. If an IKI is
longer than n times in-word typing speed, it is then
labelled as P. The keystroke sequence in between two
adjacent Ps is considered a burst.


• Step 2. Define Text Production (TP) state. Inside of
a burst, if there is an absence of Delete action, or the
max number of a Delete sequence is smaller than k,
label all IKIs in this burst as TP. If there is a consecutive
delete action sequence with k or more
Deletes, temporarily change all IKIs in this burst to
R, which will be refined in the next step.


• Step 3. Define Local Editing (LE) state. Use an m-IKI
moving window to scan through the keystroke sequence
within an R-burst. That is, the first moving
window contains the 1st to the m-th IKIs in a burst;
the second moving window contains the 2nd to the
(m+1)-th IKIs in the burst; and so on.


  - If all IKIs in a moving window are Inserts or contain
    fewer than s Deletes, change the first record in
    the moving window from R to TP. Continue with
    the same logic to the next moving window.


  - If a moving window contains s or more
    Delete actions, label all the m records in
    the moving window as LE.


• Step 4. Define Global Editing (GE) state. GE is indicated
by text deletion while crossing sentence boundaries
or making jump-edits elsewhere in the text away
from the current location.


  - If d or more consecutive Delete actions contain
    the following punctuation marks (comma, period,
    semicolon, exclamation mark, and question
    mark), label the entire Delete sequence as GE.


  - If the position of an IKI in the text is different
    from the one before it by more than q, then the
    IKI is labeled as Jump Back (JB, to an earlier
    place in the text) or Jump Forward (JF, to a later
    place). Find the longest JB-JF pair in distance,
    and label all the IKIs in between as GE.


The parameters n, m, k, s, d, and q used in state definitions 
are customizable. In this study, we chose n=10, m=3, k=3, 
s=3, d=2, and q=4 as a starting point, mainly following [3].
Once all the IKIs are labelled, consecutive IKIs of the same 
state can be further aggregated. Each keystroke log can be 
described by the total number of states, the duration of each 
state, proportion of time spent on each state, the frequency 
of various transitions, etc. Among the four states, there are 
12 state-transition possibilities (e.g., TP → GE) in total.
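
To make the first two classification steps concrete, the following is a minimal sketch, not the authors' implementation. It assumes that each keystroke is stored as a pair of IKI (in seconds) and action ("insert" or "delete"), and that the in-word typing speed is supplied as a typical in-word IKI; both representations are hypothetical. Steps 3 and 4 are omitted, so bursts labelled R would still need refinement. The last lines enumerate the ordered transitions among four states, which gives 4 × 3 = 12.

```python
from itertools import permutations


def label_pauses_and_bursts(keystrokes, in_word_iki, n=10, k=3):
    """Steps 1 and 2 only. keystrokes: list of (iki_seconds, action).

    Returns one label per keystroke: 'P' (long pause), 'TP', or 'R'
    (a burst that still needs the Step 3 / Step 4 refinement).
    """
    # Step 1: an IKI longer than n times the in-word IKI is a long pause.
    labels = ["P" if iki > n * in_word_iki else None for iki, _ in keystrokes]

    def close_burst(start, end):
        # Step 2: find the longest run of consecutive deletes in the burst.
        longest = run = 0
        for _, action in keystrokes[start:end]:
            run = run + 1 if action == "delete" else 0
            longest = max(longest, run)
        state = "R" if longest >= k else "TP"
        for idx in range(start, end):
            labels[idx] = state

    start = None
    for idx, label in enumerate(labels):
        if label == "P":
            if start is not None:
                close_burst(start, idx)
                start = None
        elif start is None:
            start = idx
    if start is not None:
        close_burst(start, len(labels))
    return labels


# Toy log (hypothetical values): two fluent keystrokes, a long pause,
# then a burst containing three consecutive deletes.
log = [(0.2, "insert"), (0.15, "insert"), (3.0, "insert"), (0.2, "insert"),
       (0.2, "delete"), (0.2, "delete"), (0.2, "delete"), (0.25, "insert")]
labels = label_pauses_and_bursts(log, in_word_iki=0.2)
print(labels)

# Ordered transitions between four distinct states.
transitions = list(permutations(["LP", "TP", "LE", "GE"], 2))
print(len(transitions))
```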


Stage 2: In this stage, we used the state classification gener- 
ated above to calculate a writer’s time distribution at various 
points during the writing process. To accomplish this goal, 
we divided each keystroke log into ten even time-periods. We 
then calculated the proportion of time spent in each writing 
state within a time period. As a result, each keystroke log 
was represented by a vector of 40 elements (i.e., four states 
times ten segments). The elements’ values (i.e., percent- 
ages) could range from 0 to 100. When it is a 0, it simply 
means that the writer did not spend time on an activity 
(e.g., Local Editing) during that time period (which is one 
tenth of the total time). Similarly, when the value is 100, 
it means that a writer spent all his/her time on an activity 
(e.g., Global Editing) during that time period. With this 
information, the longitudinal pattern of time allocation can 
be revealed and investigated. We can analyze, for example, 
how writers spend their time at the beginning, middle, or 
end of their writing process, whether writers distribute their 
effort evenly or differently at various time-points during the 
writing process, and whether there are distinct profiles with 
regard to time management of an individual writing process. 
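
The 40-element representation can be sketched as follows. This is an illustration under stated assumptions: the labelled log is given as (start, end, state) intervals in seconds, an interval that straddles a segment boundary is split in proportion to its overlap, and the vector is ordered segment by segment. None of these details is specified in the text, and the names are hypothetical.

```python
STATES = ("LP", "TP", "LE", "GE")


def time_allocation_vector(intervals, n_segments=10):
    """intervals: list of (start, end, state) covering the writing session.

    Returns n_segments * 4 percentages, ordered segment by segment and,
    within a segment, in the order given by STATES.
    """
    t0 = min(start for start, _, _ in intervals)
    t1 = max(end for _, end, _ in intervals)
    seg_len = (t1 - t0) / n_segments
    vector = []
    for s in range(n_segments):
        seg_start, seg_end = t0 + s * seg_len, t0 + (s + 1) * seg_len
        spent = dict.fromkeys(STATES, 0.0)
        for start, end, state in intervals:
            # Time this interval contributes to the current segment.
            overlap = min(end, seg_end) - max(start, seg_start)
            if overlap > 0:
                spent[state] += overlap
        vector.extend(100.0 * spent[state] / seg_len for state in STATES)
    return vector


# Toy 100-second session (hypothetical values).
session = [(0, 30, "LP"), (30, 80, "TP"), (80, 90, "LE"), (90, 100, "GE")]
vec = time_allocation_vector(session)
print(len(vec), vec[:4], vec[36:])
```

Because the toy intervals cover the whole session, the four percentages within each segment sum to 100.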


2.4 Cluster Analysis and Interpretation 

Using the writing samples in each of the two prompts, we 
conducted hierarchical cluster analysis (agglomerative ap- 
proach) with Ward linkage using the Euclidean distance met- 
ric [23] [20]. The proportion of time spent in each state at the 
ten time-points was used to create 40 input variables. Each 
input variable was standardized to a mean of zero and stan- 
dard deviation of one across individuals in a prompt [13]. 
The cluster analysis was done separately for each prompt, 
with the second prompt serving as a replication sample to 
verify results from the base prompt. Because there is no es- 
tablished convention for choosing the number of clusters, we 
used the Pseudo-F statistic, model R-squared, and semipar- 
tial R-square statistics to help us determine the appropriate 
number. The pseudo-F statistic is calculated as the ratio 
of the between-cluster variance to the within-cluster vari- 
ance [14]. Larger values indicate better separation between 
the clusters. The model R-squared indicates the proportion 
of variance accounted for by the clusters. The semipartial
R-square indicates the decrease in the proportion of variance
accounted for due to joining two clusters. We plotted
these statistics against the number of clusters to examine 
the impacts of joining or splitting clusters. Dendrograms vi- 
sualizing the distances between the keystroke logs were also 
examined to help select the final number of clusters. 
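
As an illustration of two of these statistics, the sketch below computes the pseudo-F statistic and the model R-squared for a given cluster assignment. It assumes the usual sums-of-squares definitions, with pseudo-F = [SSB / (k - 1)] / [SSW / (n - k)] and R-squared = SSB / SST, and it takes the cluster labels as given rather than performing the Ward clustering itself. The function name and toy data are hypothetical.

```python
def cluster_fit_stats(points, labels):
    """points: list of equal-length numeric lists; labels: cluster id per point.

    Returns (pseudo_F, r_squared). Requires at least two clusters and a
    non-zero within-cluster sum of squares.
    """
    n, dims = len(points), range(len(points[0]))

    def sum_sq(rows):
        # Sum of squared distances from the rows' own centroid.
        centroid = [sum(r[d] for r in rows) / len(rows) for d in dims]
        return sum((r[d] - centroid[d]) ** 2 for r in rows for d in dims)

    groups = {}
    for point, label in zip(points, labels):
        groups.setdefault(label, []).append(point)
    k = len(groups)
    total = sum_sq(points)                                   # SST
    within = sum(sum_sq(rows) for rows in groups.values())   # SSW
    between = total - within                                 # SSB
    pseudo_f = (between / (k - 1)) / (within / (n - k))
    return pseudo_f, between / total


# Toy data: two well-separated clusters (hypothetical values).
points = [[0, 0], [0, 2], [10, 0], [10, 2]]
pseudo_f, r_squared = cluster_fit_stats(points, [0, 0, 1, 1])
print(pseudo_f, round(r_squared, 4))
```

Evaluating such a function for each candidate number of clusters yields the kind of curves that are inspected for a peak or an elbow.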


To interpret identified clusters, we compared how writers 
falling into different clusters allocated their efforts during 
the writing process. Since the proportions of time spent on 
Text Production, Local Editing, Global Editing, and Long 
Pause were used as input variables in the cluster analysis, 
this comparison would be the most direct way to examine 
any distinct patterns exhibited by each cluster. It would 
also be informative to know whether clusters are associated 
with distinct patterns of writing proficiency or in cluster 
members’ demographic background. Therefore, to further 
substantiate the meaning of identified clusters, we compared 
the essay scores, essay length (in words), time on task (in 
minutes), the proportion of cluster members belonging to 
specific demographic categories, as well as a rough measure 
of writing efficiency calculated as essay length divided by 
time on task, between the clusters. 


3. RESULTS 
3.1 Outcome of Propensity Score Matching


The samples resulting from the propensity score matching 
are closely comparable between the two prompts (Table 2).
In the first (base) prompt, males and females were evenly 
distributed; 53% of the examinees self-identified as White, 
12% as Black, 14% as Hispanic; 4% reported that their high- 
est grade level was below Grade 9, 62% reported having some 
high school education; 16% were working part-time, 17% 
were working full time, 23% were unemployed at the time of 
the examination. The majority of the examinees, 94%, in- 
dicated English as their best communicative language. The 
demographic background distribution of the second prompt, 
after matching, was very similar to that for the base prompt. 
The matched samples for prompt 2 also showed comparable 
means of the subtest scores to those for the base prompt. 


3.2 Cluster Analysis Results 


To decide the optimal number of clusters, we first examined 
the dendrograms, which indicated a solution of 3 or 4 clusters
for both prompts. Figures 4 and 5 show the results
of the Pseudo-F statistic and semipartial R-square statistic
that were considered for model selection in prompt 2.
The results for prompt 1 are similar. The X-axis in both 
plots is the number of clusters ranging from 1 to 50. The 
Y axis is the Pseudo-F statistic on the left plot, R-squared 
on the middle plot, and semipartial R-square on the right 
plot. The results suggested a peak on the Pseudo-F statis- 
tic with a 3-cluster solution. The semipartial R-square plot 
shows an elbow point at cluster 3. The model R-squared 
(not shown) appears to go up continuously without a clear 
turning point. Considering all the evidence, we decided on 
three clusters as the most parsimonious and sensible solution 
for this study sample.


3.3 Comparing the Clusters
Given the multivariate nature of the clustering variables, we
drew radar charts to visualize the time distribution at the


Table 2: Comparability between Prompts after Propensity Score Matching

Proportion
Prompt       Female  White  Black  Hisp.  Below9  HS    PT    FT    Unemp.  Eng(Y)
1 (base)     0.50    0.53   0.12   0.14   0.04    0.62  0.16  0.17  0.23    0.94
2 (matched)  0.51    0.52   0.13   0.14   0.02    0.63  0.16  0.19  0.20    0.94

Mean Score (Scale: 0 - 20)
Prompt       Reading  Math   Science  SS
1 (base)     12.23    11.55  14.11    13.41
2 (matched)  12.19    11.73  13.97    13.55

Note: Below9: Below Grade 9; HS: some high school; PT: part-time employee; FT: full-time employee; Unemp.: unemployed; Eng(Y): English
as best communicative language; SS: Social Science


Figure 4: Pseudo-F statistic × N of Clusters


Figure 5: Semipartial R² × N of Clusters


ten consecutive time-points during the writing process of 
the logs belonging to each of the three clusters (Figures 6
and 7). The axes represent the average proportions of time
spent on a state at a certain time. For example, “TP_1”
refers to the proportion of time spent on Text Production
at the beginning of writing, that is, in the 1st of the ten duration
segments. It is clear from Figures 6 and 7 that the three
clusters demonstrated rather different polygonal shapes over 
all axes, which are consistent between the two prompts. 


Figure 6: Radar Chart 
(Prompt 1) 


Figure 7: Radar Chart 
(Prompt 2) 


Cluster 1 (blue colored, solid line in both prompts) writers 
have a distinct pattern with notable spikes on the TP state 
over the course of the writing process. Because the total writing
time is constrained and writers can only do one thing
at a time, if the writers spend more time on text produc- 
tion, they would necessarily spend less time on the other 
activities such as editing or revision. This constraint is ev- 
ident from the plots where, for Cluster 1, the proportion of 
time spent on long pauses and editing is relatively smaller. 
Cluster 2 (orange colored, dashed line) writers, on average,
have a much smoother circle compared to Cluster 1. Cluster
2 writers appeared to have distributed their efforts more
evenly throughout the writing process. The allocation of
time across the four writing states is relatively balanced over
the course of the writing session. In general, Cluster 1 seems
to represent a group of writers that compose linearly without
showing much editing behavior, while Cluster 2 seems
to represent writers that also consistently produce text but
still spend time on text planning and conduct text editing
and revision as they write. Cluster 3 (green colored, dotted
line) writers showed a time-management pattern distinct
from the other two clusters. The writers in Cluster
3 appeared to have difficulties in generating text at the start
of the writing session, as evidenced by the lack of text production
(TP_1) and a higher proportion of editing
behaviors (GE_1 and LE_1) in the first time period, which
suggested possible false starts during the writing process.
This “struggling” pattern appeared to have persisted into
later stages of the writing process, as evidenced by a higher
proportion of time spent on long pauses compared to text
production or editing.


We also examined the actual length of time writers stayed 
in a state before transitioning to a different state. The re- 
sults suggest that the three clusters not only differ in their 
relative time distributions during writing, but also in the 
total time writers stayed in different states. Each cluster 
displayed distinct patterns consistent across prompts: Clus- 
ter 1 writers spent considerably less time on long pauses 
than Clusters 2 and 3 writers; Cluster 2 writers spent no- 
tably larger amounts of time making word-level local edits 
than Clusters 1 and 3 writers by approximately 1 minute 
in Prompt 1 and 2 minutes in Prompt 2; and Cluster 3
writers generally spent less time on text production than 
Clusters 1 and 2 writers by about 1-2 minutes in Prompt 
1 and about 3.5 minutes in Prompt 2. An additional inter- 
esting difference between Clusters 1 and 2 is that Cluster 2 
writers not only spent longer time on local editing, but also 
on global editing by about 3 minutes in both prompts. A 
close comparison between Clusters 2 and 3 further revealed 
that Cluster 2 writers also appeared to spend more time 
on long pauses than Cluster 3 writers by a small margin in 
Prompt 1 and by a rather large margin in Prompt 2. 


Taking into account both absolute time spent in each state, 
and relative time in each state, Cluster 2 writers appear 
to have shown a stable-tempo, iterative process pattern in
which they switch repeatedly between the activities of text 
planning, text production, and text editing over the course 
of their writing sessions. Although more data are needed to 
verify this interpretation, the long pauses demonstrated by 
Cluster 3 writers throughout the writing session appeared to 
be signals of hesitation and difficulties in content generation. 
Finally, Cluster 1 writers seem to be relatively quick and flu- 
ent at generating ideas (as evidenced by fewer long pauses) 
and at translation and transcription (that is, at expressing 
their ideas in written form). 


To better understand and interpret the identified clusters, 
Table 3 further compares the characteristics of student essays
across the three clusters. Cluster 1 writers spent the
shortest time on the writing task on average (22.94 minutes


Table 3: Proficiency and Demographic Distributions of Clusters

                             Prompt 1                                     Prompt 2
Variable                     Cluster 1      Cluster 2      Cluster 3      Cluster 1      Cluster 2      Cluster 3
n                            65             264            168            169            259            71
Essay Score (scale: 2 - 12)  5.51 (1.70)    5.64 (1.54)    5.43 (1.53)    5.38 (1.59)    5.42 (1.74)    4.51 (1.80)
Essay Length (in words)      305 (136)      324 (145)      289 (128)      287 (129)      290 (127)      236 (118)
Time on Task (in minutes)    22.94 (11.56)  36.60 (15.65)  32.36 (14.16)  25.45 (12.39)  36.04 (14.86)  26.88 (14.37)
Efficiency (words/min)       15.02 (6.19)   9.94 (4.43)    9.96 (4.09)    12.85 (5.71)   8.97 (3.98)    10.82 (6.41)
Proportion
  Female                     0.52           0.56           0.43           0.49           0.54           0.39
  White                      0.62           0.51           0.50           0.58           0.51           0.46
  Black                      0.12           0.14           0.13           0.12           0.12           0.13
  Hispanic                   0.06           0.13           0.18           0.09           0.17           0.13
  Below Grade 9              0.03           0.02           0.02           0.05           0.03           0.03
  Some high school           0.54           0.63           0.64           0.60           0.62           0.68
  Part-time                  0.15           0.13           0.20           0.17           0.16           0.14
  Full-time                  0.12           0.19           0.22           0.17           0.17           0.20
  Unemployed                 0.20           0.21           0.18           0.21           0.24           0.28
  English as Best (Y)        0.96           0.94           0.96           0.96           0.92           0.94

Note: Values in parentheses are standard deviations. Values in the lower half of the table are the proportions of various subgroups within a cluster.


in Prompt 1; 25.45 minutes in Prompt 2). The difference in 
the total time on task between Clusters 1 and 2 is drastic — 
about 14 minutes difference in Prompt 1 and about 10 min- 
utes difference in Prompt 2. Cluster 1 writers wrote notably 
more words/minute than Cluster 2 writers (15.02 vs. 9.94 in
Prompt 1; 12.85 vs. 8.97 in Prompt 2). The overall evidence
seems to suggest Cluster 1 writers were more efficient than
Cluster 2 writers, in that they spent considerably less time
writing, yet achieved comparable text quality (essay scores). 


In relation to demographic background, several results are 
noteworthy. On both prompts, Cluster 1 contained a no- 
tably greater proportion of White writers, a lower propor- 
tion of Hispanic writers, and a lower proportion of exam- 
inees with high-school experience, compared to the overall 
demographic distribution in Table 2. Cluster 2 included a
slightly greater proportion of female writers than the aver- 
age on both prompts, while all other demographic variables 
fell close to the mean for each prompt. Finally, Cluster 
3 had a considerably lower proportion of White or female 
writers. But the results in general are less consistent be- 
tween the two prompts for Cluster 3. The evidence seems to 
suggest that writers from different demographic backgrounds
and having different educational experience may display distinctive
patterns in their writing processes. However, without
further evidence, we cannot infer any causal connection
between demographic group membership, writing process 
patterns, and overall writing performance. 


4. DISCUSSION 


In this paper, we presented a study on the use of keystroke 
analytics to understand writers’ cognitive processes during 
writing. One possible outcome of this study, and of the 
larger research program of which it is a part, would be 
to provide actionable writing feedback to instructors and
learners. However, before we can reach this goal, we need 
to develop a clear understanding of how writing processes 
change as a result of learning and instruction. This study 
provides a first attempt to address this issue, by identify- 
ing characteristic longitudinal patterns of time management 
that writers display when they respond to an essay writ- 
ing task. The current results suggest that there are at least 
three distinct writing profiles that describe how writers approach
an on-demand essay writing task. However, it will be
critical to replicate the analysis with more data. In
the future, for writers placed into different profiles, we can
imagine giving customized suggestions on writing strategies 
to improve learning and practice. Obviously, more research 
is needed to further validate the meaning and interpreta- 
tion of the profiles we have detected. Possible approaches 
include cognitive interviews to elicit writers’ understanding 
of what they were doing, combining eye-tracking techniques
with keystroke logging to get a better sense of where the
writer’s attention was focused and how changes in the focus
of attention interact with pause patterns, and convening an
expert panel to determine whether the clusters derived by
statistical analysis align with expert judgments about what
the writers were doing at each point in the writing process. 
The availability of keystroke logs makes it possible to replay 
a writer’s composition process like a movie. Such replays 
can then be presented as stimuli to assist and guide cognitive
interviews or the expert review process. Statistical
tests will also be necessary to determine whether the differences
between profiles are statistically significant, in addition to being
of practical importance. It will
also be essential to replicate our analyses on a wide range 
of writing prompts, a broader variety of writing tasks, and 
across many different writer populations, as well as to study 
how well findings resulting from timed-writing tasks can be 
generalized to writing tasks with no time restriction. 


The association between cluster assignment and demographic 
background is worth further investigation. The study of 
writing process ought to integrate with the social and lin- 
guistic context [21]. Previous studies have reported sub- 
group differences in writing processes (e.g., between native 
and non-native speakers in [24], between male and female 
writers in , between black and white students in [7]). It is 
conceivably helpful and valuable to use information about
writing profiles to provide customized feedback to writers
from different linguistic, social, and educational backgrounds. 


Finally, although this is beyond the scope of the current 
study, it is worth mentioning that keystroke-enabled process
visualizations such as those illustrated in Figures 2 and
3 may, in and of themselves, have instructional value, by making
it easier for students to understand and reflect on their
writing processes. For instance, teachers may select replays
and graphs to demonstrate a specific writing subprocess or 
a writing strategy, to help the class understand how to im- 
plement a more effective writing process. Teachers might 
also be able to use such graphs during their one-on-one conferences
with their students to help them better understand
their writing strengths and weaknesses (such as lack of editing,
low-level engagement, or lack of sufficient attention to
idea generation) that are revealed by the keystroke log. Future
research is encouraged to gather teacher and student
feedback on the assistive value of keystroke logs in their
teaching and learning experience.


ABSTRACT


Predicting academic performance using trace data from learn- 
ing management systems is a primary research topic in edu- 
cational data mining. An important application is the iden- 
tification of students at risk of failing the course or dropping 
out. However, most approaches utilise past grades, which 
are not always available and capture little of the student’s 
learning strategy. The end-to-end models we implement pre- 
dict whether a student will pass a course using only naviga- 
tional patterns in a multimedia system, with the advantage 
of not requiring past grades. We experiment on a dataset 
containing coarse-grained action logs of more than 100,000 
students participating in hundreds of short courses. We propose
two approaches to improve the performance: a novel
encoding scheme for trace data, which reflects the course 
structure while remaining flexible enough to accommodate 
previously unseen courses, and unsupervised embeddings ob- 
tained with an autoencoder. To provide insight into model 
behaviour, we incorporate an attention mechanism. Clus- 
tering the vector representations of student behaviour pro- 
duced by the proposed methods shows that distinct learning 
strategies specific to low- and high-achievers are extracted.


Keywords 

learning strategies, academic performance prediction, navigational
patterns, MOOC, LSTM, autoencoder, learning management
systems


1. INTRODUCTION 


A large amount of trace data about learner behaviour has 
recently become available from online learning environments 
[30]. Hence, it is now possible to improve the delivery, assess- 
ment and intervention quality using data mining techniques, 
giving rise to technology-enhanced learning. Educators are 
especially interested in receiving early alerts when students 
are at risk of failing the course or dropping out. With these 
alerts, timely intervention can be organised. To estimate 
this risk, machine learning classification models are built to
predict student performance based on their interaction with
the course content. There are several ways to define student 
performance: it can be a binary or a multi-level grade on 
either the next exercise or the entire course. This study fo- 
cuses on a binary grade for the given course (fail or pass), 
a common scenario in student performance prediction [15,
28, 27]. 


Traditionally, the primary source of information for this task 
is past academic performance records [31]. They might in- 
clude the grades for the previously taken courses or interme- 
diate test scores. However, apart from those, online learn- 
ing platforms also provide information about other types 
of interaction between students and the content. Depend- 
ing on the technical implementation and medium, it can be 
fine-grained click-stream video data, the text of discussion 
forum messages, or more coarse-grained information such 
as whether a person liked a video or performed a search 
query. Our models operate on such coarse-grained sequen- 
tial records of interacting with different content types on 
the online learning platform and do not rely on previous
grades. The motivation for this is three-fold. First, the past
scores might not be available, or it can be time- and effort- 
consuming to provide them. For example, grading an es- 
say usually requires a specialist, which is not scalable for 
MOOCs. Second, by focusing on interaction with the con- 
tent, such as video views, we obtain the representation of 
student behaviour that is more likely to capture their learn- 
ing strategies. In other words, we can discover whether they 
prefer a specific medium or actively interact with other stu- 
dents in online discussions. Third, if we work with naviga- 
tion patterns, the resulting model has the potential to inform 
a recommendation system that would nudge a struggling 
student in the right direction. For instance, when students 
explore the platform, we can automatically recommend the 
next learning item to interact with (e.g., a video or a reading 
material they might find useful or interesting). 


This study applies recurrent neural networks for academic 
performance prediction in short online courses. The ap- 
proach we describe works without manual feature engineer- 
ing or information about previous scores. In contrast to most 
works in the area of course grade prediction, our models op- 
erate instead on raw sequences of multiple-type interactions 
with the content (video view/like, discussion message, search 
request, exercise attempt). We propose several encoding 
schemes for action logs and demonstrate that they can increase
classification performance. Among them, we introduce
a flexible scheme that reflects relative progress within
the course while being independent of its length. We also 
use autoencoders to extract student representation in an un- 
supervised way. Vector representations of student behaviour 
produced by different methods are clustered to compare how 
well they indicate (un)successful learning strategies. In or- 
der to illuminate the inner workings of the model, we pro- 
vide cluster visualisations and experiment with an attention 
mechanism. 


Thus, our research questions are the following. RQ1: Can 
we capture the information about course structure in the 
action sequences in a flexible way (that can be extended dy- 
namically when new exercises or longer courses are added)? 
RQ2: Can we improve predictive performance by pretrain- 
ing unsupervised embeddings (using autoencoder models)? 
RQ3: Can we improve predictive performance further by 
adding attention to an autoencoder? 


2. RELATED WORK 


While working with educational trace data, there are vari- 
ous ways to extract features from the raw sequences. The 
easiest and perhaps the most popular approach is to count 
the number of actions — despite its simplicity, it is a solid 
baseline, used even in the most recent studies [27], [28], [21]. 


Unfortunately, by aggregating the data in such a way, we 
ignore the time delay between actions and the information 
about their order. A common way to engineer additional, 
more time-sensitive features from trace data is to represent 
student behaviour in chunks, called studying sessions. For 
instance, we can define the sequence of actions as a session 
if less than 15 minutes passed between actions. The thresh- 
old is usually determined heuristically [20]. However, the 
need to choose this threshold manually is a clear drawback. 
Moreover, additional manual feature engineering is often re- 
quired to aggregate features over sessions (e.g., experts have 
to define what qualifies as session intensity). 


Furthermore, even though the described measures are sound 
and easy to use, a high volume of activity does not necessarily
indicate high-quality learning. Nor does it allow
actionable insight beyond relatively trivial advice to spend
more time in the system. As justly noted by the learn- 
ing analytics community, it is the specific learning strategies 
adopted by individual students that are important [12]. 


Therefore, various approaches based on deep learning were 
proposed to overcome these limitations. The sequential nature
of the data lends itself well to the use of recurrent neural
networks (RNNs) [37]. The pioneering Deep Knowledge
Tracing (DKT) model applied RNNs and their variations to
the history of students’ answers to predict whether they will
answer the next exercise correctly [26]. One of the benefits
of such neural architectures is that they can complement
the educational theories proposed by human experts with
insights obtained in a bottom-up, data-driven way [23].
Another benefit is that end-to-end models adapt to new domains
more easily and are more cost-efficient.


The success of DKT led to a new strand of research. Aiming
to get rid of its simplifying assumptions, such as disregard for
skill interaction or exercise text, researchers developed more
advanced neural models for knowledge tracing [8].


In contrast to many grade prediction approaches such as 
[18], our approach does not require knowing previous aca- 
demic performance or past grades. Instead, we investigate 
whether a binary course grade can be predicted from the 
trace data alone, with no intermediate exercise scores. Thus, 
an advantage of our approach is that we can detect low- and 
high-achievers without assessing the correctness of students’ 
answers. The benefit is especially important for courses that 
include open-answer questions since those usually require 
costly human experts to be graded reliably (for example, 
essays in humanities subjects). 


Our models work on raw sequences of actions without aggregating
them into count or session variables. Recent studies on using
RNNs on click-stream data [22], [17], [7], [16] and [18] are 
conceptually close to our approach in terms of using RNNs 
to work with sequences of actions. However, all of them 
but [18] operate only on video interaction and exercise an- 
swer features, whereas we also include search queries and 
discussion messages. [18] does not explore the autoencoders 
or attention and does not investigate the extracted student 
representation. Moreover, in most of them, aggregation still 
happens: [17] uses cumulative counts, [16] uses weekly snapshots,
and in [22] interaction features are binary per item,
while we feed the raw sequences as an input to predictive 
models. Besides, their task is to predict the next exercise 
response while we predict the overall course success. 


Concerning the use of deep learning for unsupervised feature 
extraction, only a few recent publications have explored it 
[7]. For example, autoencoders, a popular approach in nat- 
ural language processing [35], have only recently entered the 
educational data mining field [36]. Motivated by this, we ex- 
tend recent end-to-end approaches to feature engineering on 
trace data by using autoencoders (including an attentional 
one) to embed student behaviour. 


Regarding the character of the trace data used, we operate 
on short courses which contain multimedia content (such as 
videos and discussion messages; details are provided in sec- 
tion 3). The dataset also addresses the variability across a 
range of subjects [24]. Moreover, it allows us to showcase 
the methods in a real-life scenario, as the data is collected 
from a commercial educational platform. It should be noted 
that only coarse-grained trace data is available, i.e. there 
are no details of interaction with videos, such as replays. To 
enrich the data representation without changes to the orig- 
inal platform, we introduce several encoding schemes that 
capture the relative progress within the course. 


Despite their promising results, the state-of-the-art deep 
learning methods are black-box models, which renders the 
interpretation of the prediction making process and incor- 
poration of domain knowledge far from straightforward [10]. 
Such lack of interpretability can seriously hinder the adop- 
tion of otherwise efficient models in decision-critical domains 
such as education, as stakeholders cannot control or assess 
the fairness of the process. In order to illuminate how the 
model makes a decision, we can cluster the produced student 
behaviour representations or investigate attention heatmaps. 
In spite of the apparent success of attention mechanisms in
natural language processing [2], few researchers have utilised
them for academic performance prediction so far [25].


Table 1: An example data entry for student behaviour in a
course.

user_id                77
course_id              60
avg_session_duration   584
num_sessions           5
avg_session_intensity  2.6
session_frequency      0.24
actions                [(‘video_view’, ‘2015-11-13 15:33’), ...]
views                  4
searches               0
messages               0
video likes            3


While our primary focus is on predicting academic perfor- 
mance, it is also necessary to provide an insight into the 
learning strategies of students [12]. While this concept is 
reminiscent of learning styles [9], it avoids their heavily crit- 
icised assumptions [29] by performing analysis bottom-up 
based on raw data, instead of fitting the students in a rigid 
framework. One way to investigate different learning strate- 
gies is to cluster students based on their trace data. The 
resulting clusters implicitly classify students according to 
how they receive and process information or whether they 
are high- versus low-achievers. The obtained cluster assign- 
ment allows us to improve personalisation mechanisms since 
we can now use the student’s preferred mode of interaction 
(for example, adjust the proportion of videos versus readings 
based on how much of a visual learner the student is). Pre- 
vious research in this field relied on Hidden Markov Models, 
pattern mining, or Levenshtein distance between sequences 
[6, 13]; on numerical features, Partitioning Around Medoids 
[11] or k-Means are used. This study clusters the embed- 
dings produced by predictive models and autoencoders. We 
expect students with similar learning strategies to appear in 
the same clusters. By aligning these clusters with academic 
performance scores, we could distinguish strategies typical 
for low- and high- achievers. 


3. DATA & FEATURES 
3.1 Data 


The dataset for this study consists of student trace data ex- 
tracted from an online educational platform for secondary 
school students (12-18 years old), with a focus on mathe- 
matics and Dutch language courses. The dataset contains 
interaction logs for 44 333 students and 467 short courses 
(177 873 student-course tuples) from 20 September 2012 to
8 August 2020. Each course includes several lessons with 
associated videos, discussion threads and exercises (mostly 
multiple choice). 


The target variable is the score that the student obtains for 
the course. We converted the original score (from 0 to 100) 
into a binary variable (I(score > 50)), as we are interested 
in patterns corresponding to general success or failure in the 
course. Thus, for a given student-course tuple, we need to 
predict 0 if the student is likely to fail the course and 1 
otherwise. 


For this binary classification case, there is a noticeable class 
imbalance: there are 124 248 (70%) instances in class 1 and 
53 625 (30%) in class 0. It is also important to note that 
the courses on the platform are rather short compared to 
most datasets in the field: the median number of actions 
per student-course tuple is 8, and the median duration of 
a course is approximately 14 minutes. In comparison, [7] 
use information about 44 920 students participating in a 4- 
month course focused on a single subject, with 20 interaction 
features. 


3.2 Features 

We distinguish three ways to engineer features (see Table 1 
for the example data entry). Count and session features are 
traditional predictors. Sequences of actions are also often 
aggregated per timestep (e.g. a chapter in the course) in- 
stead of being used as-is. In contrast, we use the sequences 
in their raw, original format as input for neural models. Such 
a general data format can be applied in any online course 
with minimal technical requirements for tracking student 
behaviour, which benefits smaller learning platforms. More 
precisely, the features we use are as follows: 


1. counts of actions $X_c$. For example, the student with id
77 watched four videos in a course with id 60. These 
features include the number of: video views, video 
likes, messages posted in forums, search queries, ques- 
tions attempted (without making a distinction between 
correct or incorrect answer). 


2. manually engineered session features $X_s$. We experimented
with multiple threshold timeout values (as we
did not have access to login and logout timestamps 
for students), settling on 15 minutes. We then ag- 
gregated statistics about individual studying sessions, 
resulting in 4 features: average session duration in sec- 
onds, number of sessions, session frequency (the ratio 
of number of sessions to the course duration in hours) 
and session intensity (average number of actions per 
session). This is a typical set of features engineered in 
similar studies [32], which we use as a benchmark. 


3. raw action sequences $X_a$. There are five possible actions:
a video view, a video like, a discussion forum
message post, a search request for a term, and an at- 
tempt to answer an exercise (without the indicator of 
whether the answer was correct). 
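
The count and session features from items 1 and 2 can be derived from a timestamped action log roughly as follows. This is a minimal sketch rather than the authors' code: the function name `session_features`, the layout of `actions` (modelled on the Table 1 entry), and the definition of course duration as the span between the first and last action are all assumptions.

```python
from collections import Counter
from datetime import datetime, timedelta

TIMEOUT = timedelta(minutes=15)  # session threshold reported in the text


def session_features(actions, timeout=TIMEOUT):
    """Compute count and session features for one student-course tuple.

    `actions` is a non-empty list of (action_type, timestamp_string)
    pairs; this layout is assumed, not taken from the platform.
    """
    if not actions:
        raise ValueError("at least one action is required")
    events = sorted(
        (datetime.strptime(ts, "%Y-%m-%d %H:%M"), kind) for kind, ts in actions
    )
    counts = Counter(kind for _, kind in events)

    # Start a new session whenever the gap to the previous action
    # reaches the timeout.
    sessions = [[events[0][0]]]
    for (prev, _), (cur, _) in zip(events, events[1:]):
        if cur - prev >= timeout:
            sessions.append([cur])
        else:
            sessions[-1].append(cur)

    durations = [(s[-1] - s[0]).total_seconds() for s in sessions]
    # Assumption: course duration is the span from first to last action.
    course_hours = (events[-1][0] - events[0][0]).total_seconds() / 3600
    return {
        "counts": dict(counts),
        "num_sessions": len(sessions),
        "avg_session_duration": sum(durations) / len(sessions),
        "avg_session_intensity": len(events) / len(sessions),
        "session_frequency": len(sessions) / course_hours if course_hours else 0.0,
    }
```

Note that the timeout itself remains a hand-picked value, which is the drawback of session features discussed in section 2.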


Due to the short duration of courses, encoding with just five 
types of actions without any additional information leads 
to low variability in data: out of 177 873 sequences, only 
10 058 are unique. As a consequence, the same behaviour 
pattern could correspond to both passing and failing the 
course. To overcome this issue, we consider several ways to 
encode additional information in elements of the sequence. 
We list them by their generalisability, from least to most. 


Concatenating global content item id and the corresponding 
actions. For instance, for a video with id 12, we would en- 
code the action as “video_view_12” (we give examples for 
videos, but the idea transfers directly to other content items 
as well). We can also encode items associated with the video:
answering the exercise about this video would be encoded
as “exercise_answer_12”. The downside is that both these 
approaches do not generalise to new courses: we have to 
assume that the number of either courses or content items 
is fixed. Otherwise, a model needs to be retrained. It is 
a common problem also found in such popular approaches 
as DKT [33]. Besides, this approach quickly leads to an 
inflated vocabulary, making it computationally inefficient if 
one would use the bag-of-words approach (hence it is not 
featured in Table 2 for machine learning models, only for 
recurrent networks). 


Encoding actions using local content item ids, preserving their 
relative order in the course. In other words, if the video 
with id 12 is the first video in a given course, then we en- 
code the action as “video_view_1”. This scheme has the po- 
tential for interesting insights into student behaviour. For 
example, we can see whether watching the videos in order
contributes substantially to better performance. Besides,
we can gauge the engagement and background knowledge 
by checking whether the student immediately watched later 
videos. Regarding the disadvantages, the applicability to 
new courses is still limited: the number of videos should be 
the same or less than in the courses we have seen before. 
Using an out-of-vocabulary token or setting the maximum
possible number of videos high is a simple workaround.
However, a more flexible solution might be required, for
which we suggest the progress percentage encoding.


Encoding the rounded percentage of the total number of videos 
in the course. For instance, if there are five videos in to- 
tal, then the view of the second one will be represented as 
“video_view_40” (40%). Even though we can no longer focus 
on the exact content item, as is the case with the other encoding
schemes, we can still potentially recommend the section
of the course to revise. This is the most flexible way, as it 
scales to new courses (of arbitrary length, with previously 
unseen exercises) as well. 


We provide experimental results using the four proposed encoding
schemes (raw $X_a$, global content id $X_{a-gid}$, local
content id $X_{a-lid}$ and progress percentage id $X_{a-pid}$,
respectively) in section 5. To each of the above schemes, we
can also add the difference in seconds with the previous ac- 
tion if the timestamps are available. 
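
The encoding schemes described above can be sketched as a single tokenising function. The function name `encode_action`, the short scheme labels, and the `course_items` argument (the ordered ids of same-type content items in the course) are assumptions made for illustration; the platform's actual schema is not given here.

```python
def encode_action(kind, item_id, course_items, scheme="pid"):
    """Encode one action as a token under one of the described schemes.

    `course_items` is the ordered list of content item ids of the same
    type in the course (an assumed input).
    """
    if scheme == "raw":  # action type only
        return kind
    if scheme == "gid":  # global content item id
        return f"{kind}_{item_id}"
    position = course_items.index(item_id) + 1  # 1-based order in the course
    if scheme == "lid":  # local id: relative order within the course
        return f"{kind}_{position}"
    if scheme == "pid":  # rounded percentage of progress through the items
        # Python's round() rounds exact halves to the nearest even integer.
        return f"{kind}_{round(100 * position / len(course_items))}"
    raise ValueError(f"unknown scheme: {scheme}")


videos = [12, 30, 41, 55, 73]  # hypothetical course with five videos
print(encode_action("video_view", 30, videos, "pid"))  # video_view_40
```

Only the percentage scheme keeps the vocabulary bounded regardless of course length, since the suffix always lies between 0 and 100.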


4. MODELS 


We trained three types of classification models on numerical
count features $X_c$ and session features $X_s$: Logistic
Regression, Decision Tree and Random Forest. For sequence
data $X_a$, we used two popular variations of RNN: Long
Short-Term Memory (LSTM) [14] and Gated Recurrent Unit
(GRU) [4]. We also experimented with convolutional neural
networks, but their performance was lower than that of
recurrent ones. As such, for the sake of brevity, we do not 
focus on them in this study. 


On a general level, neural classification models embed action
sequences and pass them through recurrent layers to
the final feed-forward layer(s) with sigmoid activation. The
network produces a probability of success, which is then converted
into a class prediction. A recurrent neural network
takes a sequence of vectors $\{x_t\}_{t=1}^{T}$ ($T$ is the number
of timesteps) as an input and maps them to an output sequence
$\{y_t\}_{t=1}^{T}$ by calculating hidden states $\{h_t\}_{t=1}^{T}$, which
encode past information that is relevant for future predictions.
LSTM uses a more elaborate structure in each of the
repeating cells, allowing it to learn long-term dependencies.
It includes so-called forget, input and output gates, which
control what information to retain and pass to the next step.
GRU simplifies the cell by combining the forget and input
gates into a single update gate and merging the cell state and
hidden state.
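
For reference, one common formulation of the GRU update is given below. The weight names $W$, $U$ and $b$ are generic notation rather than the authors', $\sigma$ is the logistic sigmoid and $\odot$ the element-wise product. Sign conventions for the update gate $z_t$ differ between papers and libraries, so this should be read as one variant among several.

```latex
\begin{aligned}
z_t &= \sigma(W_z x_t + U_z h_{t-1} + b_z) \\
r_t &= \sigma(W_r x_t + U_r h_{t-1} + b_r) \\
\tilde{h}_t &= \tanh\bigl(W_h x_t + U_h (r_t \odot h_{t-1}) + b_h\bigr) \\
h_t &= (1 - z_t) \odot h_{t-1} + z_t \odot \tilde{h}_t
\end{aligned}
```

The single gate $z_t$ plays the role of both the forget and the input gate, and there is no separate cell state: $h_t$ is the only quantity carried to the next timestep.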


An encoder-decoder framework uses an RNN to read a sequence
of vectors $\{x_t\}_{t=1}^{T}$ into a context vector $c$ in
an auto-regressive fashion. If we set the desired output sequence
equal to the input one, so that the goal becomes
the reconstruction of the original data, we obtain an autoencoder.
Then, the context vector, if its dimensionality
is chosen to be lower than that of the input, will contain
a denoised representation of the data that can be used in
other models. This way, an autoencoder allows us to learn
an efficient data representation in an unsupervised manner.


The Bahdanau attention mechanism allows the network to focus
on certain parts of the input [2]. It is achieved by computing
the context vector as a weighted sum of vectors produced by
the encoder. The weights are learnt by a feed-forward neural
network jointly with the rest of the model. The Transformer
model is a recent competitive alternative to the encoder-decoder
framework [34] which forgoes the recurrent cells in
favour of stacked self-attention and feed-forward layers.
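
The weighted-sum step of additive attention can be illustrated with a few lines of plain Python. This is a simplified sketch, not the model used in the experiments: the scoring parameters `W` and `v` are supplied by hand instead of being learnt, and the decoder-state term of the full Bahdanau mechanism is omitted.

```python
import math


def additive_attention(encoder_states, W, v):
    """Return (context, weights) for a simplified additive attention.

    encoder_states: list of T hidden vectors (lists of floats).
    W (list of rows) and v (vector) stand in for the learnt
    feed-forward scoring network; they are hypothetical inputs here.
    """
    def score(h):
        hidden = [math.tanh(sum(w * x for w, x in zip(row, h))) for row in W]
        return sum(vi * hi for vi, hi in zip(v, hidden))

    scores = [score(h) for h in encoder_states]
    top = max(scores)  # subtract the maximum for numerical stability
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    weights = [e / total for e in exps]  # softmax over timesteps

    dim = len(encoder_states[0])
    context = [
        sum(w * h[i] for w, h in zip(weights, encoder_states)) for i in range(dim)
    ]
    return context, weights
```

The returned `weights` are the per-timestep scores that an attention heatmap displays: one value per action in the sequence, summing to one.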


5. EXPERIMENTS 


We apply machine learning models on count and session 
features and compare them with deep learning ones on se- 
quences of actions (with different encoding schemes used, as 
outlined above). Besides, we embed action sequences using 
an LSTM autoencoder and use its output $X_{auto}$ as input for
the classification models. 


Neural network models were implemented using Keras [5]
with a TensorFlow backend [1] and machine learning ones with
scikit-learn [3]. The parameters were optimised using grid
search with stratified 10-fold cross-validation (5-fold for neural
models). For recurrent neural networks, we checked
the following parameter values: 32/64/128 recurrent units, 
32/64/128 hidden units in feed-forward layers, 32/64/128 
embedding dimensions. The maximum sequence length was 
set to 50. The models were trained with binary cross-entropy 
loss, using the Adam optimiser [19], early stopping and 
learning rate reduction on a plateau. 
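
Before token sequences reach an embedding layer, they have to be mapped to integer indices and brought to a fixed length. The sketch below shows one way to do this with the maximum length of 50 mentioned above. The reserved indices, the choice to keep the first 50 actions, and padding at the end are assumptions, as the text does not specify them.

```python
MAX_LEN = 50     # maximum sequence length reported in the text
PAD, OOV = 0, 1  # reserved indices (an assumed convention)


def build_vocab(sequences):
    """Map each token seen in the training data to an index >= 2."""
    vocab = {}
    for seq in sequences:
        for token in seq:
            vocab.setdefault(token, len(vocab) + 2)
    return vocab


def to_model_input(seq, vocab, max_len=MAX_LEN):
    """Index tokens, keep the first max_len, and pad the rest with PAD."""
    ids = [vocab.get(token, OOV) for token in seq[:max_len]]
    return ids + [PAD] * (max_len - len(ids))
```

Tokens that were not seen during training fall back to the out-of-vocabulary index, which is the workaround mentioned for the local content id scheme in section 3.2.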


5.1 Classification 

For the binary classification task, the cross-validation ROC 
AUC scores are presented in Table 2. Inspecting the table, 
we can conclude that end-to-end models perform at least 
as well as the ones using manually engineered features, even 
when the length and variability of action sequences are limited.
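
ROC AUC can be read as the probability that a randomly chosen passing student receives a higher predicted score than a randomly chosen failing one, which makes it independent of any classification threshold. A direct, quadratic-time sketch of this definition follows; the function name `roc_auc` is illustrative, and library implementations should be preferred for real evaluations.

```python
def roc_auc(y_true, y_score):
    """ROC AUC as the probability that a random positive outranks a
    random negative, counting ties as one half."""
    pos = [s for y, s in zip(y_true, y_score) if y == 1]
    neg = [s for y, s in zip(y_true, y_score) if y == 0]
    if not pos or not neg:
        raise ValueError("both classes must be present")
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


print(roc_auc([0, 0, 1, 1], [0.1, 0.4, 0.35, 0.8]))  # 0.75
```

A model that assigns the same score to every student obtains 0.5 under this definition, so the values in Table 2 can be compared against that baseline.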


Concerning RQ1, the contribution of different encoding schemes,
we can see that global content id encoding performs the best.
However, as mentioned above, it does not scale to new content
items. Using the percentage encoding scheme, which does scale,
also gives a steady increase in ROC AUC score: from 0.73 to
0.81 for Random Forest, where we use count features, and
similarly for recurrent networks, where we use sequences.
Using unsupervised embeddings obtained with an autoencoder
(the focus of RQ2), we gain an improvement as well:
from 0.73 to 0.82 for both machine and deep learning models.


Table 2: Cross-validation ROC AUC scores for classification models. Input features: count features $X_c$, session features $X_s$,
action sequences $X_a$ with encoding scheme variations (no id $X_a$, global content id $X_{a-gid}$, local content id $X_{a-lid}$, progress
percentage id $X_{a-pid}$), action embeddings by an LSTM autoencoder $X_{auto}$, action embeddings by an attentive (Bahdanau)
LSTM autoencoder $X_{auto_B}$.

                     $X_c$  $X_s$        $X_{a-lid}$  $X_{a-pid}$  $X_{auto}$  $X_{auto_B}$
Logistic Regression  0.62   0.62         0.69         0.67         0.77        0.76
Random Forest        0.73   0.81         0.81         0.81         0.82        0.81
Decision Tree        0.73   0.80         0.79         0.80         0.80        0.80

                     $X_a$  $X_{a-gid}$  $X_{a-lid}$  $X_{a-pid}$  $X_{auto}$  $X_{auto_B}$
LSTM                 0.73   0.88         0.82         0.81         0.82        0.83
GRU                  0.73   0.87         0.81         0.81         0.83        0.82


Figure 1: Distribution of target variable (pass/fail score) over the K-Means clusters based on different data representations
(with Euclidean distance; K of 3, 5, 10 and 20 were tried). LSTM embeddings distinguish better between high- and low-achievers,
producing clusters that clearly correspond to one class more than the other, while for traditional count (a) and
session (b) data, most of the clusters contain a mix of both classes. For LSTM embeddings on action data (c), clusters appear
that contain more failing students than passing and are thus more useful for early warning systems. The effect is even more
pronounced if we use progress percentage encoding (d).


Contrary to our expectations for RQ3, adding attention did 
not significantly improve the results. We experimented with 
including an attention mechanism directly into the classifi- 
cation models, both Bahdanau and Transformer. It can be 
done, for example, by training a simple Transformer, then 
using the average of its encoder’s hidden states as an input 
to a classification model on top. Unfortunately, those mod- 
ifications did not influence performance in our experiments, 
in contrast to [25]. We hypothesise that the features might 
be too coarse-grained and the sequences too short to take 
full advantage of the technique; there are also no skill labels 
that would provide the hierarchical structure that attention 
mechanisms can reflect. However, using attention allows us 
to produce visualisations that are a first step towards understanding
the decision-making process of the neural network. Thus, 
we provide the attention scores from a Bahdanau-attention 
classification model below for illustration purposes. 


To gain more actionable insight from our models, we also 
investigated the predictive performance on partial action se- 
quences. Being able to predict early that a student will fail 
the course would allow sending a timely alert to the educa- 
tor, signalling the need for intervention. Hence, we explored 
how the performance of the models changes when only the 
first N actions are available. For these experiments, to en- 
sure that the model does not have full information, we only 
used courses with more than five questions and more than 
eight actions. As depicted in Figure 2, sequential data lead 
to higher scores than count features, and the proposed encoding
schemes outperform raw action sequences.


5.2 Clustering & Visualisation 

We clustered the student-course tuples based on different 
data representations using k-Means and plotted the distri- 
bution of the target class over these clusters (see Figure 1d) 
to investigate how the models distinguish between learning 
strategies of low- and high-achievers. For traditional count 
and session data, most of the clusters are not easily inter- 
pretable, as they contain a mix of both classes, roughly fol- 
lowing the target label distribution. For instance, when us- 
ing count features, almost all student-course pairs are in just 
three clusters, so there is little distinction between learning 
strategies. It should be noted that even though some clus- 
ters appear empty, in fact they still contain a very small 
number of students. 


When we increase the representation’s complexity, the ex- 
tracted groups are more distinct. The distribution of high- 
and low-achievers in them shifts so that clusters with the 
prevalence of a single class emerge. The improvement is 
even more noticeable with progress percentage encoding: 
more clusters are extracted where one class is prevalent. For 
LSTM embeddings on action data, clusters appear that contain
more failing students than passing, signalling that this
is an unsuccessful strategy (clusters 4 and 6 in Figure 1). This
information is vital for early warning systems. The effect
is even more pronounced if we use the progress percentage
encoding, which encourages the application of these schemes
for distinguishing between successful and unsuccessful learning
strategies.
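
The comparison behind Figure 1 amounts to cross-tabulating cluster assignments against the pass/fail target. A minimal sketch, assuming cluster labels have already been obtained from some clustering of the embeddings (the function names and the 0.5 threshold are illustrative choices, not taken from the study):

```python
from collections import defaultdict


def fail_share_by_cluster(cluster_ids, passed):
    """Return {cluster: (size, share of failing students)}.

    `cluster_ids` are labels from any clustering of the embeddings and
    `passed` holds the binary target (1 = pass, 0 = fail).
    """
    tally = defaultdict(lambda: [0, 0])  # cluster -> [size, fails]
    for cluster, label in zip(cluster_ids, passed):
        tally[cluster][0] += 1
        tally[cluster][1] += 1 - label
    return {c: (size, fails / size) for c, (size, fails) in tally.items()}


def warning_clusters(stats, threshold=0.5):
    """Clusters in which failing students form the majority."""
    return sorted(c for c, (_, share) in stats.items() if share > threshold)
```

Clusters returned by `warning_clusters` correspond to the failure-dominated groups discussed above. Because about 30% of all instances belong to the failing class, a cluster is already informative when its failure share clearly departs from that base rate.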


Another way to shed light on the prediction process of a neu- 
ral network is to visualise attention heatmaps, where higher 
scores correspond to actions in the sequence that are more 
important for the classification decision (Figure 3). 


Limitations & Future work. A wide range of topics covered in 
these courses might influence the performance, as different 
subjects are likely to demand different behaviour patterns 
to successfully pass the course (e.g., humanities versus tech- 
nical subjects). We plan to investigate them separately and 
include information about exercise content. Finally, we plan 
to train a recommendation system informed by the extracted 
learning strategies to aid navigation. 


6. CONCLUSION 


We show that it is possible to predict whether students will 
pass the course using only their navigation pattern sequence 
in the online learning platform, without information about 
past grades. The findings of our study suggest that features 
extracted with deep learning are efficient even if the courses 
are extremely short, cover multiple different subjects, and 
only a limited number of interaction types is available. 


We propose a flexible way to increase classification performance
that requires only minimal preprocessing of the raw sequences
of actions extracted from the learning platform.
A novel and relatively simple percentage progress encod- 
ing scheme is introduced which captures the course struc- 
ture while scaling well to the new data. It results in an 
improvement of almost 10% in ROC AUC score. We also 
demonstrate a positive effect of using pre-trained unsuper- 
vised embeddings obtained with autoencoders (up to 15% 
improvement when used in machine learning models, compared
to traditional features). We cluster the resulting embeddings
to show that using action sequences has more potential
for distinguishing between strategies specific to high and low 
achievers than simple count or session features. It is possible 
to visualise attention heatmaps and see the contribution of 
individual actions to the classification decision to interpret 
retrieved strategies. 


Our research supports decision-makers, as it allows detecting 
(un)successful students from their navigation patterns alone, 
without having to grade intermediate exercises. The action 
sequences corresponding to high achievers can be used to in- 
form learning design patterns and recommendation systems 
in a more meaningful way than the standard count features. 
As the proposed models are shown to outperform several 
baselines on extremely short, incomplete action sequences, 
they allow us to intervene early if a student begins to follow 
a trajectory associated with a lower chance of success.


APPENDIX


Predictive performance on incomplete sequences

Figure 2: Comparison of validation ROC AUC scores of
an LSTM on encoding schemes in the incomplete sequence 
scenario. We can see that percentage and local content id 
encoding schemes perform better than raw actions. As such, 
we would be able to detect whether a student is likely to fail 
from the first two actions already. 


Figure 3: An attention heatmap of the RNN model with 
Bahdanau attention mechanism on action data (multiple se- 
quences view). Higher scores (brighter colours) correspond 
to actions in the sequence that are more important for clas- 
sifying the student as passing or not. If we use an encoding 
scheme which includes content id (such as the global content 
id here), we see which content items contribute more to the 
classification decision: for example, in the bottom row, the 
viewing of the video with id 264231 is more important for 
the network than the others.


ABSTRACT


Attendance rate is an important indicator of students’ study 
motivation, behavior and psychological status. However, the
heterogeneous nature of student attendance rates due to the course 
registration difference or the online/offline difference in a blended 
learning environment makes it challenging to compare attendance 
rates. In this paper, we propose a novel method called Relative 
Attendance Index (RAI) to measure attendance rates, which 
reflects students’ efforts on attending courses. While traditional 
attendance focuses on the record of a single person or course, 
relative attendance emphasizes peer attendance information of 
relevant individuals or courses, making the comparisons of 
attendance more justified. Experimental results on real-life data 
show that RAI can indeed better reflect student engagement. 


Keywords 


attendance, peer information, engagement, academic performance, 
comparison, clustering 


1. INTRODUCTION 


While studying offline is the norm for most schools, during 
epidemic periods all or a portion of students are forced to study 
online due to university closure. In such a blended learning 
environment, tracking the study status and the wellbeing of 
students is an important issue for the university. Students’ 
attendance in classes is a measure that reflects students’ 
enthusiasm for the course and their status in the university [29]. 
Many studies suggest a correlation between attendance and 
attainment at university [5, 26, 14]. Several studies detect 
attendance rates using mobile devices and include attendance as a 
feature to predict academic performance [27, 19, 28]. Attendance 
is also correlated with behavior and Psychological problems such 
as video game addiction [24] and depression [26]. Detecting 
unusual attendance rate changes can help to identify abnormal 
behaviors and Psychological problems in an early stage and 
provide in time intervention to students in need. 


Pan Deng, Jianjun Zhou, Jing Lyu and Zitong Zhao “Assessing
attendance by peer information”. 2021. In: Proceedings
of The 14th International Conference on Educational Data Mining
(EDM21). International Educational Data Mining Society, 400-406.
https://educationaldatamining.org/edm2021/

EDM ’21 June 29 - July 02 2021, Paris, France


* The corresponding author.


The successful applications of attendance data call for fair
comparisons of attendance among peers, especially in universities.
Traditionally, the attendance rate of a course or a student is 
isolated and might not be compared fairly, because in many 
universities students are allowed to select some courses on their 
own so that the course registration records of two students can be 
different. In addition, each course can have its own attendance 
policy, making it harder to compare attendance rates. Courses that 
have mandatory attendance requirements usually have higher 
attendance rates than those that do not, so it is not fair to compare
course attendance rates without considering attendance 
requirements. Similarly, students who registered for courses with 
mandatory attendance requirements usually have higher 
attendance rates than students who registered for courses with 
voluntary attendance requirements, so that the attendance rate 
does not always reflect the attainment of a student. Attendance of 
online and offline courses cannot be compared directly either,
because the efforts to attend those courses can be significantly 
different. Attending online courses could be as simple as a mouse 
click away, while attending offline courses usually requires 
travelling from place to place physically. 


Traditionally it is not easy to fairly compare attendance in a 
university, due to not just the diversity of course registration and 
attendance requirements but also the difficulty of collecting 
campus-wide attendance data. Without attendance information of 
peer students or courses, the attendance data of a student or a 
course is isolated and difficult to adjust. However, in the era of 
Big Data, many new technologies [12, 18, 31] have been proposed 
to collect attendance data for many courses simultaneously, 
making it possible to analyze the attendance structure of the 
student population, and develop new attendance calculation 
methods. 


Careful comparisons of attendance can also provide insight into 
students’ academic interest. If a student attends a course that has a 
generally low attendance rate, it indicates that the student is more 
willing to attend the course than their classmates are. On the other
hand, if a student attends a course that has a generally high 
attendance rate, it indicates that the student is just doing what 
others are doing. 


In this paper, we propose a novel measure called the Relative
Attendance Index (RAI), which reflects the effort put into attending
courses and makes comparisons of attendance more justified. To our
knowledge, this is the first study on fair comparisons of
attendance. While traditional attendance focuses on the record of a
single person or course, we define the notion of a student's
attendance contribution to course attendance and add the
attendance information of relevant individuals or courses to make
the comparisons of attendance fairer. We perform a campus-wide
study on attendance and analyze its effects on course grades and
GPA. Our experimental results show that RAI has a higher
correlation with academic performance than the traditional
attendance rate.


The rest of this paper is organized as follows. Section 2 describes
the related studies. Section 3 introduces the RAI definition. 
Section 4 presents the experiment results on real-life data from a 
university. Section 5 discusses an application of RAI on clustering 
student populations. Section 6 describes the limitations and future 
work. Section 7 lists the acknowledgments. 


2. RELATED STUDIES 


With the advancement of technology, many new methods have 
been proposed to collect campus-wide attendance data. Several 
studies [13, 11, 18] measured attendance via QR code systems in 
which QR codes are generated and then scanned by students to 
authenticate themselves. Wang et al. [26] deployed an app on
students’ cell phones to detect attendance by GPS signals and
WiFi tracing. A method independently developed in [19] and [28]
used WiFi logs to calculate attendance. Studies in [2, 25] proposed
Bluetooth/Beacon based attendance prediction systems. Shoewu 
and Idowu [22] used fingerprints and Kar et al. [12] used face 
recognition to detect individual attendance rates. 


Some studies measured offline and online attendance at the same 
time. Brennan et al. [4] detected physical attendance by thermal 
sensors and online behaviors by clickstream data. The change of 
online and physical attendance through time was observed. 
However, due to a technological limitation, the method did not link
physical attendance to individuals and did not study the issue of
attendance comparison. Nordmann et al. [17] combined physical
attendance data with clickstream data on lecture recordings to
form a total attendance rate instead of studying them separately.
They found, however, that attendance at live lectures was still a
stronger predictor of students’ academic performance than recording use.


Many studies confirmed the correlation between attendance and 
academic performance [1, 3, 7, 16]. See [15] for a survey. [15] 
also reviewed factors that affect attendance. To work around the 
issue of fair comparisons of attendance, many studies focused on 
samples from the same course or samples with similar registration 
records (e.g., first year students) [1, 3, 10]. [3] also controlled 
factors such as age, gender, nationality etc. in their regression 
analysis. Studies in [7] and [16] divided the students into bands 
according to grades and used the average attendance of each band 
for correlation studies. 


Student subtyping and clustering are widely used in analyzing 
learning process and predicting academic performance. Yang et 
al. [30] applied EM-IRL to students’ learning behavior data and
observed significant differences between groups. Romero et al.
[20] used clustering on online forum data to predict students’ final
performance. The resulting model turned out to be suitable and highly
interpretable. Cerezo et al. [5] studied both the learning process and
the clusters’ relation with performance using LMS log data.
The resulting clusters were well interpreted and showed a satisfying
correlation with final marks.


Many studies explored the reasons for student class attendance.
Friedman et al. [9] and Moore et al. [14] reported a positive
relationship between class attendance and students’ motivation.
Sloan et al. [23] further found that the level of interest has a
significant impact on attendance. These studies indicated that
attendance, along with other features, can better show students’
academic interest than traditional models.


None of the above studies has applied peer information to revise 
attendance measurements. 


3. METHOD 

The traditional attendance rate of a class or a student is defined in 
a straightforward way. Only the information of the class or the 
student is involved. We give the definition of Attendance Rate 
(AR) formally as in Definition 1. 


Definition 1 (Attendance Rate $r_c$ and $r_s$): Given class $c$ and
student $s$, let $n_c^{reg}$ and $n_c^{att}$ be the number of students
registered for $c$ and the number of students who attended $c$
respectively; let $n_s^{reg}$ and $n_s^{att}$ be the number of classes
registered by $s$ and the number of classes attended by $s$. Then the
Attendance Rate (AR) of class $c$ ($r_c$) and of student $s$ ($r_s$)
are defined as below respectively.


$$r_c = \frac{n_c^{att}}{n_c^{reg}} \qquad (1)$$

$$r_s = \frac{n_s^{att}}{n_s^{reg}} \qquad (2)$$


Students can have different sets of registered classes, and classes 
can have very different attendance requirements. When comparing 
the attendance rates of two students, it is necessary to analyze the 
set of classes attended by these two students and the attendance 
rates of these classes. If a student attends a class attended by 
almost everyone, the student makes little contribution to the 
attendance rate of the class; on the other hand, if a student attends 
a class that has a low attendance rate, the student makes a 
significant contribution to the attendance rate. To capture the 
concept, we propose the notion of attendance contribution as in 
Definition 2. 


Definition 2 (Attendance Contribution $D_{sc}$): Let $r_c$ be the
attendance rate of class $c$, and $a_{sc}$ be a function indicating whether
student $s$ attended class $c$ or not, then the Attendance Contribution
of student $s$ on the attendance rate of class $c$ is defined as


$$D_{sc} = a_{sc} - r_c \qquad (3)$$

with

$$a_{sc} = \begin{cases} 1, & \text{if } s \text{ attended } c \\ 0, & \text{if } s \text{ did not attend } c \end{cases} \qquad (4)$$


Since $r_c \in [0,1]$, Attendance Contribution is a number between
-1 and 1. If $s$ has registered for $c$ and $s$ attended $c$, then the
attendance rate of $c$ cannot be zero and $D_{sc}$ can approach 1 but
never reach 1.


With Attendance Contribution, we can compare the attendance
of two students by computing the average Attendance
Contribution over their registered classes. We define this notion as
the Relative Attendance Index (RAI) in Definition 3.


Definition 3 (Relative Attendance Index $RAI_s$): Given student $s$,
let $K_s$ be the set of classes registered by $s$. The Relative
Attendance Index (RAI) of $s$ is defined as


$$RAI_s = \frac{\sum_{c \in K_s} D_{sc}}{|K_s|} \qquad (5)$$


RAI considers both the student’s individual attendance status of a 
semester and the attendance status of the student’s classmates. 
The peer information is injected into the new measure through the 
course attendance rate in Attendance Contribution. 


LEMMA 1: $-1 < RAI_s < 1$.
Proof: The $RAI_s$ definition only considers classes registered by $s$.
When $a_{sc} = 0$, $r_c \in [0,1)$; when $a_{sc} = 1$, $s$ attended $c$,
therefore $r_c \in (0,1]$. Thus $a_{sc} - r_c \in (-1,1)$. Therefore

$$RAI_s = \frac{\sum_{c \in K_s}(a_{sc} - r_c)}{|K_s|} \in \left(\frac{|K_s| \times (-1)}{|K_s|}, \frac{|K_s| \times 1}{|K_s|}\right) = (-1, 1).$$


RAI is a number between -1 and 1. When RAI approaches -1, $a_{sc}$
is mostly 0 and $r_c$ approaches 1 for most classes, indicating that
the student has skipped many well attended classes. On the other
hand, when RAI approaches 1, $a_{sc}$ is mostly 1 and $r_c$ approaches
0 for most classes, indicating that the student has attended many
poorly attended classes. Therefore, RAI shows the difference of
attitudes toward classes between students and their classmates.
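

The definitions above can be illustrated with a minimal Python sketch. It is not the authors' implementation: the record layout (one `(student, class_id, attended)` tuple per registration), the function names, and the toy data are all assumptions made for illustration.

```python
from collections import defaultdict


def class_attendance_rates(records):
    """Eq. (1): r_c = (students who attended c) / (students registered for c)."""
    registered = defaultdict(int)
    attended = defaultdict(int)
    for _student, class_id, was_present in records:
        registered[class_id] += 1
        attended[class_id] += int(was_present)
    return {c: attended[c] / registered[c] for c in registered}


def attendance_rate_and_rai(records):
    """Return {student: (AR, RAI)} following Eq. (2) and Eq. (5)."""
    records = list(records)  # the records are traversed twice
    r = class_attendance_rates(records)
    n_reg = defaultdict(int)
    n_att = defaultdict(int)
    contribution = defaultdict(float)
    for student, class_id, was_present in records:
        a_sc = int(was_present)                      # Eq. (4)
        n_reg[student] += 1
        n_att[student] += a_sc
        contribution[student] += a_sc - r[class_id]  # D_sc, Eq. (3)
    return {s: (n_att[s] / n_reg[s], contribution[s] / n_reg[s]) for s in n_reg}


# Hypothetical toy data: everyone attends "popular", only A attends "sparse".
records = [
    ("A", "popular", True), ("B", "popular", True),
    ("C", "popular", True), ("D", "popular", True),
    ("A", "sparse", True), ("B", "sparse", False),
    ("C", "sparse", False), ("D", "sparse", False),
]
scores = attendance_rate_and_rai(records)
```

In this toy data, attending the fully attended class contributes 0 to every student's RAI, so the index is driven entirely by the sparsely attended class: student A gains from attending it, while the others lose a little for skipping a class that most peers also skipped.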


4. RESULTS
4.1 Data and Setup 


The anonymous attendance and grade data used in this paper were 
collected in 2018 and 2019 from a university¹ in China. We
applied the method proposed in [19] and [28] to calculate the 
attendance. Note that our Relative Attendance Index can be 
applied to attendance data collected using other methods such as 
QR code [18]. The student IDs were converted into hash codes, 
then the attendance and grade data were connected through the 
hash codes. The university did not have a mandatory attendance 
policy; however, while most instructors followed the university 
policy, some instructors had their own attendance requirements. 
Some instructors used in-class discussions and quizzes to 
encourage attendance. The data came from 4838 students from 
Cohort 14 to 19 in 44 majors, spanning over 3 semesters, with 
1489 courses grouped into 37 categories by the university. For 
most courses, students received letter grades from A to F. Courses 
with other grades such as P/F were excluded. The traditional 
attendance rate (AR) for a student was calculated using Formula 
(2) in Definition 1, and the corresponding RAI was calculated 
using Formula (5) in Definition 3. 


4.2 Correlation with Academic Performance 
Many previous studies show that attendance is correlated with 
academic performance. Given that the purpose of student 
attendance comparison is usually to assess the attainment of the 
students, we calculated the correlation between attendance rates 
and GPA to assess the fairness of attendance comparisons. We
assume that a fairer attendance measure should have a
higher correlation with GPA. The Pearson correlation between
RAI and GPA is 0.48, which is notably higher than that of
AR (0.37). The p-values of the two correlation values are 
3.7x107> and 2.6x10"!? respectively. Since they are well below 
the 0.05 threshold, the correlation values are generally considered 
significant. 
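

For reference, the Pearson correlation used in this comparison can be computed as in the following sketch. The function name and the toy values are hypothetical, and the snippet does not reproduce the paper's data or p-values.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient of two equally long sequences."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return cov / math.sqrt(var_x * var_y)


# Hypothetical per-student values, for illustration only.
rai_values = [-0.30, -0.10, 0.00, 0.15, 0.25]
gpa_values = [2.1, 2.8, 3.0, 3.4, 3.3]
r_rai_gpa = pearson(rai_values, gpa_values)
```

The same function would be applied once to (AR, GPA) pairs and once to (RAI, GPA) pairs, and the two coefficients compared.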


We also calculated the correlation between attendance and
academic performance within each course category. The result is
shown in Table 1 (sorted by the RAI correlation). Some course
categories were filtered out because they had small enrollments
and did not generate correlation values with low enough p-values
(<0.05) to be statistically significant. For 19 out of the 26 course
categories, RAI has a higher correlation than AR. Only for two
categories, GED and FRN, does AR have a higher correlation than RAI
(the descriptions of the categories are listed in Table 1). AR and
RAI are tied for the five categories of FMA, GNB, ERG, CHM and
CSC. We remark that language related courses such as ENG
(English) have low correlations because those courses usually
have in-class discussions resulting in an unofficial mandatory
attendance requirement. Categories that rely on prior knowledge
from high school, such as Chemistry and Physics, also have low
correlations. Since most students attending this university did not
study Calculus in high school and Calculus accounts for a large
fraction of Mathematics courses, it is reasonable to see the MAT
category having a much higher correlation.


¹ The use of the data by our project has been approved by the
university management and the committee in charge of personal
information in this university.


Table 1. Correlation in course categories 


CAT. | Description | AR | RAI
FMA | Financial Mathematics | 0.65 | 0.65
MSE | Material Science and Engineering | 0.46 | 0.52
BIM | Bioinformatics | 0.35 | 0.51
GED | General Education D | 0.48 | 0.46
GEB | General Education B | 0.42 | 0.45
STA | Statistics | 0.32 | 0.39
MGT | Management | 0.26 | 0.39
GEN | GE Foundation: In Dialogue with Nature | 0.28 | 0.37
FIN | Finance | 0.34 | 0.36
MAT | Mathematics | 0.34 | 0.36
EIE | Electronic Information Engineering | 0.34 | 0.36
GFH | GE Foundation: In Dialogue with Humanity | 0.34 | 0.36
ECO | Economics | 0.27 | 0.36
GNB | Genomics and Bioinformatics | 0.35 | 0.35
GEA | General Education A | 0.32 | 0.35
ACT | Accounting | 0.32 | 0.34
PHY | Physics | 0.22 | 0.29
HSS | Humanities and Social Science | 0.27 | 0.28
ERG | General Engineering courses | 0.27 | 0.27
GEC | General Education C | 0.24 | 0.27
CHM | Chemistry | 0.25 | 0.25
FRN | French | 0.28 | 0.23
CSC | Computer Science | 0.23 | 0.23
MKT | Marketing | 0.12 | 0.22
CHI | Chinese | 0.13 | 0.19
ENG | English | 0.06 | 0.12


4.3 RAI Distribution 

To illustrate the different distributions on RAI for high and low 
course grade students, we collected two sets of samples, with one 
set having a course grade no less than B+ and the other set no 
greater than C. Each sample is a triplet with a hashed student ID, a 
course ID, and the corresponding grade received by the student in 
the course. We then calculated the RAI of the student in the 
corresponding course. Figure 1 (a) shows the distribution of the 
first set. It shows that more than 50% of samples have RAI > 0 
(better than normal). Figure 1 (b) shows the distribution of the low 
course grade set. It shows that the majority of samples have RAI 
< 0 (worse than normal), with some down to -0.8. For easier
comparison of the two sets, the values in both subfigures have been
normalized to proportions. We can see that the first set
has a more concentrated distribution than the second set. This
indicates that students receiving grade C or lower have a much
higher probability of having extreme attendance behaviors
(skipping many courses).


Figure 1. RAI of high and low course grade samples.


5. DISCUSSION 


In this section, we showcase an application of RAI on clustering 
the student population. 


We formatted the attendance values in the 37 course categories as 
a vector for each student, then applied a clustering algorithm on 
the vectors. Since the dimensionality of 37 is too high for most 
clustering algorithms, we applied PCA to reduce the 
dimensionality. The clustering algorithm we applied was the 
DBSCAN clustering algorithm [8] using the Euclidean distance. 
DBSCAN performs density-based clustering and does not require
the number of clusters as input. The parameters we tuned in this
experiment are specified in Table 2. We used the silhouette score
[21] to select the set of parameters with the highest silhouette
score.


Table 2. Parameters to tune for the clustering. 


Parameters | Range
Number of PCA components | [5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15]
Eps of DBSCAN | [0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 1.0]
MinPoints of DBSCAN | [5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20]
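

The search over the parameter ranges in Table 2 can be sketched as a plain grid search. This is an illustration under stated assumptions, not the authors' code: `select_best` and `toy_score` are hypothetical names, and `toy_score` is only a placeholder for a function that would reduce the student vectors with PCA, run DBSCAN, and return the silhouette score (for example with scikit-learn's `PCA`, `DBSCAN`, and `silhouette_score`), or `None` when fewer than two clusters are found.

```python
from itertools import product

# Ranges taken from Table 2.
PCA_COMPONENTS = list(range(5, 16))                     # 5 .. 15
EPS_VALUES = [round(0.1 * i, 1) for i in range(1, 11)]  # 0.1 .. 1.0
MIN_POINTS = list(range(5, 21))                         # 5 .. 20


def parameter_grid():
    """All (n_components, eps, min_points) combinations from Table 2."""
    return list(product(PCA_COMPONENTS, EPS_VALUES, MIN_POINTS))


def select_best(grid, score_fn):
    """Return the parameter tuple with the highest score and that score.

    Combinations for which score_fn returns None are skipped.
    """
    best_params, best_score = None, float("-inf")
    for params in grid:
        score = score_fn(*params)
        if score is not None and score > best_score:
            best_params, best_score = params, score
    return best_params, best_score


def toy_score(n_components, eps, min_points):
    # Placeholder objective (an assumption), standing in for the silhouette score.
    return -(abs(n_components - 8) + abs(eps - 0.5) + abs(min_points - 10))


grid = parameter_grid()
best_params, best_score = select_best(grid, toy_score)
```

With the ranges in Table 2 the grid contains 11 × 10 × 16 = 1760 combinations, so an exhaustive search is feasible for a population of a few thousand students.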


We applied the same clustering procedure to AR and RAI 
attendance values respectively. For AR, the procedure failed to 
generate meaningful clustering (the result contained one big 
cluster only). For RAI the procedure produced 8 clusters with 616, 
472, 665, 520, 130, 1799, 61, and 549 students respectively. 26 
student samples were labeled as noise by DBSCAN and excluded 
in the follow-up study. For each cluster, we identified the top five 
most popular majors among the samples in the cluster to analyze 
students’ academic interest and performance. 


Figure 2 shows the profiles of the 8 clusters from the RAI 
clustering, labeled as cluster 1 to 8. We combined the dimensions 
of student numbers, distribution of majors, RAI attendance rates, 
top 10% academic performance ratio, and last 10% academic 
performance ratio to show how RAI attendance is related to
academic performance and how the analysis can provide guidance
on major selection. While some of the findings are interesting, we
admit that not all phenomena can be fully explained due to the
complexity behind attendance and attainment [15]. For all
subfigures in Figure 2, the x-axis is the major of the students. Figure
2(a) shows the number of students in each major for the 8 clusters
in a row. Some of the clusters are very specific. Cluster 5 contains
two majors only, TRAN (Translation) and PSY (Psychology); 
Cluster 7 contains the major of FE (Financial Engineering) only. 
Figure 2(b) shows the distribution of majors among the clusters 
(whether a cluster accounts for a significant portion of the 
students in a major), with each bar representing a fraction of the 
corresponding major in the university. For example, as shown in 
Figure 2(b), close to 70% of the students majoring in PSY are in 
cluster 5; close to 50% of the students majoring in CSE 
(Computer Science and Engineering) are in Cluster 6, with other 
large portions of CSE students in Cluster 2, 3 and 4. Figure 2(c) 
shows the RAI attendance of the clusters. Students in Cluster 6 
have significantly lower RAI values than the other clusters. Figure 
2(d) shows the ratio of students with a GPA in the top 10% of the 
major. If a bar of major m is higher than the 0.1 line, it means that 
the students from the cluster in major m outperform the average 
level of students in major m. Similarly, Figure 2(e) shows the 
portion of students with a GPA in the last 10% of the major. The 
higher the value, the worse the performance of the students, which 
is the opposite of Figure 2(d). 


Figure 2 illustrates how RAI correlates with academic 
performance. Figure 2(c) shows that Cluster 6 has the overall 
lowest RAI values, with all the five majors having negative RAI 
values. Cluster 6 also has the worst top 10% ratio in Figure 2(d) 
(only one major is barely over the average cutline), and the worst 
last 10% ratio in Figure 2(e) (all five majors worse than the 
average). The TRAN major has about the same number of 
students in Cluster 5 and Cluster 8. The TRAN in Cluster 8 has a 
higher RAI value as well as a higher top 10% ratio and a much 
lower last 10% ratio than TRAN in Cluster 5. There are 
exceptions though. CSE in Cluster 3 has a negative RAI, but its
top 10% ratio is the highest in Cluster 3. However, this is
consistent with our result in Table 1, which shows that CSC
courses have a relatively low RAI correlation with academic
performance (CSE major students usually take many CSC
courses).


Another interesting finding is that in all the 7 clusters with more 
than one major, the major that has the highest RAI value also has 
the lowest last 10% ratio except for Cluster 2. In Cluster 2 it is the 
second highest RAI major EIE that has the lowest last 10% ratio. 
The highest RAI major in Cluster 2 is BIFC (Bioinformatics), a 
new major with a relatively small enrollment. Students facing the 
risk of poor academic performance may consider selecting or 
switching to the major with the highest RAI in the same cluster. 
While we admit that this by no means establishes a correlation between RAI
and students’ academic interest, we remark that interest in a
subject is generally believed to help guard against poor
performance. Together with the fact that DBSCAN worked better
on RAI than AR, we believe this phenomenon may suggest that 
RAI has a better potential than AR for exploring students’ 
academic interest. 


6. LIMITATIONS AND FUTURE WORK


In this study, we defined the Relative Attendance Index to adjust 
the attendance measurement, with the objective of better 
reflecting students’ attainment and interest. While attendance is 
affected by many factors [15], the new information we introduced 
is only the attendance of the peer. Further improvements should 
address more factors of attendance. 


When clustering on student data, the clustering algorithm 
DBSCAN worked better on RAI than AR data, and the clustering 
analysis confirmed the correlation between RAI and academic 
performance. We admit that we have not been able to confirm the 
correlation between RAI and students’ academic interest, which is 
an interesting topic to be further explored. 


Figure 2. Student Clustering.


The raw attendance data was collected using a WiFi based method
[19, 28]. It is possible that some students turned off the WiFi
connection on their cell phones or even turned off their cell phones
altogether before class, leading to a false label of absence. If a
student had fewer than 50 WiFi connection records in a week, their
data in that week were excluded from the statistics. While we
admit that this could generate some noise in the attendance, we
observed that this situation only occurred rarely. For example, in
Fall term 2019 we found that only 2 out of 4838 students turned
off WiFi completely, and only 3.34% of the students had some
weeks of data filtered out. We conjecture two reasons for this
phenomenon. First, usage of laptops and tablets is popular among
the students in this university. Many students carry them to the
classroom to view course materials. Laptops and tablets usually
can connect via WiFi only. Second, students use time confetti
before and after class to chat with friends or read the news. It will


ABSTRACT


Traditionally, clustering algorithms focus on partitioning the 
data into groups of similar instances. The similarity objective,
however, is not sufficient in applications where a fair
representation of the groups in terms of protected attributes
like gender or race is required for each cluster. Moreover,
in many applications, to make the clusters useful for the 
end-user, a balanced cardinality among the clusters is re- 
quired. Our motivation comes from the education domain 
where studies indicate that students might learn better in 
diverse student groups and of course groups of similar car- 
dinality are more practical e.g., for group assignments. To 
this end, we introduce the fair-capacitated clustering prob- 
lem that partitions the data into clusters of similar instances 
while ensuring cluster fairness and balancing cluster cardi- 
nalities. We propose a two-step solution to the problem: i) 
we rely on fairlets to generate minimal sets that satisfy the 
fair constraint and ii) we propose two approaches, namely 
hierarchical clustering and partitioning-based clustering, to 
obtain the fair-capacitated clustering. Our experiments on 
three educational datasets show that our approaches deliver 
well-balanced clusters in terms of both fairness and cardi- 
nality while maintaining a good clustering quality. 


Keywords 
fair-capacitated clustering, fair clustering, capacitated clus- 
tering, fairness, learning analytics, fairlets, knapsack. 


1. INTRODUCTION 


Machine learning (ML) plays a crucial role in decision-making 
in almost all areas of our lives, including areas of high soci- 
etal impact, like healthcare and education. Our work’s mo- 
tivation comes from the education domain where ML-based 
decision-making has been used in a wide variety of tasks from 
student dropout prediction [9], forecasting on-time graduation
of students [15] to education admission decisions [21].
Recently, the issue of bias and discrimination in ML-based 
decision-making systems is receiving a lot of attention [28] 
as there are many recorded incidents of discrimination (e.g., 
recidivism prediction [20], grades prediction [4, 14]) caused 
by such systems against individuals or groups of people on
the basis of protected attributes like gender, race etc. Bias 
in education is not a new problem. There is already a long 
literature on different sources of bias in education [24] or stu- 
dents’ data analysis [3] as well as studies on racial bias [31] 
and gender bias [22]. However, ML-based decision-making 
systems have the potential to amplify prevalent biases or cre- 
ate new ones and therefore, fairness-aware ML approaches 
are required also for the educational domain. 


In this work, we focus on fairness in clustering, as in edu- 
cational activities, group assignments [8] and student team 
achievement divisions [30] are important tools that help stu- 
dents working together towards shared learning goals. Clus- 
tering is an effective solution for partitioning students into 
groups of similar instances [3, 26]. Traditional algorithms, 
however, focus solely on the similarity objective and do not 
consider the fairness of the resulting clusters w.r.t.  pro- 
tected attributes like gender. However, studies indicate that 
students might learn better in diverse groups, e.g., mixed- 
gender groups [11, 32]. Lately, fair-clustering solutions have 
been proposed [1, 2, 5, 6], which aim to discover clusters with 
a fair representation regarding some protected attributes. In 
this work, we adopt the cluster fairness of [6], called clus- 
ter balance, according to which protected groups must have 
approximately equal representation in every cluster. 


In a teaching situation, it is obvious that the size of the 
groups should be comparable to allow a fair allocation of 
work among the students. As traditional clustering algo- 
rithms do not consider this requirement, clusters of varying 
sizes might be extracted, reducing the usefulness and ap- 
plicability of the partitioning for end-users/teachers. This 
leads to the demand for clustering solutions that also take 
into account the size of the clusters. The problem is known 
as capacitated clustering problem (CCP) [25] which aims to 
extract clusters with a limited capacity while minimizing 
the total dissimilarity in the clusters. Capacitated cluster- 
ing is useful in quite a few applications such as transfer- 
ring goods/services from the service providers (post office, 
stores, etc.), garbage collection and sales force territorial
design [27]. To the best of our knowledge, no solution exists
that considers both fairness and capacity of clusters on
top of the similarity objective.


To this end, we propose a new problem, the so-called fair- 
capacitated clustering that ensures fairness and balanced 
cardinalities of the resulting clusters. We decompose the 
problem into two subproblems: i) the fairness-requirement 
compliance step that preserves fairness at a minimum thresh- 
old of balance score and ii) the capacity-requirement com- 
pliance step that ensures clusters of comparable sizes. For 
the first step, we generate fairlets [6], which are minimal sets 
that satisfy fair representation w.r.t. a protected attribute. 
For the second step, we propose two solutions for two different
clustering types, namely hierarchical and partitioning-based
clustering, that consider the capacity constraint during
the merge step (hierarchical approach) or the assignment step
(partitioning approach). Experimental results, on three real
datasets from the education domain, show that our methods 
result in fair and capacitated clusters while preserving the 
clustering quality. 


2. RELATED WORK 


Chierichetti et al. [6] introduced the fair clustering problem 
with the aim to ensure equal representation for each pro- 
tected attribute, such as gender, in every cluster. In their 
formulation, each instance is assigned with one of two colors 
(red, blue). They proposed a two-phase approach: clus- 
tering all instances into fairlets - small clusters preserving 
the fairness measure, and then applying vanilla clustering 
methods on those fairlets. Subsequent studies focus on
generalization and scalability. Backurs et al. [1] presented an
approximate fairlet decomposition algorithm which can
formulate the fairlets in nearly linear time, thus tackling the
efficiency bottleneck of [6]. Rösner and Schmidt [29] generalized
the fair clustering problem to more than two protected
attributes. A more generalized and tunable notion of fairness
for clustering was introduced in Bera et al. [2]. Anshuman
and Prasant [5] introduced a fair hierarchical agglomerative 
clustering method for multiple protected attributes. 


The capacitated clustering problem (CCP), a combinatorial 
optimization problem, was first introduced by Mulvey and 
Beck [25] who proposed solutions using heuristic and sub- 
gradient algorithms. Several approaches exist to improve 
the efficiency of solutions or CCP approaches for different 
cluster types. Khuller and Sussmann [17], for example, in- 
troduced an approximation algorithm for the capacitated 
k-Center problem. Geetha et al. [10] improved k-Means 
algorithm for CCP by using a priority measure to assign 
points to their centroid. Lam and Mittenthal [19] proposed 
a heuristic hierarchical clustering method for CCP to solve 
the multi-depot location-routing problem. 


In this work, we introduce the problem of fair-capacitated 
clustering which builds upon the notions from fair cluster- 
ing and capacitated clustering. In particular, we build upon 
the notion of fairlets [6] to extract the minimal sets that 
preserve fairness. Regarding the CCP we follow the formu- 
lation of [25] to ensure balanced cluster cardinalities. To the 
best of our knowledge, the combined problem has not been 
studied before and as already discussed, comprises a useful 
tool in quite a few domains like education. 


3. PROBLEM DEFINITION 

Let $X \in \mathbb{R}^n$ be a set of instances to be clustered and let
$d(\cdot,\cdot): X \times X \to \mathbb{R}$ be the distance function. For an integer $k$
we use $[k]$ to denote the set $\{1, 2, \ldots, k\}$. A $k$-clustering $\mathcal{C}$ is
a partition of $X$ into $k$ disjoint subsets, $\mathcal{C} = \{C_1, C_2, \ldots, C_k\}$,
called clusters, with $S = \{s_1, s_2, \ldots, s_k\}$ being the corresponding
cluster centers. The goal of clustering is to find an assignment
$\phi: X \to [k]$ that minimizes the objective function:


$$\mathcal{L}(X, \mathcal{C}) = \sum_{s_i \in S} \sum_{x \in C_i} d(x, s_i) \qquad (1)$$


As shown in Eq. 1, the goal is to find an assignment that
minimizes the sum of distances between each point $x \in X$
and its corresponding cluster center $s_i \in S$. It is clear
that such an assignment optimizes the similarity but does
not consider fairness or capacity of the resulting clusters.


Capacitated clustering: The goal of capacitated clustering
[25] is to discover clusters of given capacities while still
minimizing the distance objective $\mathcal{L}(X, \mathcal{C})$. The capacity
constraint is defined as an upper bound $Q_i$ on the cardinality
of each cluster $C_i$:


$$|C_i| \le Q_i \qquad (2)$$


Clustering fairness: We assume the existence of a binary
protected attribute $P = \{0, 1\}$, e.g., gender = {male, female}.
Let $\psi: X \to P$ denote the demographic group to which a
point belongs, i.e., male or female.


Fairness of a cluster is evaluated in terms of the balance
score [6] as the minimum ratio between two groups.

$$\text{balance}(C_i) = \min\left(\frac{|\{x \in C_i \mid \psi(x) = 0\}|}{|\{x \in C_i \mid \psi(x) = 1\}|}, \frac{|\{x \in C_i \mid \psi(x) = 1\}|}{|\{x \in C_i \mid \psi(x) = 0\}|}\right) \qquad (3)$$

Fairness of a clustering $\mathcal{C}$ equals the balance of the least
balanced cluster $C_i \in \mathcal{C}$.


$$\text{balance}(\mathcal{C}) = \min_{C_i \in \mathcal{C}} \text{balance}(C_i) \qquad (4)$$
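

As a small illustration of Eq. 3 and Eq. 4, the sketch below computes the balance of a cluster and of a clustering. Each cluster is represented only by the list of protected-attribute values (0 or 1) of its members; this representation, the function names, and the convention that a cluster missing one of the two groups has balance 0 are assumptions made for the example.

```python
def balance_of_cluster(groups):
    """Eq. 3: minimum ratio between the two protected groups in a cluster.

    groups: list of protected-attribute values (0 or 1), one per member.
    """
    n0 = sum(1 for g in groups if g == 0)
    n1 = sum(1 for g in groups if g == 1)
    if n0 == 0 or n1 == 0:
        # Assumed convention: a cluster containing only one group has balance 0.
        return 0.0
    return min(n0 / n1, n1 / n0)


def balance_of_clustering(clusters):
    """Eq. 4: the balance of the least balanced cluster."""
    return min(balance_of_cluster(groups) for groups in clusters)


# Hypothetical clustering of seven students.
clusters = [[0, 1, 0, 1], [0, 0, 1]]
overall = balance_of_clustering(clusters)
```

A perfectly mixed cluster has balance 1, and the clustering as a whole is only as fair as its least balanced cluster.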


We now introduce the problem of fair-capacitated cluster- 
ing that combines all aforementioned objectives regarding 
distance, fairness and capacity. 


Definition 1. (Fair-capacitated clustering problem)

We define the problem of $(t, k, q)$-fair-capacitated clustering
as finding a clustering $\mathcal{C} = \{C_1, C_2, \ldots, C_k\}$ that partitions
the data $X$ into $|\mathcal{C}| = k$ clusters such that the cardinality
of each cluster $C_i \in \mathcal{C}$ does not exceed a threshold $q$, i.e.,
$|C_i| \le q$ (the capacity constraint), the balance of each cluster
is at least $t$, i.e., $\text{balance}(\mathcal{C}) \ge t$ (the fairness constraint),
and the objective function $\mathcal{L}(X, \mathcal{C})$ is minimized. Parameters
$k$, $t$, $q$ are user defined, referring to the number of clusters,
the minimum balance threshold and the maximum cluster capacity,
respectively.
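

The two constraints of Definition 1 can be checked for a candidate clustering as in the sketch below. It only tests feasibility and does not minimize the distance objective; the function names and the representation of a cluster as a list of protected-attribute values are assumptions for illustration.

```python
def balance(groups):
    """Balance score of one cluster given its members' group labels (0 or 1)."""
    n0, n1 = groups.count(0), groups.count(1)
    if min(n0, n1) == 0:
        return 0.0  # assumed convention for a single-group cluster
    return min(n0 / n1, n1 / n0)


def is_fair_capacitated(clusters, t, k, q):
    """True if `clusters` satisfies the (t, k, q) constraints of Definition 1."""
    has_k_clusters = len(clusters) == k
    within_capacity = all(len(groups) <= q for groups in clusters)
    fair_enough = all(balance(groups) >= t for groups in clusters)
    return has_k_clusters and within_capacity and fair_enough


# Hypothetical candidate clustering with two clusters of three students each.
candidate = [[0, 1, 1], [0, 0, 1]]
feasible = is_fair_capacitated(candidate, t=0.5, k=2, q=3)
```

Among all clusterings that pass this check, the problem asks for the one with the smallest value of the objective in Eq. 1.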


4. FAIR-CAPACITATED CLUSTERING 


4.1 Fairlet decomposition 

Traditionally, the vanilla versions of clustering algorithms 
are not capable of ensuring fairness because they assign the 
data points to the closest center without the fairness con- 
sideration. Hence, if we could divide the original data set 
into subsets such that each of them satisfies the balance 
threshold t then grouping these subsets to generate the final 
clustering would still preserve the fairness constraint. Each
fair subset is defined as a fairlet. We follow the definition of 
fairlet decomposition by [6]. 


Definition 2. (Fairlet decomposition)
Suppose that $\mathrm{balance}(X) \ge t$ with $t = f/m$ for some
integers $1 \le f \le m$ such that the greatest common divisor
$\gcd(f, m) = 1$. A decomposition $\mathcal{F} = \{F_1, F_2, \dots, F_l\}$ of $X$
is a fairlet decomposition if: i) each point $x \in X$ belongs to
exactly one fairlet $F_j \in \mathcal{F}$; ii) $|F_j| \le f + m$ for each $F_j \in \mathcal{F}$,
i.e., the size of each fairlet is small; and iii) for each $F_j \in \mathcal{F}$,
$\mathrm{balance}(F_j) \ge t$, i.e., the balance of each fairlet satisfies the
threshold $t$. Each $F_j$ is called a fairlet.


By applying fairlet decomposition to the original dataset
$X$, we obtain a set of fairlets $\mathcal{F} = \{F_1, F_2, \dots, F_l\}$. For
each fairlet $F_j$, we randomly select a point $r \in F_j$ as its
center. For a point $x \in X$, we denote by $\gamma : X \to [1, l]$
the index of the fairlet to which $x$ is mapped. The second step is
to cluster the set of fairlets $\mathcal{F}$ into $k$ final clusters,
subject to the cardinality constraint. The clustering process
is described below for the hierarchical clustering type
(Section 4.2) and for the partitioning-based clustering type
(Section 4.3). Clustering results in an assignment from fairlets
to final clusters: $\delta : \mathcal{F} \to [k]$. The final fair-capacitated
clustering $C$ is determined by the overall assignment
function $\phi(x) = \delta(F_{\gamma(x)})$, where $\gamma(x)$ returns the index of
the fairlet to which $x$ is mapped.
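

The two-stage assignment described above is a composition of two lookups: point to fairlet, then fairlet to final cluster. A minimal sketch follows; the dictionary representation, the 0-based indices, and the function name are assumptions made for illustration only.

```python
def overall_assignment(point_to_fairlet, fairlet_to_cluster):
    """Compose the point->fairlet map with the fairlet->cluster map.

    Both maps are plain dicts here (an assumption of this sketch); the
    result maps every point directly to its final cluster index.
    """
    return {x: fairlet_to_cluster[j] for x, j in point_to_fairlet.items()}


point_to_fairlet = {"a": 0, "b": 0, "c": 1, "d": 2}  # point id -> fairlet index
fairlet_to_cluster = {0: 0, 1: 1, 2: 0}              # fairlet index -> cluster index
print(overall_assignment(point_to_fairlet, fairlet_to_cluster))
# {'a': 0, 'b': 0, 'c': 1, 'd': 0}
```

Because points never change fairlet after the decomposition, only the second map has to be recomputed when the clustering of fairlets changes.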


4.2 Fair-capacitated hierarchical clustering 
Given the set of fairlets $\mathcal{F} = \{F_1, F_2, \dots, F_l\}$, let $W =
\{w_1, w_2, \dots, w_l\}$ be their corresponding weights, where the
weight $w_j$ of a fairlet $F_j$ is defined as its cardinality, i.e., the
number of points in $F_j$.


Traditional agglomerative clustering approaches merge the
two closest clusters and thus rely solely on similarity. We extend
the merge step by also ensuring that merging does not violate
the cardinality constraint w.r.t. the cardinality threshold $q$.


THEOREM 1. The balance score of a cluster formed by the
union of two or more fairlets is at least $t$:


$$\mathrm{balance}(Y) \ge t, \quad \text{where } Y = \bigcup_{j \le l} F_j \text{ and } \mathrm{balance}(F_j) \ge t$$


PROOF. We proceed by induction. Assume we have a set of
fairlets $\mathcal{F} = \{F_1, F_2, \dots, F_l\}$ in which
$\mathrm{balance}(F_j) \ge t$, $j = 1, \dots, l$. We first consider
the case of any two fairlets $\{F_1, F_2\} \subseteq \mathcal{F}$. We have
$\mathrm{balance}(F_1) = \frac{f_1}{m_1} \ge t$ and
$\mathrm{balance}(F_2) = \frac{f_2}{m_2} \ge t$. We denote by $Y$ the
union of the two fairlets $F_1$ and $F_2$; then

$$\mathrm{balance}(Y) = \mathrm{balance}(F_1 \cup F_2) = \frac{f_1 + f_2}{m_1 + m_2} \tag{5}$$

It holds:

$$\begin{aligned}
\frac{f_1}{m_1} \ge t \quad &\text{or,} \quad \frac{f_1}{m_1 + m_2} \ge \frac{t\, m_1}{m_1 + m_2} \\
\text{Similarly,} \quad &\frac{f_2}{m_1 + m_2} \ge \frac{t\, m_2}{m_1 + m_2} \\
\Rightarrow \quad &\frac{f_1}{m_1 + m_2} + \frac{f_2}{m_1 + m_2} \ge \frac{t\, m_1}{m_1 + m_2} + \frac{t\, m_2}{m_1 + m_2} \\
\Rightarrow \quad &\frac{f_1 + f_2}{m_1 + m_2} \ge \frac{t (m_1 + m_2)}{m_1 + m_2} = t
\end{aligned} \tag{6}$$


Therefore, from Eq. 5 and Eq. 6 we get

$$\mathrm{balance}(Y) \ge t \tag{7}$$


Thus, the statement given in Theorem 1 is true for any cluster
formed by the union of any two fairlets. Now we assume
that the statement holds true for a cluster formed from $i$
fairlets, i.e., $Y = \bigcup_{j \le i} F_j$, where $1 \le i < l$. Then,

$$\mathrm{balance}(Y) = \frac{\sum_{j \le i} f_j}{\sum_{j \le i} m_j} \ge t \tag{8}$$


Consider another fairlet $F_{i+1} \in \mathcal{F}$ which is not in the formed
cluster $Y$, with $\mathrm{balance}(F_{i+1}) = \frac{f_{i+1}}{m_{i+1}} \ge t$.
Then, by joining $F_{i+1}$ with the cluster $Y$ we get the new cluster $Y'$ such that

$$\mathrm{balance}(Y') = \frac{f_{i+1} + \sum_{j \le i} f_j}{m_{i+1} + \sum_{j \le i} m_j} \tag{9}$$

Following the steps in Eq. 6, we can similarly show that

$$\frac{f_{i+1} + \sum_{j \le i} f_j}{m_{i+1} + \sum_{j \le i} m_j} \ge t \quad \Rightarrow \quad \mathrm{balance}(Y') \ge t \tag{10}$$


Hence, the theorem holds true for a cluster formed with $i + 1$
fairlets if it is true for $i$ fairlets. Since $i$ is an arbitrary number
of fairlets, the theorem holds true for all cases.


Theorem 1 shows that for any cluster formed by a union of
fairlets, the fairness constraint is always preserved. Hence,
we do not need further interventions w.r.t. fairness.
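

As a small worked illustration of Theorem 1, consider two hypothetical fairlets (the numbers are made up for this example and are not taken from the datasets) with threshold $t = 1/2$: $F_1$ contains one member of the minority group and two of the majority group, and $F_2$ contains one member of each group. Both satisfy the size bound $|F_j| \le f + m = 3$.

```latex
\begin{aligned}
\mathrm{balance}(F_1) &= \tfrac{1}{2} \ge t, \qquad
\mathrm{balance}(F_2) = \tfrac{1}{1} \ge t, \\
\mathrm{balance}(F_1 \cup F_2) &= \frac{1 + 1}{2 + 1} = \frac{2}{3} \ge \frac{1}{2} = t .
\end{aligned}
```

The balance of the union lies between the balances of the two parts, so it cannot fall below the smaller of them.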


The pseudocode is shown in Algorithm 2 of Appendix B.
In each step, the closest pair of clusters is identified and a
new cluster is created only if its capacity does not exceed
the capacity threshold $q$. Otherwise, the next closest pair
is investigated. The procedure continues until $k$ clusters
remain. The remaining clusters are fair and capacitated
according to the corresponding thresholds $t$ and $q$. To compute
the proximity matrix (line 1 and line 8), we use the distance
between the centroids of the corresponding clusters. The
function capacity(cluster) in line 5 returns the size of a cluster.
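

The merge procedure described above can be sketched in a few lines of Python. This is an illustrative sketch under stated assumptions, not the authors' implementation: the function names are hypothetical, points are plain tuples, centroid distances are recomputed in every round instead of maintaining a proximity matrix, and the sketch raises an error when no admissible merge is left.

```python
import math
from itertools import combinations


def centroid(points):
    """Component-wise mean of a list of equal-length tuples."""
    n = len(points)
    return tuple(sum(coord) / n for coord in zip(*points))


def hierarchical_capacitated(fairlets, k, q):
    """Greedy agglomerative merging of fairlets under a capacity bound q.

    fairlets: list of lists of points (tuples). Returns a list of clusters.
    Raises ValueError if no admissible merge is left before k is reached.
    """
    clusters = [list(f) for f in fairlets]
    while len(clusters) > k:
        best = None  # (distance, i, j) of the closest admissible pair
        for i, j in combinations(range(len(clusters)), 2):
            if len(clusters[i]) + len(clusters[j]) > q:
                continue  # merging would violate the capacity constraint
            d = math.dist(centroid(clusters[i]), centroid(clusters[j]))
            if best is None or d < best[0]:
                best = (d, i, j)
        if best is None:
            raise ValueError("no pair can be merged without exceeding q")
        _, i, j = best
        merged = clusters[i] + clusters[j]
        # Remove j first (j > i) so that index i stays valid.
        del clusters[j]
        del clusters[i]
        clusters.append(merged)
    return clusters


fairlets = [
    [(0.0, 0.0), (0.0, 1.0)],
    [(1.0, 0.0), (1.0, 1.0)],
    [(10.0, 0.0), (10.0, 1.0)],
    [(11.0, 0.0), (11.0, 1.0)],
]
result = hierarchical_capacitated(fairlets, k=2, q=4)
print([len(c) for c in result])  # [4, 4]
```

Since every cluster is a union of fairlets, Theorem 1 guarantees the balance threshold, and only the capacity check has to be enforced during merging.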


4.3 Fair-capacitated partitioning clustering 
Partitioning-based clustering algorithms, such as k-Medoids,
can be viewed as a distance minimization problem in which
we try to minimize the objective function in Eq. 1. The
vanilla k-Medoids does not satisfy the cardinality constraint,
since the step that allocates points to clusters is based only on
the distances between points and medoids. If we instead change
the goal of this assignment step to finding the “best” data points,
up to a defined capacity, for each medoid, rather than searching
for the most suitable medoid for each point, we can control the
cardinality of clusters. We formulate the problem of assigning points
to clusters subject to a capacity threshold $q$ as a knapsack
problem [23].


Let $S = \{s_1, s_2, \dots, s_k\}$ be the cluster centers, i.e., medoids,
and $C = \{C_1, C_2, \dots, C_k\}$ be the resulting clusters. We change
the assignment of points to clusters, using knapsack, in order
to meet the capacity constraint $q$. In particular, we
define a flag variable $y_j = 1$ if $x_j$ is assigned to cluster $C_i$,
otherwise $y_j = 0$. We then assign a value $v_j$ to data point
$x_j$, which depends on the distance of $x_j$ to $C_i$, with $v_j$ being
maximum if $C_i$ is the best cluster for $x_j$, i.e., the distance
between $x_j$ and $s_i$ is minimum. We formulate the value $v_j$ of
instance $x_j$ based on an exponential decay distance function:


$$v_j = e^{-\frac{1}{\lambda} \cdot d(x_j, s_i)} \tag{11}$$


where $d(x_j, s_i)$ is the Euclidean distance between the point
$x_j$ and the medoid $s_i$. The higher $\lambda$ is, the lower the effect
of distance on the value of the points. The point which is
closer to the medoid has a higher value.


Then, the objective function for the assignment step is:


$$\text{maximize} \sum_{j=1}^{n} v_j y_j \tag{12}$$


Now, let $\mathcal{F} = \{F_1, F_2, \dots, F_l\}$ and $W = \{w_1, w_2, \dots, w_l\}$
be the set of fairlets and their corresponding weights (i.e.,
the number of instances in each fairlet), respectively, and let $q$ be the
maximum capacity of the final clusters. Our target is to
cluster the set of fairlets $\mathcal{F}$ into $k$ clusters centered on $k$
medoids. We apply the formulas in Eq. 11 and Eq. 12 to the
set of fairlets $\mathcal{F}$, i.e., each fairlet $F_j$ plays the same role as $x_j$.
Then, the problem of assigning fairlets to each medoid in
the cluster assignment step becomes finding a set of fairlets
whose total weight is less than or equal to $q$ and whose total value
is maximized. In other words, we can formulate the cluster
assignment step in the partitioning-based clustering as a 0-1
knapsack problem.


$$\text{maximize} \sum_{j=1}^{l} v_j y_j \quad \text{subject to} \quad \sum_{j=1}^{l} w_j y_j \le q \ \text{ and } \ y_j \in \{0, 1\} \tag{13}$$


Here, $y_j$ is the flag variable for $F_j$, with $y_j = 1$ if $F_j$ is assigned
to a cluster and $y_j = 0$ otherwise; $v_j$ is the value of $F_j$,
computed by Eq. 11; and $q$ is the desired maximum capacity.
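

The 0-1 knapsack formulation above can be solved with the textbook dynamic program over integer capacities. The sketch below is illustrative only: the function names are hypothetical, the decay form `exp(-distance / lam)` is an assumption of this sketch about the value function, and the return format (total value plus chosen indices) is a choice made here for clarity.

```python
import math


def value(distance, lam=0.3):
    """Exponential-decay value of an item at `distance` from the medoid.

    The exact form exp(-distance / lam) is an assumption of this sketch.
    """
    return math.exp(-distance / lam)


def knapsack(values, weights, q):
    """0-1 knapsack via dynamic programming.

    Returns (best_total_value, sorted list of chosen item indices) such that
    the chosen weights sum to at most q. Weights must be non-negative integers.
    """
    n = len(values)
    # best[i][c] = max total value using the first i items with capacity c
    best = [[0.0] * (q + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        w, v = weights[i - 1], values[i - 1]
        for c in range(q + 1):
            best[i][c] = best[i - 1][c]
            if w <= c and best[i - 1][c - w] + v > best[i][c]:
                best[i][c] = best[i - 1][c - w] + v
    # Backtrack to recover which items were taken.
    chosen, c = [], q
    for i in range(n, 0, -1):
        if best[i][c] != best[i - 1][c]:
            chosen.append(i - 1)
            c -= weights[i - 1]
    return best[n][q], sorted(chosen)


# Four fairlets with distances to a medoid and sizes (weights); capacity 5.
distances = [0.1, 0.2, 0.5, 0.9]
sizes = [3, 2, 3, 2]
total, picked = knapsack([value(d) for d in distances], sizes, 5)
print(picked)  # [0, 1]: the two closest fairlets exactly fill the capacity
```

Fairlet weights are cardinalities and therefore integers, which is what makes the capacity-indexed table applicable; the table has (l + 1)(q + 1) entries for l candidate fairlets.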


The pseudocode of our k-Medoids fair-capacitated algorithm
is described in Algorithm 1. For each medoid, we
search for the adequate points (line 3) by using the function
knapsack(p, values, w, q) (line 10), implemented using
dynamic programming, which returns a list of items with a
maximum total value and a total weight not exceeding $q$.
In the main function (line 13), we optimize the clustering cost
by replacing medoids with non-medoid instances when the
clustering cost decreases. This optimization procedure
stops when no improvement in the clustering
cost is found (lines 19 to 32).


Algorithm 1: k-Medoids fair-capacitated algorithm
Input: F = {F_1, F_2, ..., F_l}: a set of fairlets
       W = {w_1, w_2, ..., w_l}: weights of fairlets
       q: a given maximum capacity of final clusters
       k: number of clusters
Output: A fair-capacitated clustering
 1 Function ClusterAssignment(medoids):
 2     clusters ← ∅;
 3     for each medoid s in medoids do
 4         candidates ← all fairlets which are not assigned to any cluster;
 5         p ← length(candidates);
 6         w ← weights(candidates);
 7         for each fairlet_i in candidates do
 8             values[i] ← v(fairlet_i); // Eq. 11
 9         end
10         clusters[s] ← knapsack(p, values, w, q);
11     end
12     return clusters;
13 Function main():
14     medoids ← select k of the l fairlets arbitrarily;
15     ClusterAssignment(medoids);
16     cost_best ← current clustering cost;
17     s_best ← null;
18     o_best ← null;
19     repeat
20         for each medoid s in medoids do
21             for each non-medoid o in F do
22                 consider the swap of s and o, compute the current clustering cost;
23                 if current clustering cost < cost_best then
24                     s_best ← s;
25                     o_best ← o;
26                     cost_best ← current clustering cost;
27                 end
28             end
29         end
30         update medoids by the swap of s_best and o_best;
31         ClusterAssignment(medoids);
32     until no improvements can be achieved by any replacement;
33     return clusters;


5. EXPERIMENTS


In this section, we describe our experiments and the performance
of our proposed methods on three educational datasets.


5.1 Experimental setup
Datasets. The datasets are summarized in Table 1.
UCI student performance. This dataset includes the
demographics, grades, social and school-related features of
students in secondary education of two Portuguese schools [7] in
2005-2006. “Gender” is selected as the protected attribute,
i.e., we aim to balance gender in the resulting clusters.


Open University Learning Analytics (OULAD). This
is the dataset from the OU Analyse project [18] of the Open
University, England, in 2013-2014. Student information
includes demographics, courses, interactions with
the virtual learning environment, and the final outcome. We
aim to balance the “Gender” attribute in the results.


MOOC. The data covers students who enrolled in the 16 
edX courses offered by the two institutions (Harvard Univer- 
sity and the Massachusetts Institute of Technology) during 
2012 - 2013 [13]. The dataset contains aggregated records 
which represent students’ activities and their final grades of 
the courses. “Gender” is the protected attribute. 


Table 1: An overview of the datasets

Dataset                    #instances   #attributes   Protected attribute            Balance score
UCI student performance    649          33            Gender (F: 383; M: 266)        0.695
OULAD                      4,000        12            Gender (F: 2,000; M: 2,000)    1
MOOC                       4,000        21            Gender (F: 2,000; M: 2,000)    1


Baselines. We compare against well-known fairness-aware
clustering methods and a vanilla clustering method.


k-Medoids [16]: a traditional partitioning clustering tech- 
nique that divides the dataset into k clusters so as to mini- 
mize clustering cost. Cluster centers are actual instances. 


Vanilla fairlet [6]: a fairness-aware clustering approach that 
i) decomposes the dataset into fairlets and ii) applies a vanilla 
k-center clustering algorithm [12] to form the final k clusters. 


MCF fairlet [6]: Similar to Vanilla fairlet but the fairlet 
decomposition part is transformed into a minimum cost flow 
(MCF) problem, by which an optimized version of fairlet 
decomposition in terms of cost value is computed. 


Evaluation measures. We report on clustering quality (mea- 
sured as clustering cost, see Eq. 1), cluster fairness (ex- 
pressed as cluster balance, see Eq. 4) and cluster capacity 
(expressed as cluster cardinality). 


Parameter selection. Regarding fairness, the minimum balance
threshold $t$ is set to 0.5 for all datasets in our experiments.
This means that the ratio of the minority group to the majority
group is at least 0.5 in each resulting cluster. Regarding the
factor $\lambda$ in Eq. 11, a value $\lambda = 0.3$ is chosen for our
experiments from the range [0.1, 1.0] via grid search. We evaluate
the clustering cost and balance score on a small dataset,
i.e., the UCI student performance dataset (Mathematics subject),
w.r.t. $\lambda$. Theoretically, the ideal capacity of clusters is
$\left\lceil \frac{|X|}{k} \right\rceil$, where $|X|$ is the population of dataset $X$ and $k$ is the
number of desired clusters. However, in many cases, the
clustering models cannot satisfy this constraint, especially
the hierarchical clustering model. Hence, we use the formula
in Eq. 14 to compute the maximum capacity $q$ of clusters; $\epsilon$ is
a parameter chosen in experiments for each fair-capacitated
clustering approach.


$$q = \left\lceil \epsilon \cdot \frac{|X|}{k} \right\rceil \tag{14}$$


To find the appropriate value of $\epsilon$, we search the range [1.0,
1.3] to ensure all the generated clusters have members and
evaluate the cardinality of the resulting clusters on the UCI
student performance (Mathematics subject) dataset. $\epsilon$ is set to
1.01 and 1.2 for the k-Medoids fair-capacitated and hierarchical
fair-capacitated methods, respectively.


5.2 Experimental results 


UCI student performance. When k is less than 4, as shown
in Figure 1-a, the clustering quality of our models is
close to that of the vanilla k-Medoids method. However,
the clustering cost fluctuates thereafter due to the effort
to maintain fairness and cardinality.
Our vanilla fairlet hierarchical fair-capacitated method outperforms
the other competitors in most cases. Vanilla fairlet and MCF
fairlet show the worst clustering cost as an effect of the
k-Center method. Figure 1-b depicts the clustering fairness.
In terms of fairness, vanilla fairlet hierarchical
fair-capacitated has the best performance when k is
less than 10. In contrast, by selecting each point for
each cluster in the cluster assignment step, the k-Medoids
fair-capacitated method maintains fairness well in
many cases. Regarding cardinality, as illustrated in
Figure 1-c, our approaches outperform the competitors, as
they keep the number of instances in each cluster
under the specified thresholds.


OULAD. Our MCF fairlet k-Medoids fair-capacitated ap- 
proach outperforms other methods in terms of clustering 
cost, although there is an increase compared to the vanilla 
k-Medoids algorithm, as we can see in Figure 2-a. Con- 
cerning fairness, in Figure 2-b, k-Medoids is the weakest 
method while others can achieve the highest balance. The
balance of the Gender feature in the dataset is the main reason
for this result. All fairlets are fully fair; this is a prerequisite
for our methods to be able to maintain the perfect
balance. Regarding cardinality, our approaches demonstrate
their strength in ensuring the capacity of clusters (Figure 2- 
c). The difference in the size of the clusters generated by 
our methods is tiny. This is in stark contrast to the trend 
of competitors. 


MOOC. The results on clustering quality are shown in
Figure 3-a (Appendix A). Although an increase in the
clustering cost is the main trend, our methods outperform the
vanilla fairlet and MCF fairlet methods. Regarding
clustering fairness, as depicted in Figure 3-b, our approaches
maintain perfect balance in all experiments. This
is the result of the actual balance in the dataset and the
fairlets. Importantly, our methods divide all the
instances into capacitated clusters, as shown in
Figure 3-c, which demonstrates their advantage over the
competitors regarding cluster cardinality.


Summary of the results. In general, fairness is well maintained
in all of our experiments. When the data is fair, as in
the OULAD and MOOC datasets, our methods achieve
perfect fairness. In terms of cardinality, our methods
keep the cardinality of the resulting clusters within
the maximum capacity threshold, which is significantly
better than the competing methods. The fair-capacitated
partitioning-based method is better than the hierarchical
approach since it works with a capacity threshold closer
to the ideal capacity. Regarding the clustering cost, the
hierarchical approach has an advantage over other methods,
outperforming its competitors in most experiments.


6. CONCLUSION AND OUTLOOK 


In this work, we introduced the fair-capacitated clustering 
problem that extends traditional clustering, solely focusing 
on similarity, by also aiming at a balanced cardinality among 
the clusters and a fair-representation of instances in each 
cluster according to some protected attributes like gender 
or race. Our solutions work on the fairlets derived from 
Figure 1: Performance of different methods on UCI student performance dataset. Panels: a) Clustering quality (lower is better); b) Clustering fairness (higher is better); c) Clustering cardinality (number of instances per cluster). The x-axis of each panel is the number of clusters.

Figure 2: Performance of different methods on OULAD dataset
ity. In the future, we plan to extend our approach to more
than one protected attribute as well as to further investigate
what fair group assignment means in educational settings.
APPENDIX
A. MOOC DATASET
Figure 3: Performance of different methods on MOOC dataset


B. HIERARCHICAL FAIR-CAPACITATED ALGORITHM


Algorithm 2: Hierarchical fair-capacitated algorithm
Input: F = {F_1, F_2, ..., F_l}: a set of fairlets
       q: a given maximum capacity of final clusters
       W = {w_1, w_2, ..., w_l}: weights of fairlets
       k: number of clusters
Output: A fair-capacitated clustering
 1 compute the proximity matrix;
 2 clusters ← F; // each fairlet F_j is considered as a cluster
 3 repeat
 4     cluster_1, cluster_2 ← the closest pair of clusters;
 5     if capacity(cluster_1) + capacity(cluster_2) ≤ q then
 6         newcluster ← merge(cluster_1, cluster_2);
 7         update clusters with newcluster;
 8         update the proximity matrix;
 9     else
10         continue;
11     end
12 until k clusters remain;
13 return clusters;
ABSTRACT


Cardiopulmonary resuscitation (CPR) is a foundational life- 
saving skill for which medical personnel are expected to 
be proficient. Frequent refresher training is needed to pre- 
vent the involved skills from decaying. Regular low-dose, 
high-frequency training for staff at fixed intervals has proven 
successful at maintaining CPR competence but does not take 
into account inherent performance variability across learn- 
ers. Tailoring refreshers to an individual’s past performance 
would minimize personnel being trained too (in)frequently 
and would ensure faster knowledge acquisition for new learn- 
ers. To maximize the benefits of individualized schedules, 
learning needs gleaned from past training history need to 
be identified. A recent field study conducted among nursing 
students showed that a cognitive model-based approach was 
able to successfully trace the knowledge acquisition and decay 
of learners and prescriptively devise personalized training 
regimes that outperformed fixed schedules with regards to 
both training efficiency and learners’ performance. Here, we 
report a post-hoc analysis of the collected data to investigate 
whether an alternative modeling approach, blending cogni- 
tive modeling and machine learning, could have resulted in 
even higher quality predictions. Our results reveal modest 
improvements in predictive accuracy for ensemble models, in 
which machine learning models predict the prediction errors 
(i.e., residuals) of the standalone cognitive model. These 
promising findings reveal strong applied utility for future use 
in domains where sustained proficiency is a requirement. 


Keywords 
Predictive modeling; Cognitive model; Machine learning; 
Cardiopulmonary resuscitation; Learning 
1. INTRODUCTION


Cardiopulmonary resuscitation (CPR) is a basic life-saving 
skill but it has been shown that medical professionals are not 
able to perform it consistently [1]. To remedy this shortfall, 
several improvements to skill acquisition and maintenance 
programs have been proposed [6]. One dimension of the shift
in educational focus [5] emphasizes increased (re)training
efficiency by moving towards personalized, adaptive scheduling.
The current work aims to facilitate this development. 


Currently, as in many domains, refresher trainings at fixed 
intervals (i.e., regular and the same for everyone) are re- 
quired to maintain CPR compliance. A recent effort [14, 18] 
has shown that a cognitive model representing regularities 
of memory can be leveraged to devise personalized train- 
ing schedules that maintain proficiency at lower cost and 
risk. This effort will be referred to as the CPR field study 
throughout the current text, and the data collected during 
this experiment (see sections 2.1 and 2.2 for details) will form 
the bedrock on which the efforts presented here will build. 


Specifically, we conducted a post-hoc simulation study of 
the CPR field study data to explore whether the cognitive 
model’s predictions could be enhanced by combining it with 
machine learning (ML) models that can leverage additional 
information to improve predictive accuracy. The combination 
of the two modeling approaches is achieved by fitting the 
models sequentially, forming an ensemble model in which the 
cognitive model’s residuals are learned by the ML models. 
We show that their combined predictions afford a modest 
improvement over the cognitive model by itself and are prefer- 
able to using the ML models by themselves. 


Modern CPR training is an interesting educational data 
mining domain and modeling task because trainings are con- 
ducted on advanced manikins equipped with an array of sen- 
sors that quantify various aspects of a student’s performance 
against objective performance guidelines [16]. Consequently, 
large amounts of high-resolution data are readily collected 
for a given event. The challenge is that events are usually 
spaced months apart, which provides a sparse sampling space 
for knowledge tracing. Consequently, it is difficult to make 
precise predictions. However, given quality CPR’s central
role in the “chain of survival” [6, 5], even small improvements
in predictive accuracy can conceivably have large real-life
impacts, especially if predictive models can identify those
most in need of more frequent refresher trainings and help
them to maintain compliance.


Proceedings of The 14th International Conference on Educational Data Mining (EDM 2021) 415


Generally speaking, the fields of cognitive science and ma- 
chine learning have approached the computational modeling 
of a task like CPR skill acquisition and maintenance with 
different mindsets [24]. Specifically, cognitive models primar- 
ily focus on explaining the mechanisms that drive individual 
differences; machine learning models primarily focus on out- 
of-sample prediction. Aiming to combine the best of both 
approaches, we engineered a pipeline of predictive models: 
First, a cognitive model that was specifically developed as a 
prescriptive tool [13] is fit to the training data, which results 
in residuals that indicate which instances are fit poorly by 
the cognitive model. Next, a ML model is fit to those resid- 
uals, effectively learning to fine-tune the cognitive model’s 
predictions. We show that such an ensemble approach can 
provide improvements in predictive accuracy without sacri- 
ficing interpretability, which is important to retain so that 
the model’s personalized prescriptions are fully explainable. 
With an eye on advancing predictive tools in the domain
of CPR training, our core motivation is to assess whether
alternative predictive approaches could have yielded better
results in the CPR field study, so that the insights gained can be
leveraged in future applied settings.


1.1 The current study

Here, we use the exact version of the cognitive model that
was used in the CPR field study as a yardstick to determine:
(I) How well would alternative instantiations of the cognitive
model have performed? (II) Would a number of off-the-shelf
ML approaches have yielded better predictions than the
cognitive model? (III) Could ML models be used to
learn the prediction error of the cognitive model?


We believe the last question is the most pertinent. How- 
ever, the combined approach should be compared to the ap- 
proaches that only use either of the two modeling approaches 
in isolation to ascertain whether it has any benefit. 


2. METHODS 


This section outlines the data we used for our exploration,
the input features that are available in the data, the predictive
models we employed, and the setup of the simulation study
we conducted to address our research questions. Figure 1
provides a schematic overview of the approach; its parts
and connections are explained in the following.


2.1 Data 


Data were from a multi-phased, longitudinal study conducted 
at 10 nursing schools across the United States [18]. A to- 
tal of 475 nursing students started the study. Participants 
were randomly assigned to 4 initial acquisition conditions 
where they completed 4 consecutive CPR training sessions 
that were spaced by 1 day, 1 week, 1 month, or 3 months. 
Students were additionally randomized to 3 maintenance 
training conditions where they refreshed their skills for 1 
year at intervals of 3 months, 6 months, or at personalized 
intervals prescribed by the cognitive model. During each
session, students completed a series of CPR events using the
Resuscitation Quality Improvement (RQI®) system with
Laerdal’s Resusci Anne® adult QCPR manikin.


Figure 1: Schematic of the various predictive models.


First, students completed a pre-test consisting of 60
compressions or 12 bag-mask ventilations with no feedback from
the manikin, followed by trainings where students received
real-time, dynamic feedback to guide them, and then post-tests
with no feedback. RQI provides composite scores for the
quality of compressions and ventilations on scales of 0 to 100, 
with higher scores corresponding to better performance. The 
compression score is based on depth, rate, release, and hand 
placement. The ventilation score is based on volume, rate, 
and compliance with inspiration time. 


Prior to the onset of the study, participants completed a 
demographic questionnaire. Of the 475 participants that 
began the study, we included in our study the 393 that 
completed the initial acquisition phase. Due to the variations 
in retraining schedules across the maintenance phase, not all 
students completed the same number of sessions. We focus 
here on data from the first through the eighth session since 
the majority of students completed 8 sessions. 


2.2 Input features 

This section describes all information available in the “train- 
ing data” box of Figure 1. As sessions progress, the number 
of available training instances grows but the number of input 
features is constant. Using the color-coded arrows in Fig- 
ure 1 to categorize them, these features are detailed in the 
following and the labels correspond to those in Figure 4A. 


Gray arrow in Figure 1: time/lag: An event’s timestamp
expressed as the number of seconds since the first/previous
recorded event; score: The composite performance score
recorded for each event.


White arrow in Figure 1: compvent: Does the recorded event
correspond to performing compressions or ventilations?;
acqint and maintint: What acquisition-maintenance interval
combination was this user assigned to? Together, these
define the 12 experimental conditions (see the previous section for
details) that the condition PPE variant is based on; session:
Session counter 1 through 8; pretrnpst: Was this a
pre-test, training, or post-test?; site, age, gender, height, and
weight: Demographic information associated with each user
(time-invariant). There were ten sites/locations, and the other
information was coded in years, male/female, inches, and
pounds, respectively; profile: We reduced the unique user
IDs to a low-dimensional set of performance profiles. This
approach was inspired by earlier work [2, 23] that showed
a small number of descriptive profiles could be obtained by
performing k-means clustering [9]. Here, we used k = 4
and re-partitioned the training data on each iteration of the
simulation. Specifically, only pre-test scores across both skills
were taken into account.


Yellow arrow in Figure 1: decay rate and model time: The 
two PPE terms computed when fitting PPE (see section 
2.3.1) are passed to the ML models; residual: The difference 
between PPE’s fit and score. 


The three rightmost arrows indicate the predictions that
are made, emphasizing that the PPE+ML ensemble models
(purple) use both the cognitive (blue) and machine learning
(red) models. Next, we outline which ML models were used
and how the five PPE variants were fit to the data.


2.3 Predictive models 

As noted in Figure 1, there are a total of 29 models. Here,
we describe the PPE and ML models that make up the
PPE alone/ML alone predictions. The remaining majority
of models combine PPE and ML: we first fit the PPE as
described below, then compute the PPE’s residuals in the
training data, and finally train the ML model to predict
those residuals. The PPE+ML predictions can thus be thought
of as PPE predictions that were fine-tuned by a given ML model.
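

The sequential fitting described above can be sketched generically. In the following illustration the base model is a deliberately simple stand-in (a training-mean predictor), not the PPE, and the correction model is a one-feature least-squares line rather than any of the ML models used in the study; all function names and the sign convention residual = observed score minus base prediction are assumptions of this sketch.

```python
def fit_mean(xs, ys):
    """Stand-in base model: always predicts the training mean."""
    mu = sum(ys) / len(ys)
    return lambda x: mu


def fit_line(xs, ys):
    """Ordinary least squares for y = a + b * x with a single feature."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mx) ** 2 for x in xs)
    b = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / sxx
    a = my - b * mx
    return lambda x: a + b * x


def fit_residual_ensemble(xs, ys, fit_base, fit_correction):
    """Fit the base model, then fit a second model to its residuals."""
    base = fit_base(xs, ys)
    residuals = [y - base(x) for x, y in zip(xs, ys)]
    correction = fit_correction(xs, residuals)
    # The ensemble prediction is the base prediction plus the learned correction.
    return lambda x: base(x) + correction(x)


sessions = [1, 2, 3, 4]
scores = [52.0, 54.0, 56.0, 58.0]
model = fit_residual_ensemble(sessions, scores, fit_mean, fit_line)
print(round(model(5), 6))  # 60.0
```

The base model alone would predict 55 for every session; the correction model recovers the linear trend left in the residuals.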


2.3.1 Predictive performance equation (PPE) 

The PPE is a set of nested mathematical equations that 
capture findings in the cognitive science literature associated 
with the temporal dynamics of human learning and forgetting 
[26]. These include the power law of learning, the power law 
of forgetting, the spacing effect, and effects of relearning. Two 
essential components of PPE are sub-equations that model 
time and the rate of knowledge/skill decay. The model time 
equation captures the idea that the age of items in memory 
should be skewed toward the most recent presentations, but 
the full study history should be represented. Hence, model
time for each instance $i$ (across $n$ instances) is
$w_i \times t_i$, where

$$w_i = \frac{t_i^{-0.6}}{\sum_{j=1}^{n} t_j^{-0.6}}$$

and $t_i$ is the time, in seconds, relative to the
first instance. The decay rate equation captures the idea that
spacing practice across time produces more stable knowledge
that decays at a slower rate, while massing practice produces
less stable knowledge that decays at a faster rate. Since
model time and the decay rate are essential to how PPE
captures learning and decay, we include them separately
as features in the machine learning models. For a more
extensive description of the mechanics of the PPE, please
see [25, 26].
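
As a rough illustration of the weighting idea, the sketch below computes weights of the form t_i^(-0.6) normalized to sum to one, together with the weighted terms w_i × t_i. It assumes `times` holds strictly positive times in seconds; the function name `model_time_terms` is hypothetical, and summing the per-instance terms into a single value is an assumption of this sketch rather than something stated in the text above.

```python
def model_time_terms(times, exponent=0.6):
    """Return (weights, weighted_terms, total) for a list of positive times.

    weights[i]        = times[i] ** -exponent / sum_j times[j] ** -exponent
    weighted_terms[i] = weights[i] * times[i]
    total             = sum of the weighted terms (assumption of this sketch)
    """
    if not times or any(t <= 0 for t in times):
        raise ValueError("times must be a non-empty list of strictly positive values")
    raw = [t ** -exponent for t in times]
    norm = sum(raw)
    weights = [r / norm for r in raw]
    weighted_terms = [w * t for w, t in zip(weights, times)]
    return weights, weighted_terms, sum(weighted_terms)


# Hypothetical example: three instances at 10 s, 100 s, and 1000 s.
weights, terms, total = model_time_terms([10.0, 100.0, 1000.0])
```

Because the exponent is negative, smaller times receive larger weights, so the weighted total sits closer to the small values than a plain average would.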


In the CPR field study, PPE was fit separately to each 
participant’s history of compression and ventilation scores. 
We refer to this variant as the original PPE throughout 
the paper. The rationale for individual fits was from prior 
research suggesting that each individual would have unique 
learning and forgetting trajectories across sessions due to 
psychometric differences, the trajectories would vary for 
compressions and ventilations, and thus predictive accuracy 
would be maximized by fitting to each student on each skill. 


Here, we conduct post-hoc simulations to explore these
assumptions by comparing the methodology used in the field
study to less granular PPE variants in which free parameters
are fit to: experimental condition (acquisition and
maintenance intervals), skill (compressions and ventilations),
user, or user’s performance profile. By exploring these
different groupings for which a unique set of parameters is
estimated, we evaluate the trade-space between model
flexibility and predictive accuracy, and how this interacts
with the amount of data available for fitting the models.


2.3.2 Machine learning models 

We used four different machine learning models. Depending 
on the approach, these were either trained to predict the 
score (red arrow in Figure 1) or PPE’s residuals (purple 
arrow in Figure 1). For either task, all models had access to 
all input features outlined in section 2.2 (also see x-axis of 
Figure 4A). All models were run with the default settings of 
the cited R packages [21]. 


As the simplest model, we fit a single decision tree to the 
scores/residuals. Each tree was pruned through 10-fold cross 
validation—as implemented in [22]—to avoid overfitting. In 
most cases, this resulted in very shallow trees and sometimes 
even single node “stumps.” Hence, the decision trees can be 
thought of as baseline models. 


The most complex model was a random forest [4], which is 
an ensemble of decision trees. Using the default settings in 
[15], we used both bagging and random feature sub-setting 
to grow a forest of 500 trees. A recent comparison of gradient 
boosting algorithms included random forests as a comparison 
and nicely showed that they perform very well on a range 
of ML tasks and have the added benefit of not requiring 
hyperparameter tuning [3]. The disadvantage of random 
forests—as with many ML approaches—is that the internal 
mechanics that result in a prediction are not easily inspected 
(see our discussion around Figure 4 below). 


The two other models were ridge regression and the lasso
[10], which apply slightly different shrinkage terms when
coefficients are estimated. The key difference between the
two methods is that ridge regression will retain coefficients
for all input features, while the lasso effectively performs
feature selection (see section 6.2 in [12] for an introduction).
This generally makes lasso models more interpretable. Both
models have a single hyperparameter, λ, that was tuned
for each iterative prediction using the cross-validation
procedure implemented in [8]; all numerical features were
standardized.
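
For reference, the two penalties can be written in their standard textbook form (see [10, 12]). The notation is introduced here for illustration only: $y_i$ is the target (score or residual), $x_{ij}$ the standardized features, $\beta_j$ the coefficients, and $\lambda \geq 0$ the tuning parameter. Software implementations may scale the loss term differently.

```latex
\hat{\beta}^{\text{ridge}} = \arg\min_{\beta_0, \beta}
  \sum_{i=1}^{N} \Bigl( y_i - \beta_0 - \sum_{j=1}^{p} x_{ij}\beta_j \Bigr)^2
  + \lambda \sum_{j=1}^{p} \beta_j^2

\hat{\beta}^{\text{lasso}} = \arg\min_{\beta_0, \beta}
  \sum_{i=1}^{N} \Bigl( y_i - \beta_0 - \sum_{j=1}^{p} x_{ij}\beta_j \Bigr)^2
  + \lambda \sum_{j=1}^{p} \lvert \beta_j \rvert
```

The squared penalty shrinks coefficients toward zero without setting them exactly to zero, whereas the absolute-value penalty can set some coefficients exactly to zero, which is what produces the feature selection.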


2.4 Simulation study and analysis 

The approach to our simulation study can be summarized as 
follows: For each session s = 1,2,...,7, train the models on 
data up to session s and issue predictions for the pre-test of 
session s +1. We focused on pre-test scores because we were 
interested in predicting students’ readiness to perform CPR 
compressions and ventilations, prior to additional training. 
The procedure was run for all 29 (combinations of) mod- 
els described above and generated iterative predictions for 
sessions 2 through 8. The quality of predictions across the 
models will be compared by computing the mean absolute 
error (MAE), which summarizes how accurately, on average, 
each model predicts the scores recorded in the subsequent 
session. This yields 7 × (4 + 20 + 5) = 203 (i.e., number of
predicted sessions times number of models) errors. For the
sake of easier presentation of these results, we subsequently
summarize the errors by (i) computing the average MAE for
each model (Figure 2), and (ii) ranking the models based on
their errors (Table 2). These overall results are elaborated
on with additional figures and tables that highlight relevant
details.

Figure 2: Average MAE across sessions for all models.
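
The iterative procedure can be sketched as a simple loop. The code below is an illustrative Python outline, not the study's R pipeline: `run_simulation`, the data layout, and the 29 toy "models" (mean predictors with different offsets) are all hypothetical and exist only to show how 7 predicted sessions times 29 models gives 203 errors.

```python
from statistics import mean


def mean_absolute_error(observed, predicted):
    """Average absolute difference between observed and predicted scores."""
    return mean(abs(o - p) for o, p in zip(observed, predicted))


def run_simulation(data_by_session, model_fitters, n_sessions=8):
    """Train on sessions 1..s and predict the pre-test of session s + 1.

    data_by_session: {session: [(features, pretest_score), ...]}
    model_fitters:   {name: fit(training_rows) -> predict(features)}
    Returns {(model_name, predicted_session): MAE}.
    """
    errors = {}
    for s in range(1, n_sessions):  # s = 1, ..., 7
        train = [row for k in range(1, s + 1) for row in data_by_session[k]]
        test = data_by_session[s + 1]
        for name, fit in model_fitters.items():
            predict = fit(train)
            predictions = [predict(features) for features, _ in test]
            observed = [score for _, score in test]
            errors[(name, s + 1)] = mean_absolute_error(observed, predictions)
    return errors


def make_fitter(offset):
    """Hypothetical toy model: predicts the training mean plus a fixed offset."""
    def fit(train):
        mu = mean(score for _, score in train) + offset
        return lambda features: mu
    return fit


# Toy data (hypothetical): two observations per session for sessions 1..8.
data = {s: [((s,), 50.0 + s), ((s,), 60.0 + s)] for s in range(1, 9)}
fitters = {f"model_{i}": make_fitter(float(i)) for i in range(29)}
errors = run_simulation(data, fitters)
```

Averaging `errors` over sessions for each model gives the kind of per-model summary shown in Figure 2.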


3. RESULTS 


Figure 2 presents the average MAE for all 29 models and 
speaks to all three research questions posed in section 1.1. 
As detailed in section 2.4, the 203 prediction errors were 
aggregated across sessions and the average MAE for each 
combination of models is presented as a heatmap. The color- 
coding corresponds to the magnitude of the errors; lower 
values are better. By averaging across sessions, variation
in predictive accuracy as a function of session (see Figure 3)
is lost, but it becomes easier to assess the models’ relative
performance at a glance.


First, we can look at the “PPE alone” row in Figure 2 to 
compare the five PPE variants that were tested. Overall, the 
original instantiation of the cognitive model as used in the 
CPR field study—if used alone—does indeed outperform the 
more constrained variants explored here. This is somewhat
surprising since the original model exhibited signs of
overfitting (i.e., fit MAE lower than prediction MAE) that were
ameliorated for the constrained variants of PPE. However, it
appears that despite overfitting the training data, the
original variant of PPE did produce the best predictions after
all.


Second, whether a number of off-the-shelf ML approaches
would have yielded better predictions than the cognitive
model can be assessed by comparing the “original PPE
alone” cell (MAE = 19.5) with the prediction errors in the “ML
alone” column in Figure 2. All ML approaches yield average
errors larger than 19.5 when applied alone, which suggests
that the ML models tested here, if used by themselves,
would not have resulted in better predictions overall.
However, the differences in prediction errors are not large and
the ridge and lasso regressions in particular perform well on
average.


Figure 3: Comparison of all models in the “random forest”
row of Figure 2, showing prediction errors for each session.
(x-axis: session that predictions were made for, sessions 3 to 8;
y-axis: mean absolute error (MAE); lines: ML alone and the
condition, original, profile, skill, and user PPE variants.)

And third, and most importantly, we turn to the PPE+ML
combinations. These correspond to the larger facet labeled
“PPE variants” in Figure 2. A number of notable patterns 
emerge: the average MAE for the original PPE is hardly 
affected by adding any of the ML models to predict its 
residuals. This might be because this variant of PPE is 
very flexible, which restricts the residuals in the training 
data that the ML can actually fit to. For all other PPE 
variants, we see a gradient from top to bottom, with average 
MAEs decreasing with ML models relative to PPE alone. 
The decision trees are an exception to this pattern and seem 
to worsen the performance more often than not. Otherwise, 
we generally see the lasso and ridge regressions improving 
on PPE alone and the PPE+random forest resulting in the 
best performance for all PPE variants. 


Zooming in on the models using random forests, Figure 3
shows the session-by-session prediction errors of the
random forest alone (ML alone) and the PPE+random forest
combinations. We omitted predictions for the second
session because they are quite poor for the random forests
combined with the condition and skill PPE variants, which
distorts the y-axis and obscures differences between models
in the later sessions.¹ The figure highlights that the
random forest alone consistently performs worse than all other
combinations of models, in which the random forest is used
specifically to learn PPE’s residuals rather than observed
performance. This suggests that the most promising approach
is an ensemble in which a PPE variant captures the overall
temporal dynamics and issues predictions that are subsequently
fine-tuned by a random forest that can leverage all other
available input features.


¹ Predictions for session 2 were generally much worse than
for all subsequent sessions. We ran all analyses reported here
without session 2 predictions to confirm that our conclusions
do not depend on differences between models on session 2.
If session 2 is omitted, the skill PPE variant performs a
little better overall but results are not otherwise affected
drastically.


Table 1: Comparison of PPE variants across all ML models.

PPE variant    average rank    average MAE
profile        11.9            20.1
skill          12.7            23.2
original       15.1            19.3
user           17.2            20.8
condition      18.4            23.4


Another way to summarize these results is by ranking all 29
models’ MAE within each session and computing the average
rank for each PPE variant. These average ranks are shown
in Table 1, and reveal that although the original PPE yields
the lowest overall average MAE, both the profile and skill
variants achieve better average ranks. This suggests that if
ML models are leveraged to predict PPE’s residuals, more
constrained variants of PPE tend to perform better. However,
even the lowest average rank listed in Table 1 is relatively
high, suggesting that no model consistently outperforms
the others. This observation is confirmed by inspecting the
models’ MAEs in detail (not shown here), which reveals that
for some sessions, most models perform effectively identically.
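
The ranking step can be illustrated with a short Python sketch. It assumes errors are stored as a mapping from (model, session) to MAE; the function name `average_ranks` and the tie rule (tied models share the mean of the ranks they span) are assumptions of this sketch, since the text does not say how ties were handled.

```python
from collections import defaultdict
from statistics import mean


def average_ranks(errors):
    """Rank models within each session (1 = lowest MAE), then average per model.

    errors: {(model, session): MAE}
    Ties share the mean of the ranks they span (assumption of this sketch).
    """
    by_session = defaultdict(dict)
    for (model, session), err in errors.items():
        by_session[session][model] = err

    ranks = defaultdict(list)
    for session_errors in by_session.values():
        ordered = sorted(session_errors.items(), key=lambda item: item[1])
        i = 0
        while i < len(ordered):
            j = i
            # Extend j over all models tied with the model at position i.
            while j + 1 < len(ordered) and ordered[j + 1][1] == ordered[i][1]:
                j += 1
            shared_rank = (i + j) / 2 + 1  # mean of 1-based ranks i+1 .. j+1
            for model, _ in ordered[i:j + 1]:
                ranks[model].append(shared_rank)
            i = j + 1
    return {model: mean(model_ranks) for model, model_ranks in ranks.items()}


# Hypothetical toy errors for three models over two sessions.
toy_errors = {
    ("a", 2): 1.0, ("b", 2): 2.0, ("c", 2): 2.0,
    ("a", 3): 3.0, ("b", 3): 1.0, ("c", 3): 2.0,
}
toy_ranks = average_ranks(toy_errors)
```

Grouping these per-model averages by PPE variant and averaging again would give one number per variant, in the spirit of Table 1.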


Lastly, we present the top 10 models in terms of their ranking 
in Table 2. Here, the ranks are computed as an average 
across the seven sessions each model made predictions for. 
The best-performing model is the PPE with parameters for 
each performance profile whose residuals are predicted by 
a random forest. Figure 2 corroborates this observation,
showing that this combination of models obtains the lowest 
average MAE overall. Notably, all five instances of the 
original PPE and five out of six instances of random forests 
are represented in the top 10, confirming that these models 
perform very well in various combinations. 


3.1. Peeking into the random forest 

Space constraints limit the amount of model interrogation 
we can report here. However, we want to at least showcase 
one prominent example. Table 2 and Figure 2 show that the 
best model overall is the combination of the PPE variant 
with unique parameters for each performance profile and a 
random forest that learns its residuals. (This model is the 
blue line in Figure 3, which highlights that other models 
perform very similarly.) Figure 4A shows the normalized 
feature importance computed for each input feature (white 
and yellow arrows in Figure 1) for each iterative session that 
predictions are made for. Superimposed are the average 
importance and the spread (in black) and features are sorted 
from least to most important based on average importance. 
Notably, most time-invariant features (gender, age, etc.) are 
equally important across the seven iterations. The session 
counter, stability, and model time, on the other hand, 
become gradually more important as more sessions were 
included in the training data, while the opposite pattern is 
evident for compvent and users’ performance profile. 


Feature importance plots as shown here can be informative
but should be interpreted with caution since they do
not capture and visualize the potentially intricate non-linear
relationships between the various input features [17].
Furthermore, a feature’s importance and its impact on
predictions are not necessarily the same. More advanced
approaches exist [19] but are beyond the scope of the current
paper.


Table 2: The top 10 models overall sorted by average rank
across the seven predictions made by each model.


      PPE variant    ML model         average rank
 1    profile        random forest     5.3
 2    skill          random forest     6.7
 3    condition      random forest     7.9
 4    original       lasso             8.1
 5    original       random forest     8.3
 6    original       decision tree     8.4
 7    original       ridge             8.6
 8    user           random forest    10.1
 9    original       PPE alone        10.7
10    condition      ridge            11.9


Figure 4B zooms in on two important features and shows the
predictions made for the profile PPE model for the fourth
session against the residuals the random forest predicts for
each instance. We generally see the most differentiation
between models on Session 4, which is why we chose it;
however, this figure is broadly representative of the profile
PPE+RF dynamics for other sessions. Figure 4B suggests that
ventilations are more often down-adjusted than compressions
(i.e., more triangles below the equality line) unless PPE
predicts near-ceiling performance. The fact that model time
is consistently identified as the most important feature (see
Figure 4A) but no clear relationship between model time and
the magnitude of the adjustment (i.e., distance from the
equality line) emerges in Figure 4B highlights the
disadvantage of applying ML models, such as a random forest,
that are challenging to interrogate.


4. DISCUSSION 


The post-hoc simulation study reported here suggests that 
the original cognitive model used for prescriptive, adaptive 
scheduling in the CPR field study performed very well overall. 
In fact, in the aggregate, it resulted in lower average predic- 
tion errors than both the more constrained variants of PPE 
and the machine learning models included in the current 
comparison. Thus, it is unlikely that the tested off-the-shelf 
ML models would have performed better than the original 
PPE, although the regularized regression models (ridge and 
lasso) in particular achieved prediction errors similar to the 
original PPE. We expected the ML models to outperform
the cognitive model because the latter’s main “insights” (the
estimated model time and decay rate) were included as
input features to the ML models (yellow arrow in Figure 1).
That they did not suggests that the PPE, using much less
information, was slightly better at extrapolating performance
to the next session.


The current explorations also showed, however, that an
ensemble of a cognitive and an ML model has the potential to
perform slightly better than either alone. Notably, the more
constrained variants of PPE performed particularly well in this
ensemble arrangement. One possible explanation is that the
less flexible cognitive model operates as a smoothing function
on the temporal information, which leaves the ML to learn
under which conditions (i.e., [combinations of] input features)
the general temporal dynamics should be adjusted to
further improve predictions. This framing of the procedure is
conceptually akin to a two-step boosting algorithm [7] and
the “fine-tuning” of predictions induced by the second step
(the ML model in this case) is nicely illustrated in Figure 4B,
which shows the results of the first step (the constrained
cognitive model’s predictions on the x-axis), against the results
after the second boosting-like step (the PPE+ML predictions
on the y-axis). In this process, the initial PPE prediction’s
quite restricted range is expanded by the random forest,
which can, and does (cf. Figure 4A), draw on various input
features.


Figure 4: Details on the best-performing model. (A): The normalized feature importance for each iteration of the model with
superimposed averages (black dots). (B): Initial PPE predictions for Session 3 (x-axis) plotted against fine-tuned predictions
(y-axis); color-coding indicates the model time for each predicted instance and shape differentiates compvent.


The ensemble approach outlined here has the added benefit 
of a modular structure. Thus, it is easy to make design 
decisions, particularly regarding (i) which constraints should 
be built into the parameter fitting procedure for PPE, and 
(ii) what type of ML model is most informative. The latter 
will determine where on the continuum of interpretability the 
ensemble will fall. For example, the dynamics of the random
forest that predicts the profile PPE’s residuals (highlighted in
Figure 4) do not lend themselves to straightforward model
interrogation, but the PPE+lasso and PPE+ridge combinations
would not reduce the interpretability of the ensemble, while
slightly reducing prediction errors relative to PPE alone (see
Figure 2).


It should be pointed out, however, that the improvement in 
prediction errors relative to the original PPE alone is minor. 
Nevertheless, we consider these findings significant for two
reasons: First, the small improvement indicates that PPE’s
time-based mechanisms capture the majority of variance in
this task domain. Second, the PPE+ML ensemble approach
used here serves as a proof-of-concept that illustrates how
the core mechanism of PPE can be preserved while
incorporating an arbitrary number of additional input features. For
example, some of the input features used here were specific to 
the field study’s design (notably acqint and maintint) and 
would not be present in the hospital setting RQI systems are 
primarily deployed in. In such settings, however, other input 
features would be available (e.g., job title or department) 
and samples would be larger and more heterogeneous, which 


would conceivably introduce more variance that is not a 
function of time-based features alone. We expect that under 
these conditions, the ensemble approach’s advantage over 
PPE alone would be more pronounced. 


In the current effort, we chose to assess the models’ ability
to make session-by-session predictions. This approach meant
that events did not line up chronologically (a student in the
weekly acquisition condition will have completed the first four
sessions before a student in the 3-month condition returned
for their second session) but the amount of training data
available for each student is equalized; only the lag between
events varies. This reveals, for example, that predictions 
improve up to session 4 (the end of the acquisition phase; 
see Figure 3) and then get worse for session 5, which is when 
students switch to the maintenance phase. This suggests 
that the models get better at forecasting performance as 
more data from a consistent schedule becomes available, and 
that one should expect a dip in predictive accuracy as the 
temporal dynamics are altered. 


Future work in this domain should validate the approach 
presented here in more naturalistic data that more closely 
resemble how medical professionals train and maintain CPR 
proficiency. We believe that cognitive models in particular— 
and a cognitive-machine learning ensemble specifically—hold 
great promise in moving towards a predictive framework that 
affords personalized, adaptive refresher training schedules 
that are tailored towards individual learning needs—either 
of an individual or groups of learners that exhibit similar 
performance profiles. Furthermore, the outlined predictive
pipeline’s potential value in adaptive educational learning
systems outside of the medical domain should be explored.


5. ACKNOWLEDGEMENTS 

This work was funded through the 711th Human Perfor- 
mance Wing Chief Scientist Seedling award at the Air Force 
Research Laboratory. Data were wrangled in R using [28]; 
tables were created with [11], and figures with [27] and [20]. 




6. REFERENCES

[1] B. S. Abella, J. P. Alvarado, H. Myklebust, D. P.
Edelson, A. Barry, N. O’Hearn, T. L. V. Hoek, and
L. B. Becker. Quality of cardiopulmonary resuscitation
during in-hospital cardiac arrest. JAMA, 293(3):305-310,
2005.

[2] E. Ayers, R. Nugent, and N. Dean. Skill set profile
clustering based on student capability vectors
computed from online tutoring data. Educational Data
Mining, 2008.

[3] C. Bentéjac, A. Csörgő, and G. Martínez-Muñoz. A
comparative analysis of gradient boosting algorithms.
Artificial Intelligence Review, pages 1-31, 2020.

[4] L. Breiman. Random forests. Machine Learning,
45(1):5-32, 2001.

[5] A. Cheng, D. J. Magid, M. Auerbach, F. Bhanji, B. L.
Bigham, A. L. Blewer, K. N. Dainty, E. Diederich,
Y. Lin, M. Leary, et al. Part 6: resuscitation education
science: 2020 American Heart Association guidelines for
cardiopulmonary resuscitation and emergency
cardiovascular care. Circulation,
142(16_Suppl_2):S551-S579, 2020.

[6] A. Cheng, V. M. Nadkarni, M. B. Mancini, E. A. Hunt,
E. H. Sinz, R. M. Merchant, A. Donoghue, J. P. Duff,
W. Eppich, M. Auerbach, et al. Resuscitation
education science: educational strategies to improve
outcomes from cardiac arrest: a scientific statement
from the American Heart Association. Circulation,
138(6):e82-e122, 2018.

[7] Y. Freund, R. Schapire, and N. Abe. A short
introduction to boosting. Journal-Japanese Society For
Artificial Intelligence, 14(771-780):1612, 1999.

[8] J. H. Friedman, T. Hastie, and R. Tibshirani.
Regularization paths for generalized linear models via
coordinate descent. Journal of Statistical Software,
Articles, 33(1):1-22, 2010.

[9] J. A. Hartigan and M. A. Wong. A K-means clustering
algorithm. Journal of the Royal Statistical Society:
Series C (Applied Statistics), 28(1):100-108, 1979.

[10] T. Hastie, R. Tibshirani, and J. Friedman. The
elements of statistical learning: data mining, inference,
and prediction. Springer Science & Business Media,
2009.

[11] M. Hlavac. stargazer: Well-Formatted Regression and
Summary Statistics Tables. Central European Labour
Studies Institute (CELSI), Bratislava, Slovakia, 2018. R
package version 5.2.2.

[12] G. James, D. Witten, T. Hastie, and R. Tibshirani. An
introduction to statistical learning. Springer, 2013.

[13] T. S. Jastrzembski, K. A. Gluck, and G. Gunzelmann.
Knowledge tracing and prediction of future trainee
performance. In Interservice/Industry Training,
Simulation, and Education Conference, pages
1498-1508. National Training Systems Association,
2006.

[14] T. S. Jastrzembski, M. Walsh, M. Krusmark,
S. Kardong-Edgren, M. Oermann, K. Dufour,
T. Millwater, K. A. Gluck, G. Gunzelmann, J. Harris,
et al. Personalizing training to acquire and sustain
competence through use of a cognitive model. In
International Conference on Augmented Cognition, pages
148-161. Springer, 2017.

[15] A. Liaw and M. Wiener. Classification and regression
by randomForest. R News, 2(3):18-22, 2002.

[16] R. M. Merchant, A. A. Topjian, A. R. Panchal,
A. Cheng, K. Aziz, K. M. Berg, E. J. Lavonas, D. J.
Magid, A. Basic, P. B. Advanced Life Support, R. E. S.
Advanced Life Support, Neonatal Life Support, and
S. of Care Writing Groups. Part 1: Executive summary:
2020 American Heart Association guidelines for
cardiopulmonary resuscitation and emergency
cardiovascular care. Circulation,
142(16_Suppl_2):S337-S357, 2020.

[17] C. Molnar. Interpretable Machine Learning. 2019.
https://christophm.github.io/interpretable-ml-book/.

[18] M. Oermann, M. Krusmark, S. Kardong-Edgren, T. S.
Jastrzembski, and K. A. Gluck. Personalized training
schedules for retention and sustainment of CPR skills.
Simulation in Healthcare, 2021.

[19] T. Parr, J. D. Wilson, and J. Hamrick. Nonparametric
feature impact and importance. arXiv preprint
arXiv:2006.04750, 2020.

[20] T. L. Pedersen. patchwork: The Composer of Plots,
2019. R package version 1.0.0.

[21] R Core Team. R: A Language and Environment for
Statistical Computing. R Foundation for Statistical
Computing, Vienna, Austria, 2020.

[22] B. Ripley. tree: Classification and Regression Trees,
2019. R package version 1.0-40.

[23] F. Sense, M. Collins, M. Krusmark, and T. S.
Jastrzembski. Using k-means clustering for
out-of-sample predictions of memory retention. In
Proceedings of the 42nd Annual Conference of the
Cognitive Science Society, 2020.

[24] G. Shmueli et al. To explain or to predict? Statistical
Science, 25(3):289-310, 2010.

[25] M. M. Walsh, K. A. Gluck, G. Gunzelmann,
T. Jastrzembski, and M. Krusmark. Evaluating the
theoretic adequacy and applied potential of
computational models of the spacing effect. Cognitive
Science, 42:644-691, 2018.

[26] M. M. Walsh, K. A. Gluck, G. Gunzelmann,
T. Jastrzembski, M. Krusmark, J. I. Myung, M. A.
Pitt, and R. Zhou. Mechanisms underlying the spacing
effect in learning: A comparison of three computational
models. Journal of Experimental Psychology: General,
147(9):1325, 2018.

[27] H. Wickham. ggplot2: Elegant Graphics for Data
Analysis. Springer-Verlag New York, 2016.

[28] H. Wickham, M. Averick, J. Bryan, W. Chang, L. D.
McGowan, R. François, G. Grolemund, A. Hayes,
L. Henry, J. Hester, M. Kuhn, T. L. Pedersen, E. Miller,
S. M. Bache, K. Müller, J. Ooms, D. Robinson, D. P.
Seidel, V. Spinu, K. Takahashi, D. Vaughan, C. Wilke,
K. Woo, and H. Yutani. Welcome to the tidyverse.
Journal of Open Source Software, 4(43):1686, 2019.




Math Multiple Choice Question Solving and Distractor 
Generation with Attentional GRU Networks 


Neisarg Dave* 
Pennsylvania State University 


nud83@psu.edu 


Riley Bakes* 
Pennsylvania State University 


rob5372@psu.edu 


Barton Pursel 
Pennsylvania State University 


bkp10@psu.edu 


C. Lee Giles 
Pennsylvania State University 


clg20@psu.edu 


ABSTRACT 


We investigate encoder-decoder GRU networks with atten- 
tion mechanism for solving a diverse array of elementary 
math problems with mathematical symbolic structures. We 
quantitatively measure performances of recurrent models on 
a given question type using a test set of unseen problems with 
a binary scoring and partial credit system. From our find- 
ings, we propose the use of encoder-decoder recurrent neural 
networks for the generation of mathematical multiple-choice 
question distractors. We introduce a computationally inex- 
pensive decoding schema called character offsetting, which 
qualitatively and quantitatively shows promise for doing so 
for several question types. Character offsetting involves freez- 
ing the hidden state and top k probabilities of a decoder’s 
initial probability outputs given the input of an encoder, 
then performing k basic greedy decodings given each of the 
frozen outputs as the initialization for decoded sequence. 


Keywords 

Math Question Solving, Distractor Generation, Math Multi- 
ple Choice Questions, Mathematical Language, Math Educa- 
tion 


1. INTRODUCTION 
1.1. Problem Statement 


Here we focus on the needs of mathematics educators in high
school and early university education. One of the most tedious
jobs for a teacher is to create exams and quizzes and grade
them. The more time they spend on these tasks, the less
time they spend teaching students. An automated system
capable of creating reliable math questions of a consistent
difficulty level, creating solutions, generating distractors for
them, and finally grading them is the holy grail of
educational automation. In this paper we focus on solving
the math questions and generating distractors for Multiple
Choice Questions.


Questions in mathematics are different from other subjects 
such as English, History, or Economics. In mathematics and 
by extension in all STEM fields, questions and answers not 
only are composed of natural text but are often accompanied 
by symbolic equations, expressions, inequalities or relational 
information. Ganesalingam [6] postulates in his work that
these non-textual objects not only augment the context of 
the textual part but also derive their context from it. These 
non-textual objects are not part of the natural language and 
hence require special treatment. On the semantic level, math- 
ematical questions require an underlying understanding of 
rules before a question can be solved. A mere comprehension 
is not sufficient. To solve a simple problem in arithmetic, the 
fundamental understanding of the four operators is necessary. 


In this work we experiment with a network which has had
historical success on natural language processing problems
and test its ability to generalize mathematical knowledge
from an open source data set contributed by [22], consisting
of elementary focused question types. As a second problem,
for some question types, we examine whether models which
fail to generalize to the test set may have their incorrect
solutions leveraged as ‘good’ distractor options for multiple
choice questions like those seen on a math quiz. Following
the precedent of the data set contributor [22], mathematical
expressions are presented using Python’s operator syntax.


In summary we show the following: 


• Insight into the ability of an encoder-decoder attentional
GRU to extract semantic and syntactic meaning from
mathematical expressions.


• A simultaneous test of whether these models’ incorrect
predictions may be leveraged to auto-generate multiple
choice question distractors commonly seen in lower
education exams. Continuing with this potential
application, we experiment with the practice of character
offsetting, a modified greedy decoding schema which
pushes the networks to predict four separate sequences
instead of a single output, thus providing a complete
set of distractors.


• A qualitative measure of which question types show the
highest potential for leveraging character offsetting for
the purpose of multiple choice distractor generation.
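
To make the idea of character offsetting concrete, here is a minimal Python sketch of the decoding schema as described in the abstract. It is not the authors' implementation: the one-step decoder interface `step(hidden, token) -> (probabilities, new_hidden)`, the token names, and the toy decoder are all hypothetical, chosen so the example runs without a trained network.

```python
def character_offsetting(step, init_hidden, start_token, end_token, k=4, max_len=20):
    """Decode k sequences that differ in their first character.

    step(hidden, token) -> (probs, new_hidden), where probs maps token -> probability.
    The first decoder output (hidden state and top-k tokens) is computed once and
    frozen; each of the k tokens then seeds an ordinary greedy decoding.
    """
    first_probs, frozen_hidden = step(init_hidden, start_token)
    top_k = sorted(first_probs, key=first_probs.get, reverse=True)[:k]

    sequences = []
    for first in top_k:
        seq, hidden, token = [first], frozen_hidden, first
        while token != end_token and len(seq) < max_len:
            probs, hidden = step(hidden, token)
            token = max(probs, key=probs.get)  # greedy choice
            seq.append(token)
        sequences.append("".join(t for t in seq if t != end_token))
    return sequences


def toy_step(hidden, token):
    """Hypothetical stand-in for a trained decoder step."""
    if token == "<s>":
        return {"1": 0.4, "2": 0.3, "7": 0.2, "9": 0.1, "<e>": 0.0}, 0
    if hidden == 0:
        return {"5": 0.9, "<e>": 0.1}, 1
    return {"<e>": 1.0}, hidden + 1


candidates = character_offsetting(toy_step, None, "<s>", "<e>", k=4)
```

With the toy decoder, the four candidates share everything after the first character, which is the intended effect: one greedy answer plus near-miss alternatives that could serve as distractors.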


2. RELATED WORK 
Math Word Problems 


Early applications of machine learning and deep learning
based methods in math questions attempted to convert math
word problems into equations [10]. Here we rely on the
capability of a neural network to identify the textual and
equation parts of the question and model them correctly
in order to solve them. Much of the work on math word
problems relied on extracting equations from text and then
solving them using symbolic solver libraries such as Sympy.


Subhro Roy and Dan Roth created expansion trees and
Unit Dependency Graphs from arithmetic word problems.
Kushman et al. mapped word problems to
equations using canonical templates and handled ambiguity
using probabilistic models. Tencent AI Lab first used
deep neural networks for solving math word problems. They
used the Seq2Seq model with LSTM units for mapping math
word problems to the equations. MathDQN proposed
using Deep Q-Learning to map math word problems to
solvable equations. DeepMind attempted to solve math
word problems using the program induction technique, which
would also generate a rationale for choosing an answer. This
method did not involve mapping problems to equations but
rather reasoning in text form. Recent work focused on using
recursive neural networks for evaluating equations
mapped from word problems.


Polozov et al. and Liu et al. proposed methods for
generating math word problems. Liu et al. used Gated
Graph Neural Networks with Variational Autoencoders to
generate questions from knowledge graphs of mathematical
concepts and symbolic expressions. Polozov et al. take a
programming approach to encoding requirements from tutors
and students to create logical graphs with the help of an
ontology. This logical graph is then used to create
expressions and sentences with the help of primitive templates.


Lample et al. showed solutions for questions in differen-
tial equations, differential calculus, and integral calculus,
used transformer networks to solve calculus questions,
and compared their results with traditional solvers like Math-
ematica and Matlab. In contrast, Saxton et al. created
a codebase to generate math problems across fifty-six classes
and solve them using deep neural networks. Saxton et al.
compared the problem solving abilities of Seq2seq networks with
transformers. Though transformers showed better results
than the recurrent models, Saxton et al. commented
that the improvement in performance was largely due to the
higher capacity of transformer networks to remember rather
than their ability to solve.


Here we use the codebase created by Saxton et al. [22] to
generate math questions. Even though the codebase in its
original form cannot generate distractors, it can be modified
to create distractors using simple rules. In comparison to
other datasets which contain distractors, we chose to
use the codebase since it provided more control over the
generation of questions, and the templates used have
simpler language and an equation with each question.


In classroom and tutoring settings math questions are
more open ended. Erikson et al. tested the capability of
XGBoost, Random Forests and LSTMs in analyzing open
ended answers in mathematics. These models were created to
assist the teacher rather than to fully automate grading.
Michalenko et al. used LSTMs to solve polynomial
factorization problems. They created their dataset from
Wolfram Alpha. They use the trained network for automated
grading and a personalized feedback system.


Distractor Generation (DG) 


In multiple choice questions, the options which are not the
answer are called distractors, because their job is to distract
a student from a given correct answer. Distractor generation
has been studied for non-mathematical subjects, especially
English (Susanti et al. [23]) and other domain-specific tasks
(Aldabe et al. [2]). Distractor generation for scientific
subjects like physics, chemistry, biology, and economics was
explored by Liang et al. [13]. They used a two-stage
model with a classifier and a ranker to filter out the relevant
distractors. Liang et al. also explored distractors for fill in
the blank type questions using GAN networks.


Partial Credit Scoring 


Similar in spirit to our partial credit scoring system, Pho
et al. attempt to automatically score the quality of
manually created English multiple choice distractors using
various semantic and syntactic criteria including WordNet.
This is fundamentally different from our problem, however,
as we seek to automate the generation of the distractors
themselves and simultaneously seek a metric to help
measure the fundamental reasonableness of those distractors.


3. EXPERIMENTS 


3.1 Training Data 

The data set used in this paper [22] had the express pur-
pose of being a large scale training and testing framework
for benchmarking models on mathematical reasoning. The
framework consists of both training and testing sets. The
training set consists of 39 different math problem types, and
variants of 17 of those add the element of mathematical
composition to the problem’s statements, for a total of 56
question types organized into 8 different domains: probability,
polynomials, numbers, measurement, comparison, calculus,
arithmetic, and algebra. Each question type within a do-
main is split into three training sets (easy, medium and hard)
of 666,666 question answer pairs each, for a total of 2 million
examples per question type.


Difficulty measures the relative complexity of coefficients in
the expressions generated. As an example, compare from
the polynomial evaluation set the easy: ‘Let u(q) = q**2
- 6*q - 10. Calculate u(8).’, medium: ‘Let s(v) = v**3 +
47*v**2 + 471*v + 142. Give s(-33)’, hard: ‘Let h(a) =
-177071*a - 4957992. What is h(-28)?’ and an actual related
college algebra exam question [25]: ‘Evaluate the function
f(x) = 3 + (x-5)**(1/2) at x = 9.’. It is relatively clear that
examples from medium and hard are unlikely to appear on a
low level math examination. For this reason the majority of
experiments rely on the easy train set variants. Our
hypothesis was that, because the curriculum provided by these sets is
much more in line with the expected complexity of questions
appearing on actual low level exams, the models would
be more likely to generate better distractors (or even
the correct answer) when provided such an exam’s questions
as input.


The primary test method within the framework proposed by
Saxton et al. [22] is a data set referred to as interpolation. Every question
type has an associated interpolation test set. The set consists
of $10^5$ question answer pairs likely unseen in any of the easy,
medium and hard train sets of the associated question. The
guarantee of lack of repeated questions comes from a lower
bound on the probability of a question’s repeated generation.
A training question at most has a 10~® chance of reappearing 
in the test set. Saxton et al. [22] also release a secondary test set referred
to as extrapolation, which measures generalization of core
concepts across multiple question types. However, as we
were specifically interested in single question trained models
for the express purpose of multiple choice question distractor
generation, this test set was unused.


3.2 Rule Based Distractor Generation 
Evaluation of distractors is not an exact process. For a given
question there can be any number of distractors, some good,
others bad. There is usually only a very loose consensus on what
constitutes a good distractor, and no algorithm exists that
can gauge the “goodness” of a distractor. However, expert
educators have a keen sense for judging distractors from
their teaching and research experience. Educators usually
know where students make mistakes, and leverage that to
generate good distractors. For simple high school questions,
we can simulate this process by creating rules that mimic
the mistakes made by students. These rules can then be
embedded in a mathematical solver to produce distractors
for a given question. A simple set of rules can be created to
modify the solution steps to generate distractors for questions
containing a mathematical equality or inequality. Commonly
used rules are:


• Change One Sign: Randomly pick one coefficient or
constant in the equality/inequality and multiply it by −1.


• Change Two Signs: Randomly pick two coefficients or
constants of the equality/inequality and multiply them by −1.


• Most Frequent Number: Use the most frequent number
in the equality/inequality as a distractor.


• Nearest Multiple: Randomly pick a coefficient or a
constant in the equality/inequality and change it to the
nearest multiple of 2, 3 or 5.


• Random Drop: Randomly drop one of the coefficients
or constants in the equality/inequality.


• Invert Range: Invert the solution range of the inequal-
ity, e.g. change [0, 1] to (−∞, 0) ∪ (1, ∞).


• Trivial Solution: For inequality problems, one of the
distractors can be chosen from {∅}, (−∞, ∞), or No
Solution. For equations, choices are from 0, −1 or 1.


• Flip Brackets: Change an open bracket in the answer to
closed and vice versa. In the question, the order of
operations can be changed by changing the position of
brackets.


These rules can be coded as Python functions, and one or
two rules can then be selected at random to modify the steps
involved in solving the question. A symbolic library like SymPy
can be used for generating and solving the math questions.
The library developed by DeepMind [22] can create math
questions across various domains with varying difficulty levels.
We modify their codebase to extend its capability to generate
distractors based on the stated rules. Table 1 shows a few
examples of rule based distractor generation.
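As a rough illustration of how such rules might look as Python functions, the sketch below applies a few of them to a linear equation a*x + b = c. It is not the modified DeepMind codebase: the tuple representation `(a, b, c)`, the helper names, and the choice to zero out a constant for Random Drop are all assumptions made for this example, and exact fractions from the standard library stand in for a symbolic solver.

```python
import random
from collections import Counter
from fractions import Fraction

# Hypothetical representation: the equation a*x + b = c is the tuple (a, b, c).

def solve(a, b, c):
    """Solve a*x + b = c exactly; return None if there is no unique solution."""
    if a == 0:
        return None
    return Fraction(c - b, a)

def change_one_sign(eq, rng):
    """Change One Sign: multiply one randomly chosen number by -1."""
    nums = list(eq)
    i = rng.randrange(len(nums))
    nums[i] = -nums[i]
    return tuple(nums)

def change_two_signs(eq, rng):
    """Change Two Signs: multiply two randomly chosen numbers by -1."""
    nums = list(eq)
    for i in rng.sample(range(len(nums)), 2):
        nums[i] = -nums[i]
    return tuple(nums)

def nearest_multiple(eq, rng):
    """Nearest Multiple: move one number to the nearest multiple of 2, 3 or 5."""
    nums = list(eq)
    i = rng.randrange(len(nums))
    m = rng.choice([2, 3, 5])
    nums[i] = m * round(nums[i] / m)
    return tuple(nums)

def random_drop(eq, rng):
    """Random Drop: drop one constant (b or c) by setting it to zero."""
    nums = list(eq)
    nums[rng.choice([1, 2])] = 0
    return tuple(nums)

def most_frequent_number(eq):
    """Most Frequent Number: use the most common number itself as a distractor."""
    return Fraction(Counter(eq).most_common(1)[0][0])

def generate_distractors(eq, k=3, seed=0, max_tries=200):
    """Apply randomly chosen rules and keep k distinct wrong solutions."""
    rng = random.Random(seed)
    answer = solve(*eq)
    rules = [change_one_sign, change_two_signs, nearest_multiple, random_drop]
    distractors = []
    for _ in range(max_tries):
        if len(distractors) == k:
            break
        candidate = solve(*rng.choice(rules)(eq, rng))
        if candidate is None or candidate == answer or candidate in distractors:
            continue
        distractors.append(candidate)
    return answer, distractors

# Example: 2*x + 3 = 11 has the solution x = 4.
answer, distractors = generate_distractors((2, 3, 11))
print(answer, distractors)
```

Candidates that coincide with the true answer, or that leave the equation without a unique solution, are discarded, so every returned distractor is a distinct wrong answer.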


Question | Answer | Distractors
[Table 1 body: two example inequality questions, including one of the form ‘Solve the polynomial inequality: ...’, each with its answer and rule based distractors. The individual entries are not legible in this copy.]


Table 1: Distractors generated using rules 


Distractors generated using rules act as a form of reference
distractors. For qualitative evaluation of distractors gener-
ated by neural networks, we look at both sets of distractors
side by side in Table 3.


3.3. Experiment Detail 

Two principal experiments can be identified. An attentional
encoder decoder GRU is trained on a single question
type for the entire 666,666 easy train set. In keeping with the
spirit of the framework released by Saxton et al. [22], after training
a model on a specific question set we test the model on its
respective question’s interpolate test set rather than a subset of
the train set.


Simultaneously, during the second round of data collection
with the GRU, when the model is scored on the interpolate
set we perform character offsetting (see Section 3.3.1) and ask the
model to predict 3 distractors in addition to a primary so-
lution sequence. Two different scores were calculated for a
model’s performance on the test set. The first is a complete
binary accuracy where credit is assigned if and only if the
entire primary greedy decoded sequence matches the true
solution sequence. The second is a partial credit score, which is
calculated by subtracting from 1 the normalized Levenshtein
edit distance between the predicted and true solution. Nor-
malized in this context means the ratio of the edit distance
to the max sequence length of either the true or predicted
solution sequence. Thus for a given Levenshtein distance d
for solution sequence S and prediction sequence P we have
partial credit defined as


$$\text{partial credit} = 1 - \frac{d(P, S)}{\max(\operatorname{len}(P), \operatorname{len}(S))} \qquad (1)$$
[Figure 1 diagram: the encoder’s context embedding is passed to the decoder; at the initial time step the decoder’s vocabulary likelihoods are ‘-’: 0.60, ‘2’: 0.20, ‘1’: 0.19, ‘+’: 0.01.]


Figure 1: Example of character offsetting given an encoder’s
context embedding from the input question ‘-2 + 1’. Note that
decoder hidden states are not shown in the diagram. <SOS> signifies
the start of sequence character.


It is important to acknowledge the widely different possible
interpretations for what partial credit could mean in this
context and why we chose the definition we did. In a
typical educational setting, partial credit is assigned when
the student demonstrates sufficient understanding of the
problem but fails to arrive at the final correct solution.
This is difficult to measure in model outputs. Consider a
correct sequence of -2 and a predicted sequence of 2:
the edit distance is 1 and the longer sequence has length 2,
so the partial credit score would be 1 - 1/2 = 50%. In real life
a teacher may realize a student failed to report a final sign
change and assign significant credit. Such thorough review
is impossible given the 100,000 examples within a single
question type’s test set (and also impossible considering the
model’s inability to provide secondary and tertiary decision
making steps), so we formulate the above definition as
an attempt to empirically measure the ability of a model to
predict similar expressions to the correct one. We emphasize
that this is not an attempt to measure the model’s comprehension
of the mathematics it is predicting on. This matters because
the ultimate goal is to generate multiple choice
question distractors, which generally exhibit some form of
similarity to the correct solution. We discuss defining a good
distractor more fully in Section 3.3.2.
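The two scores described above can be sketched in a few lines of Python. This is a minimal illustration of Equation (1) using a standard dynamic programming Levenshtein distance, not the scoring code used in the experiments; the function names and the convention of returning 1.0 for two empty sequences are assumptions made here.

```python
def levenshtein(a: str, b: str) -> int:
    """Edit distance: minimum number of insertions, deletions and substitutions."""
    prev = list(range(len(b) + 1))  # distances from the empty prefix of a
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # delete ca
                            curr[j - 1] + 1,      # insert cb
                            prev[j - 1] + cost))  # substitute or match
        prev = curr
    return prev[-1]

def partial_credit(prediction: str, solution: str) -> float:
    """1 minus the edit distance normalized by the longer sequence length."""
    longest = max(len(prediction), len(solution))
    if longest == 0:
        return 1.0  # assumption: two empty sequences count as a perfect match
    return 1.0 - levenshtein(prediction, solution) / longest

def binary_credit(prediction: str, solution: str) -> float:
    """Credit only when the whole predicted sequence matches the solution."""
    return 1.0 if prediction == solution else 0.0

# The sign-change example from the text: predicting "2" for "-2".
print(partial_credit("2", "-2"))  # 0.5
print(binary_credit("2", "-2"))   # 0.0
```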


3.3.1 Character Offsetting 

We propose a modified greedy decoding schema called char-
acter offsetting for generating multiple choice question dis-
tractors. In a typical greedy decoding scheme for an encoder-decoder
sequence-to-sequence model, an input sequence (in
our case, a string literal representation of a math problem)
is given to the encoder, which generates a context embedding
[24]. Then, character by character, the decoder outputs a
response sequence based on this context. At every time step 
of the output sequence’s prediction, the previously predicted 
character and hidden states, as well as the encoder’s context 
output, are re-fed into the decoder. The actual output of 
the decoder at any step is a probability array the size of the 
model’s vocabulary ig). The highest value corresponds to the 
most likely next character in the sequence, at least according 
to the model’s weightings. In greedy decoding, at every
step we simply take the most probable character and append
it to the final output. Prediction is halted once the model
outputs the end of sequence character as the most probable
next step [3].


In character offsetting we freeze the initial decoder returned
probability array and hidden states. We now ask the decoder
to generate four total prediction sequences: one being the
expected greedily decoded output, a second with the second
most likely character from the frozen initial probability array
as the sequence’s starting character, and similarly a third
and fourth. Each time a new sequence is attempted, we reset
the hidden state to the saved initial hidden tensor. This
was found for several question types to generate diverse and
reasonable incorrect outputs. Table 4 provides a qualitative
ranking based on question type for this task.
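The procedure can be sketched independently of any particular network. In the minimal example below, `step` is a hypothetical stand-in for one decoder step that maps the previous character and hidden state to a probability table and a new hidden state; `toy_step` is an invented decoder whose first-step likelihoods are taken from Figure 1, while its later steps are made up purely so the example runs. Treating the hidden state returned by the first step as the "saved initial hidden tensor" is our reading of the description above.

```python
SOS, EOS = "<SOS>", "<EOS>"

def character_offsetting(step, init_hidden, num_sequences=4, max_len=30):
    """Greedy decoding restarted from each of the top first-step characters."""
    # Run the first decoder step once and freeze its outputs.
    probs0, hidden0 = step(SOS, init_hidden)
    ranked = sorted(probs0, key=probs0.get, reverse=True)[:num_sequences]

    outputs = []
    for first in ranked:
        # Reset to the frozen hidden state and force the starting character.
        chars, prev, hidden = [], first, hidden0
        while prev != EOS and len(chars) < max_len:
            chars.append(prev)
            probs, hidden = step(prev, hidden)
            prev = max(probs, key=probs.get)  # ordinary greedy choice
        outputs.append("".join(chars))
    return outputs

def toy_step(prev_char, hidden):
    """Invented decoder for the question '-2 + 1' (illustration only)."""
    if prev_char == SOS:
        probs = {"-": 0.60, "2": 0.20, "1": 0.19, "+": 0.01}  # from Figure 1
    elif prev_char in "-+":
        probs = {"1": 0.9, EOS: 0.1}    # a sign is followed by a digit
    else:
        probs = {EOS: 0.95, "0": 0.05}  # a digit usually ends the sequence
    return probs, hidden + 1            # hidden is just a step counter here

print(character_offsetting(toy_step, 0))  # ['-1', '2', '1', '+1']
```

The first sequence is the ordinary greedy output; the remaining three are candidate distractors, each costing one extra greedy decode.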


3.3.2 Difficulty in Defining a Good Distractor 
Defining a good distractor is a non-trivial endeavor, and 
we make no claim to have accomplished this in this paper. 
Rather we discuss qualities typically considered when trying 
to formulate distractors for a multiple choice assessment. 


Some qualities are readily apparent: general reasonability of
a distractor as a possible solution to the question posed is
perhaps the most fundamental requirement [7]. Measuring
reasonability may be accomplished in several ways. The differ-
ence in value between a distractor and the true solution is
potentially a good baseline: a distractor should be within a
context specific similar magnitude of the true solution to
avoid immediate exclusion. For lower level maths such as
the problem types discussed we believe this to be typically
within one order of magnitude.
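One possible way to turn this magnitude baseline into a check is sketched below. The function name, the base-10 threshold of one order of magnitude, and especially the handling of zero are assumptions made for illustration; the paper does not define a formal test.

```python
import math

def within_one_order_of_magnitude(distractor: float, answer: float) -> bool:
    """True when |distractor| and |answer| differ by at most a factor of ten."""
    a, b = abs(distractor), abs(answer)
    if a == 0 or b == 0:
        # log10(0) is undefined; as an assumption, treat zero as
        # comparable to any value whose magnitude is below 10.
        return max(a, b) < 10
    return abs(math.log10(a) - math.log10(b)) <= 1.0

# A correct answer of 5 with candidate distractors -5, 4, 1 and 5000.
for candidate in (-5, 4, 1, 5000):
    print(candidate, within_one_order_of_magnitude(candidate, 5))
```

Only the sign-insensitive magnitude is compared, so a sign-flipped answer such as -5 passes the check while 5000 does not.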


3.4 Model Parameters and Training Procedure 
The model experimented with was an encoder decoder atten-
tional GRU trained on a single NVIDIA 1080TI GPU for a
single train-easy curriculum question type from the data
set. The model’s encoder and decoder each had an embedding
layer of size 512, with the decoder having 16 attentional
heads. For a given training question type, we initially
attempted an encoder decoder hidden size of 2048
with a batch size of 256. If the 1080TI GPU memory was
insufficient for a given training question type, we alternated
between dropping encoder size and batch size. The parame-
ters for a given question type are recorded in Table 4. 150
training epochs were performed.


We follow most of the parameters used in [22]: the Adam
optimizer [9] was selected for minimizing the sum of the
log probabilities of the correct character with learning rate
$lr = 6 \times 10^{-4}$, $\beta_1 = 0.9$, $\beta_2 = 0.995$, and $\epsilon = 10^{-9}$,
and absolute gradient clipping of 0.1. The model leveraged
teacher forcing during training and used a 0.9/0.1 split of
training data into a train and validation set.
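For reference, the settings above can be collected into a small configuration object together with the 0.9/0.1 split. This is only a sketch: the class and field names are hypothetical, the Adam epsilon is read as 10^-9, and the shuffled split with a fixed seed is an assumption, since the paper does not say how the validation examples were chosen.

```python
import random
from dataclasses import dataclass

@dataclass(frozen=True)
class TrainConfig:
    """Hyperparameters as stated in the text (field names are hypothetical)."""
    embedding_size: int = 512
    attention_heads: int = 16
    hidden_size: int = 2048      # reduced for some question types, see Table 4
    batch_size: int = 256        # reduced for some question types, see Table 4
    epochs: int = 150
    learning_rate: float = 6e-4
    adam_beta1: float = 0.9
    adam_beta2: float = 0.995
    adam_epsilon: float = 1e-9   # assumption: epsilon read as 10^-9
    grad_clip_value: float = 0.1
    train_fraction: float = 0.9

def train_validation_split(examples, train_fraction=0.9, seed=0):
    """Shuffle once, then cut into train and validation parts."""
    items = list(examples)
    random.Random(seed).shuffle(items)
    cut = int(len(items) * train_fraction)
    return items[:cut], items[cut:]

config = TrainConfig()
train, validation = train_validation_split(range(1000), config.train_fraction)
print(len(train), len(validation))  # 900 100
```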


4. RESULTS AND ANALYSIS 
4.1 Attentional GRU on the Interpolate Set 


4.1.1 Performance Considerations 

It should be noted that these models were trained without
the greater context provided by the train-medium and
train-hard data sets, which interpolate also attempts to test
understanding of. It is also possible that the hard
or medium sets generalize better to interpolate for specific
question sets. A small test seems to support this: we let the
model train on the hard variant of algebra_linear_1d and
scores improved from 3.9% to 44.3%.
Metric | Mean Score
Binary | 0.065
Partial Credit | 0.679


Table 2: Comparison of attentional GRU partial credit and
binary scoring performance averaged across all questions not
disqualified in Section 4.2.1. This includes low potential questions
excluded in Table 4.


[Figure 2 chart: horizontal bars of Relative Score (0 to 1), with No PC and PC bars for comparison__closest, numbers__round_number_composed, numbers__is_factor, algebra__linear_1d_composed, numbers__place_value, numbers__round_number, numbers__place_value_composed, comparison__pair_composed, comparison__pair, and comparison__closest_composed.]


Figure 2: Top ten scoring questions when evaluated with the 
partial credit (PC) metric and no partial credit (binary) (No 
PC) score. 


4.1.2 Analysis: GRU Performance Binary Scoring 
Performance when scored with no partial credit varied widely.
Of the top ten scoring question types 5 are from the com-
parison module type. Based on overall poor performance
we are skeptical that the models extrapolate true mathe-
matical semantic meaning; rather, they likely just determine
meta solution strategies. For example, one hypothesis for the
comparison task success is that the notion of magnitude is
readily apparent based on the length of the input sequence.
We make two observations about this. First, this requires the
model to have gained the ability to isolate critical numeric
subsequences within a larger question prompt. Second,
as contrary evidence to this hypothesis of sequence
length metagaming, even for examples comparing
long decimal sequences of lesser magnitude to short whole
integers of greater magnitude we observe successful predic-
tions. Lastly, the implication of sign in comparison was
understood well by the models.


Generally it would appear that magnitude is an easier concept
for statistical pattern recognition to abstract. Most difficult
for the network were the evaluation of polynomials and other
algebraic expressions, and arithmetic. Binary scores in all
these categories were low, even given the considerations pro-
vided in Section 4.1.1.


4.1.3 GRU Performance Partial Credit Scoring 

Measurement with the partial credit metric (Table 2, Figure 2)
demonstrates greater consistency and performance and shows
promise for the ability of the attentional GRU to capture
the essence of a reasonable response and work as a distractor
generator for multiple choice questions, especially for types
of problems the model performed poorly on using a binary
scoring metric. However, we note some flaws in the mea-
surement. For example, in algebra_linear_2d_composed the
model would frequently predict a single negative sign, a safe
prediction. Because correct outputs are typically
only one or two characters long, this led to a significant boost in
score while answers remained meaningless. Interestingly, in
the question set’s non-composed variant algebra_linear_2d
the model’s outputs are mostly meaningful and the partial
credit score seemingly justified. To supplement the partial
credit score, a qualitative examination of the reasonability
of the model’s outputs compared to their empirical partial credit
scoring is provided in Table 4.


4.2 Multiple Choice Question DG 


4.2.1 Considerations 

The framework [22] releases a wide range of question types
posed in diverse formats. Not discussed until now is that
several formats are not conducive to training models.
Take for example questions from the comparison_closest set,
which are themselves posed as multiple choice questions:


‘Which is the nearest to -955? (a) -3/4 (b) 0.2 (c) 17/3 (d)
3/5 (e) 0.5’


A model is supposed to predict either a, b, c, d, or e. Of the
data sets released only four were found to use such a format
for some or all questions within the set. Similarly,
six sets were either partially or fully posed as simple true
and false questions. Naturally such questions are removed
from consideration for our goal of distractor generation.


Partial credit was found to be an effective indication for
many question types of whether a model’s principal predicted
sequence captures what a reasonable response should look
like. Some faults exist, however. Consider the question type
comparison_sort. An example: ‘Sort -1, 0.3, -6, -24/11, 3, 5, 1
in descending order.’ with primary prediction ‘5, 3, 1, -0.3, -1,
-24/11, -6’. Observe the -0.3 in the prediction output, a value
which is not even an option given in the problem statement!
Such an example would not make a worthwhile distractor
as it fails to test for the mathematical notion this question
targets, despite its high partial credit score. In Table 4 we provide
a qualitative review per question type of character offset
predictions for multiple choice question distractor generation.
By comparing to their respective partial credit score we find
that generally a high score is an indicator for character offset
predictions to also be reasonable.
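The fault in the sorting example is easy to detect mechanically, which suggests one kind of filter that could complement the partial credit score. The sketch below is not part of the reported experiments; the helper names are hypothetical and it assumes comma separated values that Python's `Fraction` can parse.

```python
from collections import Counter
from fractions import Fraction

def parse_values(text: str):
    """Parse a comma separated list such as '-1, 0.3, -24/11' exactly."""
    return [Fraction(token.strip()) for token in text.split(",")]

def uses_only_given_values(question_values: str, prediction: str) -> bool:
    """True when the prediction is a reordering of the values in the question."""
    return Counter(parse_values(question_values)) == Counter(parse_values(prediction))

given = "-1, 0.3, -6, -24/11, 3, 5, 1"
predicted = "5, 3, 1, -0.3, -1, -24/11, -6"
print(uses_only_given_values(given, predicted))  # False: -0.3 was never given
```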


4.2.2 Character Offset DG: Interpolate Sampling 
The following are a couple of curated model responses—the 
order of the distractors matches the ordering of the probabil- 
ity of the initial character offset. So the first value listed is 
the model’s primary greedily decoded prediction, the second 
is the sequence generated when we force the initial character 
to be the second most probable, and so on. If the model 
predicts the correct solution it is bolded. 


‘solve -2*s - 40 = -2*j, 53*j - 62*j + 245 = 4*s for s.’ 
output: 5, -5, 4, 1 


‘let i(a) = -a**2 + 1319*a - 22130. calculate i(17).’ 
output: 4, -4, 5, 14 
Here we see a pair of ideal outputs from the algebra_linear_2d
and polynomials_evaluate sets respectively. Not only did the
model successfully predict the correct solution sequences of
5 and 4, but it also produced noteworthy distractors. Observe
the similar magnitude and the predictions of -5 and -4: sign
changes of the true solutions, which make for a powerful arithmetic skills
check.


Quantitative measurement of the potential for character off-
setting in generating multiple choice question distractors is
a difficult endeavor due to the lack of a formal definition of
the problem. However, we observe general groupings of ques-
tions which seem to have potential for the use of character
offsetting as a computationally cheap method of doing so.
Table 4 is a qualitative ranking, given a random subset
of outputs for each interpolate question type, of whether offset
predictions are not reasonable (low, not reported), whether at
least one offset sequence is reasonable (medium), or whether
most offset sequence predictions are reasonable (high). We
view a prediction as unreasonable if it is mathematically
meaningless or, as mentioned in Section 4.2.1, either unreasonable
given the problem’s context or disqualified due to format.
We find that a higher partial credit score of the primary predic-
tions frequently but not always aligns with whether offset
predictions would also be reasonable.


Table 3 shows the comparison of the model generated dis-
tractors and rule based distractors for five questions. We
generated four distractors for each question from the model
and ten distractors from the rule based system.
In the table we show the most closely matching distractors.
As we can see, we get two matches each for questions (1), (4)
and (5). There are three matching distractors for question
(3) while there is no matching distractor for question (2).
Despite there being no matching distractor, our model gets the
form of the distractors correct, and the predicted distractors
at a glance could be used on a real quiz. In question (4), we
can see that the model gets the multiples of 120 right and
tries to stay around them. From the above examples we can
see that our model first tries to get the form of the answer
correct and then aims for computational compositionality.


4.2.3 Character Offset DG: Standard Exam Sampling 

We sample several actual standardized exam questions from 
the SAT, ACT, and a college algebra midterm. Questions 
are altered before being fed to a model so that mathemati- 
cal syntax matches Python’s. Interestingly models appear 
resilient to significant changes in the question’s formulation. 
Correct exam solutions are bolded, and generated options are 
in order of the probability of the initial character offsetting. 
The exam distractors are provided for comparison below but 
are removed before the question is fed to a model. 


‘What is the greatest common factor of 42, 126, and 210?
A) 2 B) 6 C) 14 D) 21 E) 42’
output: 42, 6, 21, 12


An interesting example [1], as the numbers_gcd data set only
ever presents two values to find the gcd of, while the above
presents three. Not only does the model predict the correct
solution, but also two distractors used in the actual exam.


‘Evaluate the function f(x) = 3 + (x-5)**(1/2) at x = 9. A)
1 B) 5 C) 6 D) 7’
output: 5, 49, -5, 9


Again we observe relatively reasonable responses given ques-
tion formulations which diverge significantly from the tem-
plates trained on. This question is similar to a polyno-
mials_evaluate type. However, the difference is that no fractional
powers exist in the train-easy set. Yet even with the power
symbol being replaced by the unknown character, the model
still generates valid distractors (and the correct solution, but
this is clearly by chance as the model has no knowledge of
roots).
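Because the exam questions are rewritten in Python syntax, the reference answers for the two examples above can be checked directly. The short sketch below does only that; the helper names are ours and it plays no part in how the model produces its outputs.

```python
from functools import reduce
from math import gcd

def greatest_common_factor(values):
    """Fold the two-argument gcd over any number of integers."""
    return reduce(gcd, values)

def f(x):
    """The college algebra example, written exactly as fed to the model."""
    return 3 + (x - 5) ** (1 / 2)

print(greatest_common_factor([42, 126, 210]))  # 42
print(f(9))                                    # 5.0
```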


5. CONCLUSION 


5.1 Summary 

Two experiments quantitatively showed that a GRU has 
mixed results when attempting to solve elementary math 
problems. Our alternative goal of multiple choice distrac- 
tor generation for several question types typically found in 
pre-undergraduate education by applying a modified greedy 
decoding schema referred to as character offsetting was suc- 
cessful. Evaluation using an edit distance based partial 
credit scoring metric as opposed to a binary one demon- 
strates greatly increased consistency and performance for 
capturing a reasonable response. We found the following: 


• Generally the easiest math problem types for a GRU
are comparison tasks, which is not surprising since
this is a fundamental problem encountered early in
education. It would appear the ability of GRUs to
abstract mathematical knowledge is minimal.


• The ability of networks to capture the essence of a
reasonable response for several question types is shown.
Leveraging the proposed practice of character offset-
ting we show that these networks can cheaply generate
distractor options for multiple choice questions.


5.2 Future Work 


It would be interesting to compare a beam search decoding’s
non-principal predicted sequences to those produced by char-
acter offsetting and to see whether for certain question types more
worthwhile distractors are produced. The general capability
of character offsetting to produce at least one worthwhile
distractor for the medium potential questions listed in Table
4 hints that, with some refinements to the decoding schema or
training parameters, these could potentially become high potential
question types.
Supplementary Material - Appendix


S.No. | Question | Answer | Distractors (Model) | Distractors (Rule Based)
—1 —9 
1 express —105c? — 5c? + 7e — 50c + 41c¢ + 332c? _9 0 0 
in the form rc? + ic? + b+ uc and give u. 3 —16 
2 2 
—s —45s 
9 | let k(w) = 2w? — 4w? + 3w?. let z = —29 + 33. Aba! 8s4 45s4 — 4 
let r(o) = —4 — 390? + 840? + z. give r(k(s)). i 9s* 135s 
1834 0 
1 1 
let u be 1/(114/56 — 2). 4 9 
3 suppose —3d = —3p + 24, —16d — u = —4p — 13d. -1 2 2 
solve c— 32+ 2 = —c, 0 = —4c — pz — 4 for c. 0 0 
—1200% —1206° 
4 | let v be 4/(—14) + —1* (—596)/28. let a = v — 15. what is the third | 7 9)3 1200 120b% 
derivative of —6b? — 9b + 23b% — 8b° wrt b? 60b3 —720b3 
240b3 360b3 
18 7 
5 let a(h) = 2h? — 9h +4. suppose —26w — 20w = —55w + 126. rl 8 8 
what is the remainder when a(—7) is divided by w? 21 3 
9 9 


Table 3: Qualitative Comparison of Distractors Generated using Neural Network Model and Rule Based System 


Potential | Question | Encoder Hidden Size | Batch Size | Partial Credit Score | Mean Score
High | algebra_linear_2d_composed | 2048 | 128 | 0.837 | 0.766
High | algebra_linear_2d | 2048 | 256 | 0.811 |
High | algebra_linear_1d_composed | 2048 | 128 | 0.887 |
High | algebra_linear_1d | 2048 | 256 | 0.694 |
High | algebra_sequence_next_term | 2048 | 128 | 0.774 |
High | arithmetic_mul_div_multiple | 2048 | 256 | 0.772 |
High | arithmetic_nearest_integer_root | 2048 | 256 | 0.730 |
High | polynomials_evaluate_composed | 2048 | 128 | 0.754 |
High | polynomials_evaluate | 2048 | 128 | 0.709 |
High | polynomials_expand | 512 | 128 | 0.642 |
High | polynomials_coefficient_named | 2048 | 128 | 0.733 |
High | numbers_gcd_composed | 2048 | 128 | 0.760 |
High | numbers_gcd | 2048 | 256 | 0.762 |
High | numbers_lcm_composed | 2048 | 128 | 0.737 |
High | numbers_div_remainder_composed | 2048 | 128 | 0.800 |
High | numbers_div_remainder | 2048 | 256 | 0.762 |
High | numbers_place_value_composed | 2048 | 128 | 0.859 |
Medium | algebra_sequence_nth_term | 1024 | 128 | 0.507 | 0.624
Medium | arithmetic_add_or_sub | 2048 | 256 | 0.501 |
Medium | arithmetic_mul | 2048 | 256 | 0.451 |
Medium | arithmetic_div | 2048 | 256 | 0.559 |
Medium | arithmetic_mixed | 2048 | 256 | 0.688 |
Medium | arithmetic_add_sub_multiple | 2048 | 256 | 0.755 |
Medium | arithmetic_add_or_sub_in_base | 2048 | 256 | 0.682 |
Medium | calculus_differentiate_composed | 1024 | 128 | 0.589 |
Medium | calculus_differentiate | 512 | 128 | 0.730 |
Medium | measurement_time | 2048 | 256 | 0.838 |
Medium | numbers_lcm | 2048 | 256 | 0.564 |

Table 4: Qualitative ranking of the potential for models to use character offsetting for generating distractors based on observed
predictions on interpolate. Questions not listed are those whose predictions were generally unreasonable as defined in Section 4.2.2 or
disqualified due to formatting mentioned in Section 4.2.1. Model specifications are included as well. Decoder hidden size was 2048 for
all models. Encoder/Decoder embedding dimension and number of attentional heads were 512/512 and 16 respectively.
ABSTRACT


Collaborative dialogue is rich in conscious and subconscious 
coordination behaviours between participants. This work 
explores collaborative learner dialogue through theories of 
alignment, analysing inter-partner movement and language 
use with respect to our hypotheses: that they interrelate, 
and that they form predictors of collaboration quality and 
learning. In keeping with theories of alignment, we find 
that linguistic alignment and gestural synchrony both corre- 
late significantly with one another in dialogue. We also find 
strong individual correlations of these metrics with collab- 
oration quality. We find that linguistic and gestural align- 
ment also correlate with learning. Through regression anal- 
ysis, we find that although interconnected, these measures in 
combination are significant predictors of collaborative prob- 
lem solving success. We contribute additional evidence to 
support the theory that alignment takes place across multi- 
ple levels of communication, and provide a methodological 
approach for analysing inter-speaker dynamics in a multi- 
modal task based setting. Our work has implications for the
teaching community: our measures can help identify poorly
performing groups, lending themselves to informing the design of
real time intervention strategies or formative assessment for
collaborative learning.


Keywords 
Dialogue, Gesture, Natural Language Processing, Alignment, 
Collaborative Learning 


1. INTRODUCTION 


Collaborative problem solving has long been a focus of ed- 
ucational research, and has been deemed an educational 
learning objective of critical importance in the 21st century 
workforce [12]. In the education literature, collaboration 
success is often analysed with respect to a joint problem 
space as created through learner interaction [31]. This joint 
problem space integrates learner shared goals, descriptions 
of the problem state, awareness of available problem solving 
actions and the associations between these aspects. This 
shared conceptual space emerges 
through shared language, situation and activity. 


Alignment in dialogue, the language component of this inter- 
action, is also commonly attributed to an automatic mecha- 
nism to achieve a shared understanding, or situation model 
[27]. This account of dialogue predicts alignment across 
various modes of communication, from word level to ges- 
ture and gaze patterns. Specifically in a task based setting, 
alignment is thought to aid mutual understanding [10, 27]. 
These theories of alignment and collaborative learning when 
taken together suggest that convergence at many levels of 
communication will take place in parallel over the course of 
collaborative dialogue, and that this alignment will be in- 
dicative of collaborative success. Additionally, the collabo- 
rative learning literature suggests that the effort necessary to 
build shared understanding is what actually leads to learn- 
ing [40], thus alignment, already found to be predictive of 
student learning in a teacher student context [42], may also 
be indicative of this. 


Investigating collaborative problem solving through the lens 
of alignment can give additional insights to this complex 
problem of convergence [38]. In this work, we examine the 
synchrony and convergence between students at both a linguistic 
and gestural level, via inter-student metrics of linguistic 
alignment and movement synchrony. Of particular 
interest in this study is the separate coding of collaboration 
and learning in the learner dialogues. This allows for side by 
side comparison of the different modalities, and the analysis 
of their interaction with respect to these outcomes. Gestural 
and linguistic coordination between interlocutors have long 
been linked in dialogue, and both properties have been individually 
explored for facilitating collaboration and learning 
in various settings. 


We offer an exploration of theoretically motivated metrics 
to capture synchronisation and alignment at the levels of 
linguistic expressions and movement patterns. We explore 
correlations between the measures themselves and between 
collaboration and learning. Exploring the relationship be- 
tween these measures we find strong correlations between 
modalities, in line with the collaboration and alignment lit- 
erature. Finally we explore the combination of these modal- 
ities in their predictive power for both collaboration and 
learning, finding that although they are interrelated, each 
plays a significant role in prediction quality. 


1.1 Research Questions 

Drawing on the literature on 
collaboration, learning and alignment, we hypothesise that 
more successful learner dyads will converge in both their 
language use and their movement patterns to a visible de- 
gree, as the students align their mental models during the 
learning process. We also hypothesise that together, these 
aspects of interaction can provide useful tools in the analysis 
of student learning. We split our analysis into the following 
research questions: 

RQ1: Evidence of Convergence: Are linguistic alignment, 
gestural synchrony and convergence effects between 
students higher than would arise by chance from task and vocabulary effects? 

RQ2: Convergence Vs Collaboration & Learning: How 
do our measures of convergence correlate with collaboration 
and learning? 

RQ3: Interaction: Linguistic Vs Gestural: Do these 
modalities correlate with one another? 

RQ4: Multimodal: Are the measures of synchrony, 
in combination, predictive of learning and 
collaboration outcomes? 


2. BACKGROUND 


Evidence of convergence. Experimentally, in a two-person 
dialogue setting, speakers have been found to converge, 
each aligning to their interlocutor at many levels of communication: 
lexical, structural, gestural and conceptual [27]. Speakers 
have been found to spontaneously coordinate body postures 
and gaze patterns during conversation [35]. Behaviour 
matching in multimodal communication has also been found 
to be temporally synchronised in collaborative task-based 
activity, when participants are facing each other [22]. Im- 
itation or mimicry between people unconscious of this be- 
haviour has been found in incidental mannerisms such as 
the bouncing of a foot, or rubbing a nose [9]. People imitate 
one another in dialogue across many different modalities, 
including lexical choice [17], accent [18], pauses [7], speech 
rate [43] and syntax [30, 4, 26]. This imitation has been 
linked to social benefits: for example, [9] find that 
speakers whose incidental mannerisms 
were mimicked perceived the interaction as running more 
smoothly than those whose mannerisms were not. In terms of collaboration, 
multimodal behaviour matching has been found to 
occur in a synchronous manner in a task based collaborative 
dialogue where rapport and its role in learning and conver- 
gence was investigated [22]. Acoustic prosodic entrainment 
has also been found to correlate with rapport, a social qual- 
ity of the interaction, in collaborative learning dialogues [23]. 
Parallel to this, it has been argued that infants’ early skills 
of joint attention reflect their emerging understanding that other 
people exist as intentional agents [8], as they develop theory 
of mind. In terms of learning, lexical entrainment has 
been shown to correlate with success in multiparty student 
engineering group project meetings [16], where higher scor- 
ing teams were more likely to increase their entrainment 
in project words over the course of a dialogue, while lower 
scoring teams were more likely to diverge. Alignment level 
has been shown to vary with student ability [36], and con- 
vergence of lexical and speech features from student to tutor 
in spoken tutorial dialogue corpora has been shown to be a 
useful predictor of learning [42]. 


Language and gesture in learning. Gesture has an im- 
portant role in teaching and learning [32], as does language, 
which, at a structural level, has been shown to exhibit effects 
characteristic of both learning and implicitness, thought of 
as an aspect of alignment or coordination between interlocutors 
[15]. A wide range of linguistic features derived from 
student dialogues have been found to be effective predictors 
of both learning gains and collaboration quality [29]. 
Categorisations of gesture in educational settings often adopt 
the framework proposed by [24], which separates gestures into 
four basic types: beat (gestures devoid of topical content 
which nonetheless lend temporal or emphatic structure, e.g. hand 
tapping or head movement for emphasis), deictic (concrete or 
abstract pointing, e.g. to match an object referred to as ‘this’ 
or ‘that’, or a concept in the past that is being referred to), 
iconic (also referred to as representational, e.g. making a 
gesture of putting a phone to one’s ear), and metaphoric (gestures 
that illustrate abstract concepts, such as moving hands 
together to illustrate mathematical convergence, or drawing 
a trend line in the air to demonstrate positive correlation). 
While gestures are pervasive, not all types are equally represented 
in particular speech events, and their use depends heavily 
on the dialogue context. Nevertheless, various studies have found 
that gesture and speech together provide a better index of 
mental representation than speech alone, and that gesture is an 
important aspect of learning [19, 11]. 


3. METHODS 

Speaker   Utterance 

Left:     then we need to turn left . again put an if and then turn the view book . 
Right:    so another if do statement ? 
Right:    was it just when you write in the sensor here , right into that . i say talk to motor ? all right , a and b . 
Left:     if it is that , then we take a left . then turn left . 
Right:    turn left , which is right here , right , so 
Left:     put this inside that and then again , we need to turn left . 
Right:    another if statement ? remember , control . and then talk to motor again . turn left 
Left:     turn left comes after . 
Right:    and we need one for ... 


Table 1: Example dialogue excerpt. Expressions in bold 
indicate shared lexical constructions. 


Experimental Setup. The experimental setup consisted of 
40 pairs of undergraduate students participating in a col- 
laborative problem solving task of programming a robot to 
traverse a maze. The participants had no prior programming 
experience. The participants sat facing a computer screen, 
and were recorded as they worked through the shared ex- 
ercises. During the collaborative aspect of the task, the fo- 
cus of our analysis in this work, participant dialogue was 
recorded and subsequently transcribed. Body movement 
data was also recorded via a Microsoft Kinect sensor. This 
resulted in timestamped language and movement data for 
the 30 minute period of the task. The participants were 
individually given a pre and post test on a similar set of 
exercises, in order to evaluate the relative learning in the 
dyads. The learning assessment consisted of four short an- 
swer or fill-in-the-blank questions that assessed their under- 
standing of basic computer science competencies (adapted 
from [5, 44]). Learning gains were computed by subtracting 
pre-test scores from post-test scores and dividing by the total 
number of points that could be gained minus the pre-test score [13]. Collaboration 
was evaluated on a series of axes derived from the 
collaboration assessment measures proposed in [25]: sustaining 
mutual understanding, dialogue management, information 
pooling, reaching consensus, task division, time management, 
technical coordination, reciprocal interaction, and 
individual task orientation¹. Each dimension was given a 
score between -2 and +2 (each score was defined with concrete 
behaviors in a codebook). Researchers double-coded 
20% of the sessions and obtained a Cronbach’s alpha of .65 (75% 
agreement). An overall measure of collaboration was defined 
as the aggregate of all collaborative features. More details 
of the study, the data, experimental setup and coding of col- 
laboration and learning scores can be found in Reilly et al. 
2019 [29]. 
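
The learning gain computation described above can be illustrated with a short sketch. This is not the authors' code: the function name, the argument names and the handling of a perfect pre-test score (where the denominator would be zero) are assumptions made for illustration.

```python
def normalised_learning_gain(pre: float, post: float, max_score: float) -> float:
    """Gain = (post - pre) / (max_score - pre)."""
    if not 0 <= pre <= max_score or not 0 <= post <= max_score:
        raise ValueError("scores must lie between 0 and max_score")
    if pre == max_score:
        # No room to improve, so the ratio is undefined.
        # Returning 0.0 is an assumed convention, not one stated in the study.
        return 0.0
    return (post - pre) / (max_score - pre)


# Hypothetical scores: 1 of 4 points before the task, 3 of 4 after.
example_gain = normalised_learning_gain(pre=1, post=3, max_score=4)
print(example_gain)
```

A learner who scores lower on the post-test receives a negative gain under this definition.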


Dialogue transcripts. An example snippet from a dialogue 
with high collaboration score is provided in Table 1. The 
transcripts occasionally include some utterances from the 
facilitator, for whom we are not interested in measuring any 
alignment effects. The facilitator typically speaks most 
at the beginning of the dialogue, so we remove all initial utterances 
(including those of participants) in which facilitator 
and participant turns interleave. For the remainder of the dialogue, facilitator 
utterances are simply removed. Punctuation, although 
added at the annotator’s discretion, is retained, as it provides 
valuable information about the pace of the language 
used, and indicates the fragmented nature of these dialogues. 
The transcripts were tokenised before analysis using the NLTK 
Python package². 
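
A minimal sketch of this cleaning step is given below, under several assumptions that the text does not specify: the speaker label `Facilitator` is hypothetical, the opening block is taken to end at the first run of three consecutive participant turns, and a simple regular expression stands in for the NLTK tokeniser so that the example has no dependencies.

```python
import re

FACILITATOR = "Facilitator"  # hypothetical speaker label


def tokenise(text):
    # Stand-in for an NLTK tokeniser: words (optionally with an apostrophe
    # part) or single punctuation marks, lower-cased.
    return re.findall(r"\w+(?:'\w+)?|[^\w\s]", text.lower())


def clean_transcript(turns, min_participant_run=3):
    """Drop the interleaved opening block, then all facilitator turns.

    turns: list of (speaker, text) pairs in dialogue order.
    The opening block is assumed to end where the first run of
    `min_participant_run` consecutive participant turns begins.
    """
    run = 0
    start = len(turns)  # if no such run exists, nothing is kept
    for i, (speaker, _) in enumerate(turns):
        run = run + 1 if speaker != FACILITATOR else 0
        if run == min_participant_run:
            start = i - min_participant_run + 1
            break
    return [(s, tokenise(t)) for s, t in turns[start:] if s != FACILITATOR]


example_turns = [
    ("Facilitator", "ok , start"),
    ("Left", "ready ?"),
    ("Facilitator", "go"),
    ("Left", "turn left ."),
    ("Right", "another if ?"),
    ("Left", "yes"),
    ("Facilitator", "five minutes"),
    ("Right", "ok"),
]
cleaned = clean_transcript(example_turns)
print(cleaned)
```

Punctuation tokens are kept, in line with the decision above to retain punctuation in the transcripts.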


Figure 1: Example student movement pattern over time 
(left) and student geometries (right). 


Movement Data. Student movement was captured using 
motion sensors from Microsoft Kinect, returning a series of 
geometries representing the students’ body positions in space. 
An example of two students sitting at the table (which obscures 
the lower half of their bodies) can be seen in Figure 1. 
We only include data from the time period where the stu- 
dents were performing the collaborative activity, which cor- 
responds to the transcript data of 30 minutes per session. 
¹ Detailed descriptions of these measures can be found 
in [25]. 
² NLTK [21] Python package: http://www.nltk.org 

³ Our choice of 30-second slices was in part informed by previous 
work [38], and through qualitatively examining the 


In terms of pre-processing choices, our interest in the movement 
data is where the students mimic the gestures of their 
partner independent of their position relative to the camera 
or one another. This allows us to abstract from the postural 
shapes, relative size, and dominant hand of the participant’s 
gestural patterns. We thus use averaged 30-second 
time slices³ of the movement data (a measure of between-frame 
positional difference). We further process this data 
to account for the differences in movement which the experimental 
setup introduces: we apply standardisation⁴ to 
each participant signal so that two signals of different 
means and standard deviations can be compared on the 
same axis. This grants us a measure of variance similarity, 
which better captures the elements of beat gesture patterns 
separate from absolute movement differences, as we know 
that students consistently display different mean movement 
levels across dyads. 
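
The two pre-processing steps (averaging into 30-second slices, then standardising each participant signal) can be sketched as follows. This is an illustration under assumptions: the input is taken to be a list of (timestamp in seconds, movement value) samples, empty slices are simply skipped, and the population standard deviation is used.

```python
from statistics import mean, pstdev


def slice_means(samples, slice_seconds=30.0):
    """Average movement values within consecutive fixed-length time slices."""
    buckets = {}
    for timestamp, value in samples:
        buckets.setdefault(int(timestamp // slice_seconds), []).append(value)
    # Slices with no samples are skipped (an assumption of this sketch).
    return [mean(buckets[k]) for k in sorted(buckets)]


def z_normalise(series):
    """Rescale a series to zero mean and unit standard deviation."""
    mu, sigma = mean(series), pstdev(series)
    if sigma == 0:
        return [0.0 for _ in series]  # constant signal: no variation to scale
    return [(x - mu) / sigma for x in series]


# Hypothetical samples: (seconds since task start, between-frame movement).
samples = [(0, 1.0), (10, 3.0), (30, 4.0), (45, 6.0), (60, 8.0)]
sliced = slice_means(samples)
standardised = z_normalise(sliced)
print(sliced, standardised)
```

After standardisation, two participants with different overall activity levels produce signals on the same scale, so only the shape of their movement over time is compared.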


3.1 Computing Lexical Alignment 

We operationalise linguistic alignment in this work at the 
lexical (word) level, derived from the dialogue transcripts, 
extracting shared expressions, which we define as any sequence 
of tokens which contains at least one word (e.g. single 
punctuation marks are excluded). The automatic extraction 
of shared expressions per dialogue is an instance of the 
longest common sub-sequence problem [20, 3]. For each dia- 
logue, we extract the inventories of shared expressions using 
the method proposed by Duplessis et al. [14]. For each of 
the two dialogue-specific inventories of shared constructions, 
we compute the following measures: 


• Expression Variety (EV): The lexical diversity of the 
expression vocabulary. 


• Expression Repetition (ER): The ratio of produced tokens 
belonging to an instance of an established expression. 


• Vocabulary overlap (VO): Captures the richness of the 
shared vocabulary, the ratio of shared vocabulary present 
in the dialogue between participants: 


$$VO = \frac{\#(\mathrm{words}_{\mathrm{speaker1}} \cap \mathrm{words}_{\mathrm{speaker2}})}{\#(\mathrm{words}_{\mathrm{speaker1}} \cup \mathrm{words}_{\mathrm{speaker2}})}$$


Individuals repeat and introduce expressions at different rates 
within dyads, thus we additionally calculate dyad level mea- 
sures to capture the symmetry between interlocutors. 


• Expression Initiator (IE) Difference: Difference in % 
of shared constructions introduced by each dialogue 
participant. Initiator describes the dialogue participant 
to first use a subsequently shared and repeated 
construction. 


$$\lvert IE(\text{speaker one}) - IE(\text{speaker two}) \rvert$$


• Expression Repetition Difference: The difference in 
proportion of an individual speaker’s utterances which 
contain a usage of a shared construction: 


$$\lvert ER(\text{speaker one}) - ER(\text{speaker two}) \rvert$$


These measures capture the between speaker repetition within 
dialogue, which we use as a proxy for measuring the coor- 
dination or alignment between the speakers. An example of 
expression repetition can be found in Table 1. 
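
To make these measures concrete, the sketch below computes vocabulary overlap and an initiator difference on a toy dialogue. It is a deliberate simplification and not the extraction method of [14]: shared expressions are approximated here as n-grams (up to a hypothetical maximum length) that contain at least one word and are produced by both speakers, and all function names are assumptions.

```python
import re


def is_word(token):
    return re.search(r"\w", token) is not None


def vocabulary_overlap(tokens_a, tokens_b):
    """Ratio of shared word types to all word types (intersection / union)."""
    words_a = {t for t in tokens_a if is_word(t)}
    words_b = {t for t in tokens_b if is_word(t)}
    union = words_a | words_b
    return len(words_a & words_b) / len(union) if union else 0.0


def shared_expressions(turns, max_len=4):
    """Map each n-gram used by both speakers to the speaker who used it first.

    turns: list of (speaker, tokens) pairs in dialogue order.
    """
    first_user, shared = {}, {}
    for speaker, tokens in turns:
        for n in range(1, max_len + 1):
            for i in range(len(tokens) - n + 1):
                gram = tuple(tokens[i:i + n])
                if not any(is_word(t) for t in gram):
                    continue  # e.g. a single punctuation mark
                if gram not in first_user:
                    first_user[gram] = speaker
                elif first_user[gram] != speaker:
                    shared.setdefault(gram, first_user[gram])
    return shared


def initiator_difference(shared, speaker_one, speaker_two):
    """Absolute difference in the share of expressions each speaker initiated."""
    if not shared:
        return 0.0
    n_one = sum(1 for s in shared.values() if s == speaker_one)
    n_two = sum(1 for s in shared.values() if s == speaker_two)
    return abs(n_one - n_two) / len(shared)


toy_turns = [
    ("Left", ["then", "turn", "left", "."]),
    ("Right", ["turn", "left", "?"]),
    ("Left", ["another", "if"]),
    ("Right", ["another", "if", "statement", "?"]),
]
left_tokens = [t for s, toks in toy_turns if s == "Left" for t in toks]
right_tokens = [t for s, toks in toy_turns if s == "Right" for t in toks]
vo = vocabulary_overlap(left_tokens, right_tokens)
shared = shared_expressions(toy_turns)
ie_diff = initiator_difference(shared, "Left", "Right")
print(vo, sorted(shared), ie_diff)
```

In this toy dialogue every shared expression is first used by the left speaker, so the initiator difference takes its maximum value of 1.0.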


data at 5-second and 30-second slices: 30 seconds seems to 
be sufficient to capture interesting aspects of finer-grained 
hand movement, but not so fine as to render average total 
movement data meaningless. 

⁴ Standardising a time series dataset involves re-scaling the 
distribution of values to zero mean and unit standard deviation; it is also known as Z-normalisation. 


3.2 Computing Gestural Synchrony 

We use Dynamic Time Warping [33] as the measure of sim- 
ilarity between partner movement patterns as it has been 
found to be a consistently robust measure of time series 
similarity [2], which although introduced as a method for 
analysing speech signal [33], has been employed successfully 
in the field of gesture recognition [1]. DTW is a technique to 
find the optimal alignment between two time series, through 
the stretching or compression of either series along its time 
axis. This warping can be used to find corresponding regions 
between two time series, and to serve as a distance measure. 
This measure of distance allows slightly 
out-of-sync movement patterns, or those of slightly different 
phase length, to be deemed more similar than their difference 
in slope would suggest. Our main motivation for choosing 
DTW over other metrics demonstrated to be suitable for this 
task, such as [38, 28], is our hope to capture the similarity of 
slightly asynchronous movement, which, if movement 
and linguistic alignment are indeed linked, should follow a similar 
mixed turn-taking behaviour to dialogue [41]. As well as 
providing a robust distance measure between two time se- 
ries, DTW returns a warping path which describes in what 
direction the time series needs to be moved in order to best 
align: in other words, the warping path can provide useful 
information about leader and follower dynamics which we 
exploit in our analysis. 
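
A textbook dynamic-programming form of DTW is sketched below to illustrate the distance and the warping path. It is not the implementation used in this study; the absolute difference is assumed as the local cost, and `mean_offset` is a hypothetical summary of the path that is shown only to indicate how lead and lag can be read from it (it is not the diffLF metric).

```python
def dtw(a, b):
    """Return (distance, path) for two non-empty numeric sequences.

    path is a list of (i, j) index pairs aligning a[i] with b[j].
    """
    if not a or not b:
        raise ValueError("both series must be non-empty")
    n, m = len(a), len(b)
    inf = float("inf")
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            local = abs(a[i - 1] - b[j - 1])
            cost[i][j] = local + min(cost[i - 1][j],      # stretch b
                                     cost[i][j - 1],      # stretch a
                                     cost[i - 1][j - 1])  # step both
    # Backtrack from the end to recover one optimal warping path.
    path = []
    i, j = n, m
    while i > 0 and j > 0:
        path.append((i - 1, j - 1))
        _, i, j = min((cost[i - 1][j - 1], i - 1, j - 1),
                      (cost[i - 1][j], i - 1, j),
                      (cost[i][j - 1], i, j - 1))
    path.reverse()
    return cost[n][m], path


def mean_offset(path):
    """Hypothetical lead/lag summary: mean of (j - i) along the path."""
    return sum(j - i for i, j in path) / len(path)


# Series b repeats the shape of a one step later.
a = [0, 1, 2, 1, 0]
b = [0, 0, 1, 2, 1, 0]
distance, path = dtw(a, b)
print(distance, path, mean_offset(path))
```

Because the second series is only a delayed copy of the first, the DTW distance is zero even though a point-by-point comparison would report a difference, and the positive mean offset reflects that the second series lags behind the first.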


Inspired by common measures of gestural synchrony/body 
language, we investigate the following dyadic behavioural 
properties: 


• Movement Difference (Mdiff): Global mean movement 
difference in a pair, calculated as the absolute value of 
the difference between the means of each participant. 


• Movement Synchrony (dtw_dist): The synchrony between 
pairs in terms of their movement patterns, as 
measured by the Dynamic Time Warping (DTW) [33] 
distance. 


• Leader Follower dynamics (diffLF): The directionality 
of the alignment of the movement between the pair. 
This metric is derived from the DTW path. 


Additionally, to measure whether these similarities become 
more pronounced as the session progresses, we divide the ses- 
sion in half by timestamp, and compute the measures per 
half. We use the difference between dialogue halves (second 
- first) as a measure to capture convergence. This results 
in three additional measures corresponding to the synchrony 
measures above: Mdiff_change, dtw_dist_change and 
diffLF_change. The aspects of the body geometries which 
we focus on consist of the points for Head, Hands, Shoul- 
ders and Total (average) movement. For hand and shoulder 
measures, these are defined as an average of the movement 
in the right and left points for each aspect. 
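
The convergence measures (second half minus first half) can be sketched generically. The example below uses Mdiff as the measure; the function names, the sample format of (timestamp, value) pairs and the choice to split at the temporal midpoint are assumptions of this sketch.

```python
from statistics import mean


def movement_difference(series_a, series_b):
    """Mdiff: absolute difference between the two participants' mean movement."""
    return abs(mean(series_a) - mean(series_b))


def split_halves(samples):
    """Split (timestamp, value) samples at the temporal midpoint of the session."""
    times = [t for t, _ in samples]
    midpoint = (min(times) + max(times)) / 2
    first = [v for t, v in samples if t < midpoint]
    second = [v for t, v in samples if t >= midpoint]
    return first, second


def change_over_session(samples_a, samples_b, measure):
    """Measure on the second half minus the same measure on the first half."""
    first_a, second_a = split_halves(samples_a)
    first_b, second_b = split_halves(samples_b)
    return measure(second_a, second_b) - measure(first_a, first_b)


# Hypothetical dyad whose movement levels become identical in the second half.
student_a = [(0, 1), (30, 1), (60, 2), (90, 2)]
student_b = [(0, 3), (30, 3), (60, 2), (90, 2)]
mdiff_change = change_over_session(student_a, student_b, movement_difference)
print(mdiff_change)
```

A negative value means the pair became more similar over the session; any other dyadic measure, such as a DTW distance function, could be passed in place of `movement_difference`.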


3.3 Measure Validation - Baseline 

A certain level of similarity between speakers will exist independently 
of their adapting to one another. Because they 
perform the same task, vocabulary will necessarily be 
constrained by topic, and consistent across pairings. Due to 
the experiment configuration, task-specific gesture patterns 
such as moving the robot or interacting with the computer, 
as well as the side on which their interlocutor sits, will also 
lead to movement similarities, e.g. turning to the right vs 
the left to speak. We thus create baselines for both dialogue 
and movement data which demonstrate the levels of similarity 
inherent to the task setup. For the dialogue baseline, we 
create a scrambled version of the corpus by retaining the utterances 
of one of the students and interleaving them with utterances 
randomly drawn from another pair, per speaker. For 
a partner-specific movement baseline, the movement data 
from each student is randomly paired with the data from 
another student on the same side relative to them as their 
partner was (i.e. for each participant on the right-hand side, 
replace their partner with a participant from the left-hand 
side). To further check task-specific effects of the seating 
configuration, we pair students sitting in the same position 
with one another, in order to confirm that the role does not 
show more similarities than the original student pairings. 
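
One way to build the scrambled dialogue baseline is sketched below. The details are assumptions: dialogues are held in a dictionary keyed by a hypothetical dyad identifier, a single donor dyad is chosen at random, and replacement utterances are drawn with replacement from all of the donor dyad's turns.

```python
import random


def scrambled_dialogue(dialogues, dyad_id, keep_speaker, rng):
    """Keep one speaker's turns; replace the partner's with another dyad's.

    dialogues: {dyad_id: [(speaker, tokens), ...]}
    rng: a random.Random instance, passed in so results are reproducible.
    """
    donor_id = rng.choice([d for d in dialogues if d != dyad_id])
    donor_pool = [tokens for _, tokens in dialogues[donor_id]]
    scrambled = []
    for speaker, tokens in dialogues[dyad_id]:
        if speaker == keep_speaker:
            scrambled.append((speaker, tokens))
        else:
            # Turn position and speaker label are kept; only the content changes.
            scrambled.append((speaker, rng.choice(donor_pool)))
    return scrambled


toy_corpus = {
    "d1": [("Left", ["a"]), ("Right", ["b"]), ("Left", ["c"])],
    "d2": [("Left", ["x"]), ("Right", ["y"])],
    "d3": [("Left", ["p"]), ("Right", ["q"])],
}
baseline = scrambled_dialogue(toy_corpus, "d1", "Left", random.Random(0))
print(baseline)
```

Because the turn structure is preserved, the alignment measures can be computed on the scrambled dialogue in the same way as on the original.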


4. RESULTS 
4.1 Analysis 1: Measuring Convergence 


Linguistic. We first hypothesise that there will be significant 
inter dyad repetition beyond what the task demands by 
chance, since alignment has been linked to both learning 
and collaboration, and this same measure has found significant 
alignment levels in negotiation [14], as well as in second 
language tutoring dialogue [37], although this dialogue 
setting is different since both speakers are learners. To 
explore whether alignment is greater than by chance, we 
compare the original dialogues to the shuffled baseline 
in the same manner as [14]. The expression variety is significantly 
higher for the original (mean=0.118, std=0.023) 
than for the shuffled dialogues (mean=0.110, std=0.015). 
Statistical difference is checked by a Wilcoxon rank sum test 
(U = 1141, p = 0.03 < 0.05, r = 0.21)⁵. This indicates that 
there exists a richer and more dyad specific expression lexi- 
con. The expression repetition is also significantly higher for 
the original (mean=0.509, std=0.123) than for the shuffled 
dialogues (mean=0.487, std=0.109) (U = 1079.5, p = 0.014 
< 0.05, r = 0.25). This means that the level of repetition 
between student dyads is not simply incidental, and can be 
attributed to alignment or routinisation effects. Finally, as 
a measure of how task specific the vocabulary is, we find the 
vocabulary overlap between speakers significantly higher in 
the original (mean=0.509, std=0.123) than in the shuffled 
dialogues (mean=0.487, std=0.109) (U = 856, p = 0.0002 
< 0.001, r = 0.41). This difference demonstrates that stu- 
dents share a much richer vocabulary than would happen 
by chance in performing this task. Overall, these results 
show that the collaborative student dialogues constitute a 
richer expression lexicon than they would by chance, indi- 
cating that the students align to one another, resulting in 
their language converging [10, 27]. 
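
For readers unfamiliar with the rank sum test, the sketch below shows the U statistic, a normal approximation to the two-sided p-value, and an effect size. It is illustrative only: it omits tie and continuity corrections, and the effect size formula r = |z| / sqrt(N) is a common convention assumed here rather than one confirmed by the text.

```python
from math import sqrt
from statistics import NormalDist


def rank_sum_test(x, y):
    """Wilcoxon rank sum (Mann-Whitney U) test with a normal approximation.

    Returns (U for sample x, z, two-sided p, effect size r).
    """
    pooled = sorted([(v, 0) for v in x] + [(v, 1) for v in y])
    ranks = [0.0] * len(pooled)
    i = 0
    while i < len(pooled):
        j = i
        while j + 1 < len(pooled) and pooled[j + 1][0] == pooled[i][0]:
            j += 1
        average_rank = (i + j) / 2 + 1  # tied values share their average rank
        for k in range(i, j + 1):
            ranks[k] = average_rank
        i = j + 1
    n1, n2 = len(x), len(y)
    rank_sum_x = sum(r for r, (_, group) in zip(ranks, pooled) if group == 0)
    u = rank_sum_x - n1 * (n1 + 1) / 2
    mu = n1 * n2 / 2
    sigma = sqrt(n1 * n2 * (n1 + n2 + 1) / 12)  # no tie correction
    z = (u - mu) / sigma
    p = 2 * (1 - NormalDist().cdf(abs(z)))
    r = abs(z) / sqrt(n1 + n2)
    return u, z, p, r


# Hypothetical values of one measure for original and shuffled dialogues.
u, z, p, r = rank_sum_test([1, 2, 3], [4, 5, 6])
print(u, z, p, r)
```

For real analyses an established implementation, with exact p-values for small samples and tie handling, would be preferable to this approximation.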


Gestural. We hypothesise that our measures of movement 
matching will result in higher partner-specific synchrony than 
in our baseline, i.e. lower distance between collaborating pairs 
than between any student and a different partner performing the same task. 
We find that within-dyad DTW 
similarity is significantly higher than in both the partner 
substitution baseline (t = -2.0401, p < 0.05) and the within-side 
baseline (t = -2.0397, p < 0.05). This indicates that DTW 
is a useful measure of movement similarity in this setting: 
it is suitable for capturing partner-specific 
effects in these movement patterns, and allows us 
to compare similarities between dyads. 


⁵ Following [14], for each test we report the test statistic 
(U/W), the p-value (p) and the effect size (r). 


4.2 Analysis 2: Convergence Correlated with 
Collaboration and Learning 


Linguistic alignment. We hypothesise that alignment between 
students will correlate with learning, since in other 
tutoring settings it has been found to correlate with both 
learning gains [42] and linguistic ability [36]. Additionally, 
global language features of collaborative dialogue have also 
been found to correlate with learning gains [29]. To answer RQ2, 
we compare our linguistic alignment metrics with learning 
and collaboration scores. We find support for our hypothesis 
that collaboration and alignment correlate: using 
Pearson’s r correlation coefficient, ER (r = 0.680, p < 0.001), 
EV (r = 0.622, p < 0.001), and VO (r = 0.663, p < 0.001) 
all correlate with collaboration. This shows that inter-partner 
repetition is important, and that students will converge 
even to less common language in a collaborative setting. 
We also find our between-learner measures significantly 
correlated with collaboration, IE_diff (r = -0.570, p 
< 0.01) and ER_diff (r = -0.54, p < 0.001), meaning that 
smaller differences between learners in the initiation and repetition 
of shared expressions correlate with how well they collaborate. 
EV (r = 0.442, p = 0.026), ER (r = 0.442, p = 0.006) 
and VO (r = 0.349, p = 0.034) also correlate with learning, 
although to a lesser degree. One intuition as to why is that 
other studies reporting a correlation between alignment and learning 
analyse dialogues conducted in an asymmetric tutoring 
setting, where adopting the language of the teacher is a sensible 
learning strategy as it is assumed that this language is 
correct. In our case, since these dialogues are between peer 
learners, the learning outcome is somewhat dependent on 
the rapport within the dyad, and on the information aligned to 
being correct. In other words, in some cases, the learners 
may be converging to a shared mental representation, but it 
may not be the correct one. In keeping with this observation 
of dyad rapport and equality, IE_diff (r = -0.487, p = 0.002) 
and ER_diff (r = -0.515, p = 0.001) both show strongly that 
more equal contributions from the students, in terms of repeating 
one another and of introducing words upon which 
to align, correlate with learning. 
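
The correlation coefficients reported here follow the standard definition of Pearson's r, which can be sketched as below together with the t statistic that is conventionally used to test it. The data are hypothetical, and converting the t statistic into a p-value requires the t distribution with n - 2 degrees of freedom, which the Python standard library does not provide.

```python
from math import sqrt


def pearson_r(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x, mean_y = sum(x) / len(x), sum(y) / len(y)
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return sxy / sqrt(sxx * syy)


def t_statistic(r, n):
    """t = r * sqrt((n - 2) / (1 - r^2)), with n - 2 degrees of freedom."""
    remainder = 1 - r * r
    if remainder <= 0:
        return float("inf") if r > 0 else float("-inf")
    return r * sqrt((n - 2) / remainder)


# Hypothetical per-dyad values of an alignment measure and an outcome score.
alignment = [1, 2, 3, 4, 5]
outcome = [2, 1, 4, 3, 5]
r = pearson_r(alignment, outcome)
t = t_statistic(r, len(alignment))
print(r, t)
```

A negative r for a difference measure such as IE_diff therefore means that dyads with smaller differences tend to have higher outcome scores.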


Movement Synchrony and Convergence. We hypothe- 
sise that movement synchrony and convergence, as defined 
by DTW distance and its change over the interaction, will 
provide a robust measure of synchrony which will better dis- 
tinguish between dyads with differing activity levels, which 
in turn should correlate with collaboration and learning, in 
keeping with previous results with other measures in task 
based dialogue [22, 32, 28]. We compare our movement similarity 
metric with learning and collaboration scores. Overall, 
as can be seen from Figure 2, we find average movement synchrony 
to correlate with both collaboration and learning. In 
terms of learning, DTW_dist measures for Head (r = -0.611, 
p < 0.001), Hands (r = -0.561, p = 0.002), Shoulders (r 
= -0.609, p < 0.001), and Mtotal (r = -0.611, p = 0.006) 
all significantly correlate with learning to a strong degree. 
Convergence between dyads in terms of Mtotal (r = -0.519, p 
= 0.006) and Hands (r = -0.467, p = 0.014) also significantly 
correlates with learning. Finally, dyads becoming more dissimilar 
in terms of hand movement (having a stronger leader 
follower dynamic) also significantly correlates with learning: 
Hands_diffLF (r = 0.471, p = 0.013). With collaboration, 
Head, Hands, Shoulders and Mtotal all significantly (p < 
0.05) correlate, as does the diffLF for Hands (r = -0.42, p = 
0.029) and Mtotal (r = -0.519, p = 0.006). Overall, the results 
are intuitive: we find that more synchronous pairs, as measured 
by DTW distance, significantly correlate with collaboration 
quality. We also find that convergence between dyads 
is present (the negative correlations of dtw_dist change and 
Mdiff change show greater similarity between learners over 
time) and correlates with learning quite strongly for some 
movement metrics. We also see a positive correlation for 
the diffLF change features, particularly with learning, 
indicating that while convergence of behaviour is important, 
some aspects of turn taking and initiative are separate from 
this. 


Figure 2: Gestural synchrony and convergence vs. Collaboration 
and Learning measures: correlation with Pearson’s r. 
Significant p values reported in the text. 


4.3 Analysis 3: Comparing Linguistic and Gestural Convergence 

Comparing linguistic and gestural convergence, we hypothesise 
that these aspects of communication will correlate with one 
another, as previous literature suggests [17, 4, 22]. To answer 
research question RQ3, we contrast the modalities 
themselves, comparing gestural 
and linguistic coordination. We hypothesise that movement 
synchrony and linguistic alignment will correlate strongly, 
due to the process of speakers’ alignment of shared mental 
representations taking place across various linguistic and 
paralinguistic levels [6, 27]: if dyads align at the lexical level, 
it is likely that the same process leading to this alignment 
will also affect the gestural level [27]. The DTW path allows 
us to capture the relationship between slightly offset 
movement patterns of beat gesture mimicry [24], and the 
case where linguistic alignment in turn-taking utterances is 
low, which can lead to more synchronised patterns of movement [32]. 
Previous work has also found evidence of the coordination 
of lexical alignment and gestural behaviour in a 
multimodal context [35, 9, 22]; we thus hypothesise that collaborative 
problem solving dialogues will also demonstrate 
this. We find significant correlation between linguistic alignment 
measures and movement synchrony across all movement 
patterns (Figure 3), with strongest effects for head (r 
= -0.735, p-value: 1.225) and hands (r = -0.706, p-value: 
3.811), providing support that our hypothesis about lexical 
patterns influencing gestural alignment at the level of beat 
gesture and head nodding may be true for this setting. 
In terms of convergence, there is a significant correlation 
between change in hand movement and expression variety (r 
= -0.387, p-value: 0.0458). Interestingly, the difference between 
speakers’ linguistic (Diff_IE, Diff_ER) patterns positively 
correlates with both their difference in movement synchrony 
and speaker divergence, indicating that asymmetric 
relationships between students are visible across modalities 
of communication. In keeping with our hypothesis, we find a 
strong negative correlation of between-speaker difference 
in movement and divergence (i.e. a strong correlation of 
similarity and convergence) with the linguistic measures 
of convergence, providing supporting evidence for the 
hypothesis that linguistic and gestural convergence are part 
of the same underlying communicative process. 


Figure 3: Movement measures vs. linguistic alignment measures. 
Pearson’s r correlation coefficient. 


4.4 Analysis 4: Predicting Learning and Collaboration 

Finally, in order to answer research question RQ4 and to find 
combined interaction effects of the various inter-modality 
measures, we fit a series of mixed effects regression models⁶. 
We hypothesise that while each measure individually 
is strong, and although the measures themselves are correlated, 
each modality will provide its own distinct information 
contributing to learning and collaboration aspects. We 
perform backward stepwise model selection to select the 
best predictors, first fitting each model with all relevant 
variables and stopping only when all remaining terms have 
significance p < 0.05. Although RMSE and r² values of 
predicted data are highest when combining all factors, we 
wished to discover the minimal significant descriptive set 
of criteria in order to find more interaction in our results. 


⁶ To fit the data and perform the statistical tests within this 
paper, we use the Statsmodels Python package [34]. 


Table 2: Mixed effects regression model multimodal results 

Learning (RMSE: 5.48, r²: 0.90) 
Formula: Learning ~ EV:ER + Diff_IE * Diff_IE + Handmean_movement_30_diffLFChange + 
Shouldermean_movement_30_diffLFChange 

Collaboration (RMSE: 0.15, r²: 0.999) 
Formula: Collaboration ~ EV * ER + Diff_IE * Diff_ER + Head_movement_30_dtw_dist 
+ Handmean_movement_30_dtw_dist + Shouldermean_movement_30_dtw_dist 
+ movement_total_30_dtw_dist + Head_movement_30_diffLFChange 
+ Shouldermean_movement_30_diffLFChange + movement_total_30_diffLFChange 
+ Head_movement_30_dtw_dist_change + Handmean_movement_30_dtw_dist_change 
+ Shouldermean_movement_30_dtw_dist_change + movement_total_30_dtw_dist_change 
+ Head_movement_30_diffLF + Handmean_movement_30_diffLF 
+ Shouldermean_movement_30_diffLF + movement_total_30_diffLF 


Table 2 shows the minimal significant set of linguistic and 
gestural factors and their interaction in terms of their ability 
to predict the dependent variables of learning and collabo- 
ration. Each modality separately can form a good predic- 
tor of both alignment and learning in this setting. How- 
ever, this analysis offers strong support for the multimodal 
modelling of collaborative problem solving, proving that al- 
though correlating with one another, both linguistic and 
gestural aspects have an independent role to play when pre- 
dicting learning and collaboration. Broadly, from Table 2, 
the gestural features chosen indicate that both the measures 
of synchrony, and those for convergence (_change features) 
play a role in prediction. It is also clear that predicting
collaboration in this case is easier than predicting learning. This may
be influenced by ceiling effects, or by the ease of the pre-test being a
limiting factor; e.g., a learner with a very good pre-test score will
have hit a ceiling by the end of the session.


5. DISCUSSION & CONCLUSION 


We find significant levels of both linguistic and movement 
synchrony in our data (RQ1). In answer to RQ2, we find 
our measures of linguistic and gestural alignment correlate 
with collaboration. In terms of learning, we find that the
difference in repetition between students negatively correlates with
learning, and that movement synchrony in general shows strong
correlation with learning. In terms of RQ3, we find significant strong
to medium effects when correlating measures of ER and dtw_dist with
one another. This contributes
to a growing body of evidence in support of theories of in- 
teractive alignment emerging across communicative modal- 
ities. Finally, via regression analysis combining our metrics 
(RQ4), we find that although separately powerful, a com- 
bination of modalities can best explain collaboration and 
learning outcomes. Our findings show the importance of 
analysing between speaker dynamics to capture nuances of 
learning. Our findings also suggest the use of a multimodal 
approach for the best understanding of these interactions. 
We also contribute interesting new evidence adding to work 
exploring the relationship between linguistic alignment and 
gestural and movement similarity. Our findings, while lim- 
ited to a small specific setting, contribute evidence to sup- 
port existing theories of human cognition and alignment. 
ABSTRACT


Over the years, researchers have studied novice programming
behaviors when doing assignments and projects to identify
struggling students. Many of these efforts have focused on using
student programming and interaction features to predict student
success at a course level. While these methods
are effective at early detection of struggling students in the 
long run, there is also a need to identify struggling students 
during an assignment so that we can provide proactive in- 
tervention to prevent unproductive struggle and frustration. 
This work proposes a data-driven method that uses student 
trace logs to identify struggling moments during a program- 
ming assignment and determine the appropriate time for an 
intervention. We define a struggling moment as not achiev- 
ing significant progress within a certain amount of time, rel- 
ative to the amount of progress made and time taken in a 
sample student dataset. The paper describes how we de- 
termine significant progress and a time threshold for strug- 
gling students. We validated our algorithm’s classification 
of struggling and progressing moments with experts rating 
whether they believe an intervention is needed for a sample 
of 20% of the dataset. The result shows that our automatic 
struggle detection method can accurately detect struggling 
students with less than 2 minutes of work with over 77% 
estimated accuracy. Our work contributes significantly to 
building proactive immediate support features for intelligent 
programming environments. 
1. INTRODUCTION AND BACKGROUND


Computer programming is a challenging topic for novices. 
As a result, there has been an increasing interest in the early 
detection of struggling students for proactive intervention 
to reduce dropout rates and improve student learning in 
programming courses. 


Researchers have collected and studied student problem- 
solving trace data during programming assignments from 
various perspectives. Most of this effort has focused on an- 
alyzing student programming actions to reveal their behav- 
ioral traits and programming patterns. This research typi- 
cally uses manual inspection of the trace logs [4, 16, 9] or ap- 
plies machine learning models [7, 3, 1] to categorize student 
behaviors and discuss the characteristics of each category 
and their potential impact on or relationship with student 
performance. The contribution of these studies usually lies 
in helping educators understand novice learning processes 
and promote positive behaviors. Another strand of research 
that analyzed student trace data focuses on student compila- 
tion behaviors [11, 19, 2] and syntax errors and bugs [21, 10] 
in student traces. This research used statistical inferences, 
machine learning methods, and visualization techniques to 
explore the relationship between specific patterns and stu- 
dent success and identify novice students’ common mistakes. 
Some of these patterns are helpful for identifying students 
struggling with certain concepts. 


However, we found that there has not been enough research
that uses student trace logs to model student progress and identify
struggling moments during programming assignments. One major
application of struggle detection is providing proactive hints in
intelligent tutoring systems (ITS), as previous research has shown
that novices, especially those with low prior knowledge or
experience, may not request on-demand hints even when they need
them [18]. Prior work identifying struggling students in traces has
generally focused on early detection of struggling students as
determined by the assignment outcome [12, 5, 6], and is not suitable
for identifying struggle during programming assignments.


In this work, we propose a novel data-driven approach to 
identify struggling moments from student trace data. We describe
how we adopted the SourceCheck algorithm to model
student progress during open-ended programming assign- 
ments and how to identify progressing moments and strug- 
gling moments from student progress. We used human ex- 
perts to evaluate the progressing moments and struggling 
moments classified by our method. Our initial result shows 
that human experts had a decent agreement with our al- 
gorithm and that it is possible to determine if a student 
is struggling during programming assignments within two 
minutes. To the best of our knowledge, this is the first at- 
tempt to use a data-driven progress measurement to identify 
struggling students. We also discuss how our method may 
generalize to other domains that meet certain requirements. 


2. DETERMINE INTERVENTION TIMES 


In this work, we consider a student to be struggling during
problem-solving when they cannot make enough progress within a
typical amount of time. As such, to determine the proper time to
provide proactive intervention, we need to investigate: 1) how to
measure student progress while solving a programming assignment,
2) what constitutes significant progress, 3) how much time it
typically takes a student to make significant progress, and finally,
4) what the appropriate time threshold is beyond which we consider
the student to be struggling and in need of help. This section
describes how our algorithm models student progress and identifies
potential progressing and struggling moments through these four
steps.


2.1 Dataset 


We analyzed a dataset collected from an introductory pro- 
gramming course for non-computer science major students 
in Fall 2017. Students learned to program by doing a series 
of open-ended programming assignments adapted from the 
BJC curriculum [8] in a block-based programming environment
called Snap!.


We extended the Snap! environment to record all students’
programming actions into traces. Each trace, identified by a project
id, contains all actions a student performed during an assignment
(e.g., creating or deleting a block) with timestamps. There are two
types of actions in the traces: non-coding actions and coding
actions. Non-coding actions do not change the program scripts, for
example, searching for blocks and running the program. Coding
actions change the abstract syntax tree of program scripts, such as
creating variables, creating custom blocks, and reordering blocks.
Alongside every coding action, our Snap! environment also saves a
code snapshot after the action, allowing us to reconstruct the steps
a student took to build their final code and analyze their coding
progress.


In this work, we focused on analyzing two assignments, Squiral
and Guessing Game. Squiral is a homework assignment that asks
students to create a procedure that draws a square-like spiral with a
certain number of rotations specified by an input parameter. A
correct Squiral solution typically contains 7 to 14 lines of code.
Guessing Game (GG) is an in-class assignment that requires students
to create a game that greets the player by name, asks the player to
guess the secret number, and tells the player whether each guess was
too high or too low until they guess correctly. A typical Guessing
Game solution contains 14 to 18 lines of code. These two assignments
allow us to explore how well our method identifies students’
struggling moments in assignments with different time constraints.
Table 1 shows the descriptive statistics of the traces analyzed in
the two assignments. We preprocessed the traces to remove any idle
time of more than five minutes, during which the student did not
perform any action.
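The idle-time preprocessing can be sketched as follows. This is a minimal illustration, not the pipeline used in the study: the function name, the representation of timestamps as seconds, and the choice to drop long gaps entirely (rather than cap them) are all assumptions.

```python
IDLE_LIMIT_S = 5 * 60  # gaps longer than five minutes count as idle time


def cumulative_active_time(timestamps, idle_limit=IDLE_LIMIT_S):
    """Return the cumulative active seconds at each action of one trace.

    timestamps: action times in seconds, sorted in ascending order.
    Assumption: a gap longer than idle_limit contributes no active time
    at all; the text does not say whether such gaps are dropped or capped.
    """
    active = []
    total = 0.0
    previous = None
    for t in timestamps:
        if previous is not None:
            gap = t - previous
            if gap <= idle_limit:
                total += gap
        active.append(total)
        previous = t
    return active
```

For example, a trace with actions at 0, 60, 120, 1000, and 1030 seconds contains one 880-second gap, which is excluded, so the last action falls at 150 seconds of active time.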


Table 1: Descriptive statistics of the trace logs and grades of
the two assignments.

          Traces   Rows    Time on Task   Avg Grade
Squiral   45       25160   29.6m          9.8/12
GG        59       22744   30.5m          11.7/12


2.2 Define Progress 

The first step in identifying struggle is to measure student
progress in the assignment. Previous work used code compilation
results [11, 19, 2], students’ programming behavior patterns [7],
and feature completion [15, 13] to monitor
student progress in an assignment. While these criteria are 
reliable indicators of how many assignment requirements the 
students have met, they did not use the student traces’ full 
potential to identify struggling students at an action-level 
granularity during the assignment. 


We adopted the SourceCheck algorithm [17] to measure stu- 
dent progress during an assignment. SourceCheck was ini- 
tially designed as a hint generation algorithm to provide on- 
demand, next-step hints to help students move towards the 
closest correct solution to the student’s current code. When 
generating hints, SourceCheck first compares the
student’s current code snapshot with a list of correct solutions
(usually collected from past student data) and generates 
mapping costs from the student’s code snapshot to each of 
the correct solutions. These mapping cost values represent 
how similar the correct solutions are to the student code — 
the more similar a correct solution is to the student’s code, 
the lower the mapping cost is. SourceCheck picks the correct 
solution with the lowest mapping cost as the closest correct 
solution and generates next-step hints to move the student 
to that solution. 


We adapted the mapping cost into a similarity score by reversing
the mapping cost such that when a student moves closer to the
closest correct solution, the mapping cost decreases, and the
similarity score increases. One novel aspect of this paper is how we
use the similarity score to measure student progress in the two
assignments.¹ We calculate a similarity score for every snapshot in a
student trace using the SourceCheck algorithm. We define a
snapshot’s progress in an assignment as the similarity score
difference between the current snapshot and its previous snapshot.
As such, at a particular snapshot, we say a student is making
positive progress if the similarity score difference is positive and
negative progress if the similarity score difference is negative.
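The per-snapshot progress definition can be illustrated with a short sketch. The exact way the mapping cost is reversed is not specified in the text, so the conversion below (subtracting each cost from a reference maximum) is an assumption, as are the function names; only the differencing step follows the definition directly.

```python
def similarity_scores(mapping_costs, max_cost=None):
    """Convert mapping costs to similarity scores.

    Assumption: similarity = max_cost - cost, so a lower mapping cost
    gives a higher similarity. The paper only says the cost is reversed.
    """
    if not mapping_costs:
        return []
    if max_cost is None:
        max_cost = max(mapping_costs)
    return [max_cost - cost for cost in mapping_costs]


def snapshot_progress(similarities):
    """Progress of each snapshot: similarity difference from the previous one.

    The first snapshot has no predecessor, so its progress is set to 0.
    """
    if not similarities:
        return []
    diffs = [b - a for a, b in zip(similarities, similarities[1:])]
    return [0.0] + diffs
```

With mapping costs of 10, 8, 9, and 5, the similarity scores are 0, 2, 1, and 5, giving progress values of 0, +2, -1, and +4: positive, then negative, then positive progress.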


¹Note that we assume a student is moving towards the closest
correct solution at any given snapshot. Students do not
know the prior student solutions used by SourceCheck, and 
we have no ground truth to identify what strategy a student 
may be using for their assignments. 
We can visualize a student’s progress in an assignment by
plotting the similarity scores for each snapshot against the 
cumulative active time, shown in blue in Figure 1. The red 
line in Figure 1 represents what we call “absolute progress”, 
which we define in the next section. The blue dots in Figure 
1 represent the similarity score of every snapshot in a Squiral 
trace, and the blue line represents the progress (similarity 
change between consecutive snapshots) over time. Figure 1 
demonstrates that the student made steady positive progress 
in the first eight minutes. Then, the student had a reduced 
similarity score for about four minutes, tearing the code 
apart trying to complete an objective of the assignment. 
Afterward, the student continued to make rapid positive 
progress until 14 minutes, stopped progressing for around 
three minutes, and finally reached their final submission at 
around 17 minutes. 


[Figure 1 plot: Similarity Score (y-axis) against Cumulative Active Time in seconds (x-axis), with two series: similarity score and absolute progress.]


Figure 1: Similarity score (blue) and absolute progress (red) 
change over time in one student trace of Squiral. 


2.3 Determine Significant Progress 

Through inspecting multiple students’ traces and comparing 
them with their corresponding progress plot (e.g., Figure 1), 
we found that not all positive progress represents a signif- 
icant change to the program. Some minor similarity score 
increases were due to reordering code that does not change 
the code semantics but slightly reduces the mapping cost. 
This observation means that treating any similarity score
increase as progress may not be sufficient.
Thus, it is important to determine how much similarity score 
increase can be considered significant progress. 


To determine significant progress, we first define absolute
progress for a student $s_j$ at time $t_i$. We define
$S_{max}(s_j, t_i)$ to be the maximum similarity score achieved by
student $s_j$ in an assignment between time $t_0$ and $t_i$:
$S_{max}(s_j, t_i) = \max_{k=0}^{i} S(s_j, t_k)$. We then define
absolute progress as the difference in the maximum similarity scores
between $t_i$ and the previous snapshot time $t_{i-1}$:
$P_{absolute}(s_j, t_i) = \max(S_{max}(s_j, t_i) - S_{max}(s_j, t_{i-1}), 0)$.
To visualize absolute progress, we plot the highest similarity scores
achieved since the beginning of the trace ($S_{max}$), as shown in red
in Figure 1. The absolute progress is positive whenever $S_{max}$
increases between two consecutive snapshots.
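As an illustration of the two definitions above, the running maximum and the absolute progress of one trace could be computed as in the following sketch. The function names are hypothetical, and setting the first snapshot's absolute progress to 0 is an assumption, since it has no previous snapshot.

```python
from itertools import accumulate


def running_max(similarities):
    """S_max at each snapshot: highest similarity seen so far in the trace."""
    return list(accumulate(similarities, max))


def absolute_progress(similarities):
    """P_absolute at each snapshot: increase in S_max since the previous one.

    Assumption: the first snapshot is assigned an absolute progress of 0.
    """
    s_max = running_max(similarities)
    if not s_max:
        return []
    # S_max never decreases, so the outer max(..., 0) mirrors the formula
    # in the text without changing the result.
    steps = [max(cur - prev, 0) for prev, cur in zip(s_max, s_max[1:])]
    return [0] + steps
```

For similarity scores 0, 2, 1, 5, 5, 6, the running maximum is 0, 2, 2, 5, 5, 6 and the absolute progress is 0, 2, 0, 3, 0, 1. The drop from 2 to 1 produces no absolute progress, and neither does the recovery until the previous maximum is exceeded.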


We then calculated the absolute progress values from all the
student traces for each assignment, sorted them in increasing order,
and plotted all positive absolute progress values by percentile
(using the quantile function in R), as shown in Figure 2. We used the
25th percentile of absolute progress values as the threshold for
making significant progress. This choice was also used in another
work identifying struggling students in MOOC programming
assignments [20]. The intuition is that if a student’s absolute
progress is smaller than three-quarters of all the positive absolute
progress values, we consider that the student is not making enough
progress. Figure 2 shows that the significant progress threshold is
1.25 for Squiral and 1.5 for Guessing Game.
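A sketch of the threshold computation follows. It assumes linear interpolation between order statistics, which is the default (type 7) behaviour of the R quantile function; the paper does not state which quantile type was used, and the function names are illustrative.

```python
import math


def quantile(values, p):
    """Quantile with linear interpolation (R's default type 7 definition)."""
    xs = sorted(values)
    if not xs:
        raise ValueError("quantile() needs at least one value")
    h = (len(xs) - 1) * p
    lower = math.floor(h)
    upper = min(lower + 1, len(xs) - 1)
    return xs[lower] + (h - lower) * (xs[upper] - xs[lower])


def significant_progress_threshold(progress_by_trace, p=0.25):
    """25th percentile of all positive absolute progress values.

    progress_by_trace: one list of absolute progress values per trace.
    """
    positives = [v for trace in progress_by_trace for v in trace if v > 0]
    return quantile(positives, p)
```

Pooling the positive values 1, 2, 3, 4, and 5 from two toy traces gives a 25th percentile of 2, so any single step that raises the maximum similarity by less than 2 would not count as significant in that toy dataset.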
Figure 2: Positive absolute progress in all traces by percentile


2.4 Determine Typical Time 

Now that we have determined the significant progress for 
each assignment, the next step to identify struggling mo- 
ments is to extract the typical time for students to make 
significant progress. To do this, we first split all the traces 
into code chunks whenever the student makes significant 
progress—each code chunk contains multiple snapshots. This 
gave us 648 code chunks from Squiral and 1207 code chunks 
from Guessing Game. Then, we calculate the elapsed time 
between the first and the last snapshot in each code chunk 
and organize the elapsed times in ascending order. We plot 
the ordered elapsed times in Figure 3 where the y-axis is the 
elapsed time, and the x-axis is the percentile of the elapsed 
time distribution. The green line in Figure 3 marks the third 
quartile of time used to make significant progress. Note 
that the elapsed time to make significant progress grows al- 
most exponentially after the third quartile. Therefore, we 
chose to use the third quartile as the cutoff for the typical 
time to make significant progress. The third-quartile time 
(dashed green line) in Figure 3 intersects with the Squiral 
progress (solid blue line) at 105 seconds and intersects with 
the Guessing Game significant progress (solid red line) at 85 
seconds. We use these times as the typical time for the stu- 
dents to make significant progress in Squiral and Guessing 
Game. There are several dashed lines in Figure 3, which we
explain in Section 3.
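The chunking step and the typical-time cutoff can be sketched as below. Several details are assumptions rather than statements from the paper: a chunk is closed at the first snapshot whose absolute progress reaches the threshold, the next chunk opens at the following snapshot, and trailing snapshots that never reach significant progress are discarded. The third quartile uses the inclusive method of Python's statistics.quantiles, which matches linear interpolation.

```python
import statistics


def split_into_chunks(times, abs_progress, threshold):
    """Split one trace into code chunks ending at significant progress.

    times: cumulative active time (s) of each snapshot.
    abs_progress: absolute progress of each snapshot.
    Returns a list of (first_snapshot_time, last_snapshot_time) pairs.
    Assumption: snapshots after the last significant progress are dropped.
    """
    chunks = []
    start = None
    for t, progress in zip(times, abs_progress):
        if start is None:
            start = t
        if progress >= threshold:
            chunks.append((start, t))
            start = None
    return chunks


def elapsed_times(chunks):
    """Elapsed time between the first and last snapshot of each chunk."""
    return [end - start for start, end in chunks]


def typical_time(all_elapsed):
    """Third quartile of elapsed times across all chunks (needs >= 2 values)."""
    return statistics.quantiles(all_elapsed, n=4, method="inclusive")[2]
```

In this sketch, a trace with snapshots at 0, 10, 20, 30, 40, and 50 seconds and absolute progress 0, 0.5, 2, 0, 0, 3 yields two chunks under a threshold of 1.25, each with an elapsed time of 20 seconds.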


2.5 Determine Progressing and Struggling Moments
The last step of this process is to use the typical time to make
significant progress to identify progressing and struggling moments
for each assignment. To do this, we took all the code chunks
generated in the third step and divided them into two groups,
struggling moments and progressing moments. Recall that we define a
student as struggling if the student does not make enough progress
within a typical amount of time. Therefore, struggling moments are
defined to be code chunks that have an elapsed time greater than our
struggling time threshold (the 75th percentile of time for
significant progress), meaning that in this code chunk, even though
students spent a long time, they did not make significant progress.
Conversely, progressing moments are defined to be code chunks that
took no more time than this threshold. Table 2 summarizes the
significant progress, typical time for significant progress, and the
number of code chunks generated for each assignment.


Figure 3: The elapsed time for all code chunks to make significant
progress by percentile. The green, yellow, blue, and red dashed lines
mark the proposed, earlier, even earlier, and later intervention
times, respectively.


Table 2: A summary of products generated by our algorithm
for the two assignments.

                              Squiral        GG
Sig. Progress                 1.25           1.5
Typical Time for Sig. Prog.   105.3s         84.5s
# Code Chunks                 648            1207
# (%) Struggling Moments      131 (20.2%)    269 (22.3%)
# (%) Progressing Moments     517 (79.8%)    938 (77.7%)
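The final classification reduces to a comparison against the time threshold. The sketch below uses hypothetical function names and label strings; the thresholds passed in would be the per-assignment typical times reported in Table 2.

```python
def classify_chunks(elapsed, time_threshold):
    """Label each code chunk by its elapsed time.

    A chunk is a struggling moment when its elapsed time exceeds the
    threshold, and a progressing moment when it is equal or less.
    """
    return [
        "struggling" if seconds > time_threshold else "progressing"
        for seconds in elapsed
    ]


def struggling_share(labels):
    """Fraction of chunks labelled as struggling moments."""
    if not labels:
        return 0.0
    return labels.count("struggling") / len(labels)
```

With the Squiral threshold of 105.3 seconds, chunks lasting 30, 105.3, and 200 seconds would be labelled progressing, progressing, and struggling. In a live setting the same comparison could be applied to the time since the last significant progress, although that use is an extrapolation from the offline analysis described here.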


3. EVALUATION 


Our evaluation is driven by the following research questions: 


RQ1: To what extent do the human experts agree with the
progressing moments and struggling moments identified by our
method?

RQ2: What are the common reasons that the human experts disagree
with the progressing and struggling moments identified by our
method?


We invited three expert raters, all included as authors, to 
participate in the rating of whether an intervention is needed 
for given code chunk samples. All three experts are com- 
puter science graduate students, two with extensive expe- 
rience analyzing Guessing Game traces, and all three with 
extensive experience analyzing Squiral traces for research. 


We first created the struggling rating sample dataset by pick- 
ing a random struggling moment from each trace until the 
rating dataset contains 20% of all struggling moments for 
each assignment. Then, we created the progressing rating 
sample dataset by selecting the progressing moments im- 
mediately before each struggling moment in the struggling 
rating sample dataset. Finally, three redundant progressing 
moments were excluded from the progressing rating dataset 
because they were shared by consecutive struggling moments 
in the same trace. As a result, our rating dataset ended up 
with 29 struggling moments and 29 progressing moments
for Squiral (out of 131), and 57 struggling moments and 54
progressing moments for Guessing Game (out of 269).


The experts were told to imagine themselves as TAs when 
rating. They used a customized interface that allowed them 
to visually step through the students’ code changes in the 
rating moments to decide whether an intervention is needed. 
The experts used the time elapsed between actions and the 
type of the actions to inform their rating decision. They 
were asked to avoid using hindsight, which means that they 
should not justify their decision for intervening at an earlier 
time using the student’s later actions.


When rating the struggling moments, to make it easy to compare
expert ratings, the experts were given five intervention timing
options. The five options correspond to the potential intervention
times marked by the colored vertical lines shown in Figure 3: blue or
before (quantile(0.55)), yellow (quantile(0.65)), green
(quantile(0.75), the typical time to make significant progress), red
(quantile(0.85)), or “not now” (after red, or never). We chose these
candidate percentiles to gain insights into expert preferences of
intervention times for future analysis. For struggling moments,
experts were shown the code changes from the start until the last
action before the 85th percentile time for significant progress (red)
and asked to decide when an intervention would be most appropriate,
or “not now.” For progressing moments, experts were shown the entire
progressing moment (all code changes within the time period where
significant progress was achieved) and asked to rate whether the
moment “needs intervention” or “not now.” Aside from rating the
struggling and progressing moments, experts were also encouraged to
take notes on why they believed an intervention was needed whenever
they rated a sample as “needs intervention” for both the struggling
and the progressing rating sample datasets. These notes help us
understand the experts’ point of view when inspecting the
disagreements between the experts and our algorithm.


Before formal rating, the experts practiced rating on a training
dataset, discussing their ratings immediately after rating each
sample until they were comfortable with the rating process. Then,
the experts rated the struggling moments independently in three
rounds, each round rating a third of the
samples in the dataset. After each round, the experts gath- 
ered and discussed the differences in their ratings to share 
perspectives and resolve disagreements caused by oversight. 
We did not require the experts to reach a complete consen- 
sus on the rating because experts sometimes have different 
opinions on handling specific situations. Finally, after rat- 
ing the struggling sample dataset, the experts independently 
rated the progressing dataset and were asked to check dis- 
agreements to correct rating errors caused by oversight. 


4. ANALYSIS AND RESULT 


To evaluate RQ1, the extent to which the human experts agree with
the struggling moments and progressing moments identified by our
algorithm, we merged the five rating options the experts used when
rating the struggling chunks dataset into two options, “need
intervention” and “not now.” Specifically, we merged the “at blue or
before,” “at yellow,” and “at green” options into “need
intervention” and merged the “at red” and “not now” options into
“not now.” These two option labels are identical to those used when
rating the progressing moments, so we can directly compare how much
the experts agreed with the generated struggling and progressing
moments. Splitting the options at green, the proposed intervention
time, allows us to determine whether our proposed typical time to
make significant progress is appropriate and sufficient for the
experts to decide whether they believe an intervention is needed.


Our analysis of expert ratings focuses on their ratings after
discussion because those ratings reflect the effort to remove
personal error or bias. However, we report the inter-rater
reliability (IRR), calculated with Fleiss’ kappa, both before and
after discussions to demonstrate the impact of the discussions. To
determine how well the experts agree with the algorithm, we turned
each expert’s ratings into binary scores of 0s and 1s depending on
whether the expert rating agrees with the rating dataset.
Specifically, for struggling moments, the score is 1 if the expert
rated “need intervention” and 0 otherwise. For progressing moments,
on the other hand, the score is 0 if the expert rated “need
intervention” and 1 otherwise. We then take the average of all expert
scores to calculate a combined rating score for each sample, such
that a combined rating of 0 or 1 means that the experts reached an
agreement, and anything between 0 and 1 means that the two or three
experts had different opinions, even after discussion, on whether the
student was making progress or struggling. Finally, we calculate the
agreement rate for each rating dataset by summing up all the
combined rating scores in that rating dataset and dividing the sum
by the total number of samples in the rating dataset for that
assignment.
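The scoring procedure in this section can be sketched as follows. The option strings and function names are illustrative assumptions; the merging rule, the binary scores, the per-sample average, and the agreement rate follow the description above.

```python
NEED = "need intervention"
NOT_NOW = "not now"

# Merge the five struggling-moment options into two labels.
# The keys are hypothetical names for the options described in the text.
MERGE = {
    "blue": NEED,
    "yellow": NEED,
    "green": NEED,
    "red": NOT_NOW,
    "not now": NOT_NOW,
}


def agreement_score(rating, moment_type):
    """1 if the expert rating agrees with the algorithm's label, else 0."""
    merged = MERGE.get(rating, rating)
    if moment_type == "struggling":
        return 1 if merged == NEED else 0
    return 0 if merged == NEED else 1


def combined_score(ratings, moment_type):
    """Average of all experts' binary scores for one sample."""
    scores = [agreement_score(r, moment_type) for r in ratings]
    return sum(scores) / len(scores)


def agreement_rate(samples, moment_type):
    """Sum of combined scores divided by the number of samples."""
    total = sum(combined_score(ratings, moment_type) for ratings in samples)
    return total / len(samples)
```

For three struggling samples rated (green, yellow, blue), (red, green, green), and (not now, not now, red), the combined scores are 1, 2/3, and 0, so the agreement rate is 5/9, or about 55.6%.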


4.1 RQ1: Expert Agreement 


Table 3 shows the percentage of progressing moments and
struggling moments for which the expert ratings agreed with the
algorithm after discussion, as well as the corresponding inter-rater
reliability before and after discussions. Looking at the agreement
rate between the experts and our algorithm, we see that for both
assignments, over 77% of the time, the human
experts agreed that an intervention was needed when the al- 
gorithm determined the student was struggling. Over 85% of 
the time, the human experts agreed that an intervention was 
not needed when the algorithm determined the student was 
making progress. This suggests that our method was able to 
identify struggling moments and progressing moments from 
the trace data with decent accuracy. 


Table 3: Human expert agreement with the algorithm-identified
struggling moments and progressing moments. IRR (B/A) gives the
inter-rater reliability before/after discussion.

                     Struggling Rating                          Progressing Rating
                     N    Need Interv.   IRR (B/A)              N    Not Now   IRR
Squiral (3 raters)   29   77.0%          0.805** / [illegible]  29   85.2%     0.853
GG (2 raters)        57   83.3%          0.539 / [illegible]    54   85.1%     0.819


Looking at the experts’ agreement with each other, we found that
the experts had excellent inter-rater reliability for progressing
moment ratings on both assignments and for struggling moment ratings
on Squiral. However, the experts were only able to reach moderate
agreement, even after discussion, for struggling moments on Guessing
Game. We explored reasons that might cause experts to disagree with
each other by manually inspecting the traces and their notes. We
found that in four out of five struggling moments on which the
experts disagreed, students did not have errors in their snapshots
but performed fewer than six actions, well below the average of 12
actions across all rated struggling moment samples. In such cases,
one expert preferred to hold off on any intervention until seeing
more student actions, whereas the other expert believed that the
student needed a nudge telling them what they should do next. We did
not see such a case in Squiral because all the rated struggling
moments with few student actions had relatively apparent flaws in the
student code that warranted intervention. We will talk more about
this in the discussion section.
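For reference, the inter-rater reliability statistic named above, Fleiss' kappa, can be computed as in the following sketch. It implements the standard textbook formula and is not the authors' analysis code; the input layout (one row of category counts per rated sample) and the function name are assumptions.

```python
def fleiss_kappa(counts):
    """Fleiss' kappa for a table of rating counts.

    counts[i][j] is the number of raters who assigned sample i to
    category j. Every sample must be rated by the same number of raters.
    """
    n_items = len(counts)
    n_raters = sum(counts[0])
    if n_raters < 2:
        raise ValueError("need at least two raters per sample")
    if any(sum(row) != n_raters for row in counts):
        raise ValueError("every sample needs the same number of raters")
    n_categories = len(counts[0])

    # Share of all ratings that fall into each category.
    p_cat = [
        sum(row[j] for row in counts) / (n_items * n_raters)
        for j in range(n_categories)
    ]
    # Observed agreement per sample, then averaged.
    p_item = [
        (sum(c * c for c in row) - n_raters) / (n_raters * (n_raters - 1))
        for row in counts
    ]
    p_observed = sum(p_item) / n_items
    p_expected = sum(p * p for p in p_cat)
    if p_expected == 1:
        raise ValueError("kappa is undefined when all ratings share one category")
    return (p_observed - p_expected) / (1 - p_expected)
```

With two categories ("need intervention", "not now") and three raters, the table [[3, 0], [0, 3]] describes perfect agreement and gives a kappa of 1, while [[2, 1], [1, 2]] gives a kappa of -1/3, below chance-level agreement.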


4.2 RQ2: Common Causes of Disagreement 
We manually investigated the ratings on which the human
experts and our algorithm disagreed. We present some common
causes of disagreement for struggling moments and progressing
moments, respectively.


Disagreement in Struggling Moments 

Solution Matching: A decreased similarity score does not always
mean the student is making negative progress. Due to characteristics
of the SourceCheck algorithm, in some cases, a reduced similarity
score can also be caused by SourceCheck mapping the student’s
previous snapshot and current snapshot to different correct
solutions because the student added a particular code block. In such
cases, if the student’s similarity score does not surpass the maximum
similarity score since the beginning within the expected amount of
time, our algorithm will determine that the student is struggling,
even though the student is making progress (having an increasing
similarity score). From the experts’ perspective, however, the
student may just be using a different approach to solve the problem.


Few Coding Actions: Since our algorithm focuses on students’
progress over time, sometimes students might not have taken enough
actions for experts to determine whether the student is struggling.
There are several possible reasons why students have few actions,
including trying to reason about their code, running their code,
evaluating the result, or being off task. Disagreement in this
situation occurred not only between experts and our algorithm but
also between expert raters, causing relatively lower inter-rater
reliability for struggling moments in the Guessing Game assignment.


Disagreement in Progressing Moments 

Logic Errors: The experts are good at catching critical logic 
errors in the student code and tend to intervene if there 
is a critical logic error in student code that might prevent 
them from completing the feature they are working on. For 
example, in Guessing Game, both experts decided to inter- 
vene when a student set the secret number to a boolean 
value and was trying to use the secret number to give player 
feedback on their guesses because the student would not be 
able to test the feedback feature properly without correctly 
setting the secret number first. In contrast, our algorithm 
considered that the student was making progress because 
the student added blocks seen in the correct solutions.


Human Factors: Sometimes, when deciding on intervention, 
the experts assess the natural language in students’ code to 
infer information about student intention in a way that is 
not possible for our algorithm. For example, in the Guessing 
Game assignment, experts pointed out that an intervention 
is needed when several students used an if/else block instead 
of combining the say block and the join words block to greet 
the players. However, since the if/else block was used for 
giving player feedback in many correct solutions, our algo- 
rithm determined the student was making progress, despite 
the student using the if/else block incorrectly for a different 
purpose. 


5. DISCUSSION 


RQ1: To what degree do the experts agree with the algorithm 
on struggling and progressing moments? For Squiral, the ex- 
perts agreed that an intervention was needed for over 77% 
of the struggling moment samples and agreed that no inter- 
vention was needed for over 83% of the progressing moment 
samples. However, the experts had relatively lower agreement
with each other in Guessing Game ratings, caused by different
opinions on whether an intervention is needed when a student
has few actions within the time frame. Since the expert raters
were not pedagogical experts and had experience levels on par
with experienced teaching assistants
(TAs), we do not know how an experienced instructor would 
react to this or other scenarios. Nevertheless, our results 
show that our method of identifying struggling moments by 
measuring whether a student could make significant progress 
within a typical amount of time aligns well with opinions 
from our experts who have at least as much experience as 
highly qualified TAs. We believe this shows great potential 
to determine when to give proactive interventions to stu- 
dents in intelligent tutoring systems for programming. 


RQ2: What are the common cases that our algorithm does
not handle well? We listed four common causes that led
to disagreements between the experts and our algorithm.
The first cause, “solution matching”, stems from the adoption
of SourceCheck as a progress measure. Since SourceCheck
compares student solutions to multiple model solutions,
adding a code block to a snapshot can sometimes cause
SourceCheck to pick a different solution as the closest one,
with a lower similarity score. This calls for investigation into
how to smooth out the abrupt change in progress when
consecutive snapshots are mapped to different correct
solutions. “Logic errors” and “human factors” are difficult to
solve using distance-based similarity measures alone, since
similarity measures merely compare the mapping cost between
two pieces of code and do not assess the semantics of natural
language or programming logic. Thus, identifying these two
causes may require knowledge from other types of analysis.
Previous research has used compilation errors [11, 19, 2] and
feature detection [14] to detect specific errors and determine
whether the student is struggling. Incorporating these methods
could provide richer information for deciding on struggling
moments. Lastly, some expert disagreements with the algorithm
on struggling moments were caused by “few actions” within
the time frame of the code chunks. Our inter-rater reliability
for Guessing Game shows that even experts had a hard time
agreeing on whether an intervention is needed in this case.
Multiple factors seem to affect the expert decision, including
whether there are errors in the snapshot, the type of actions
performed, and which assignment requirements are complete
or incomplete. Potential solutions to this problem may include
consulting experienced teachers and incorporating sequence
pattern analysis [7].


One important contribution of this work is that our data-driven
method of identifying struggling moments has the potential to
generalize to any other domain that meets the following criteria.


1. Student performance: Good student performance on the
task, meaning that the majority of students achieve correct
or mostly correct final solutions.

2. Trace log data: Time-stamped trace logs that document
students’ snapshots during an assignment.

3. Progress measure: A score, or a combination of correct
solutions and a distance metric between snapshots and correct
solutions, as we devised from SourceCheck.


There are two major ways our method can benefit future 
research. First, our method of identifying progressing and 
struggling moments provides a new way for researchers to 
study novice problem-solving behaviors and identify com- 
mon misconceptions. Second, the significant progress value 
and typical time to make significant progress can be incorpo- 
rated into intelligent tutoring systems for providing proac- 
tive feedback to struggling students. 
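

As a rough illustration of how these two quantities could drive proactive
feedback, the sketch below flags a moment as struggling when the progress
gained over the preceding time window falls short of a threshold. The
snapshot format, the function name find_struggling_moments, and the numeric
values in the usage note are hypothetical and are not taken from this study.

```python
def find_struggling_moments(snapshots, typical_time, significant_progress):
    """Flag timestamps where too little progress was made in the last window.

    snapshots: list of (timestamp_seconds, progress_score), sorted by time.
    typical_time: window length in seconds (hypothetical parameter).
    significant_progress: minimum score gain expected within one window.
    """
    flagged = []
    first_time = snapshots[0][0]
    for i, (t_end, score_end) in enumerate(snapshots):
        # Only judge a moment once a full window of history exists.
        if t_end - first_time < typical_time:
            continue
        # Earliest snapshot that still lies inside the window ending at t_end.
        window = [(t, s) for t, s in snapshots[: i + 1] if t_end - t <= typical_time]
        _, score_start = window[0]
        if score_end - score_start < significant_progress:
            flagged.append(t_end)
    return flagged
```

For example, with snapshots every 60 seconds and scores
0.0, 0.2, 0.45, 0.46, 0.47, 0.47, a 120-second window and a required gain of
0.2 flag the last two snapshots, where progress has stalled.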


This work has some clear limitations. First, we only used 
three expert raters to evaluate the sample result. The expert 
raters were not pedagogical experts and had experience lev- 
els on par with experienced TAs. Hence the evaluation result 
may be different if rated by experienced instructors. In addi- 
tion, our work is limited by a small sample size with only two 
programming assignments. We need further investigation to 
determine if our result holds for assignments with even more 
varied complexity or in other problem-solving contexts. 


6. CONCLUSION 


This work presented a novel, data-driven approach to use 
a similarity measure to model student progress in program- 
ming assignments and identify progressing and struggling 
moments from trace log data. To evaluate the performance
of our algorithm, we asked human experts to rate a sample of
20% of the algorithm-identified progressing and struggling
moments, drawn from trace logs of students solving two
programming assignments, and measured whether the experts
agreed with the algorithm. Our results show that the experts
agreed with over 77% of the struggling moments and over 83%
of the progressing moments, which shows great potential. Our
algorithm can be generalized to other domains that have good
student performance, trace log data, and a progress measure
for in-progress student solution attempts.
ABSTRACT


Automatically detecting bugs in student program code is 
critical to enable formative feedback to help students
pinpoint errors and resolve them. Deep learning models,
especially code2vec and ASTNN, have shown great success
for large-scale code classification. It is not clear, however,
whether they can be effectively used for bug detection when 
the amount of labeled data is limited. In this work, we in- 
vestigated the effectiveness of code2vec and ASTNN against 
classic machine learning models by varying the amount of 
labeled data from 1% up to 100%. With a few exceptions, 
the two deep learning models outperform the classic mod- 
els. More interestingly, our results showed that when the 
amount of labeled data is small, code2vec is more effective, 
while ASTNN is more effective with more training data; for 
both code2vec and ASTNN, the more labeled data, the bet- 
ter. To further improve their effectiveness, we investigated 
the potential of semi-supervised learning which can leverage 
a large amount of unlabeled data to improve their perfor- 
mance. Our results showed that semi-supervised learning is 
indeed beneficial especially for ASTNN. 


Keywords 
CS Education, Machine learning, Program analysis, Bug de- 
tection, semi-supervised learning 


1. INTRODUCTION AND BACKGROUND 


When students encounter difficulties during programming, 
they are often caused by systemic procedural errors, or “bugs” 
[9], which can occur repeatedly across problems [38, 8]. For 
example, a student may confuse when it is appropriate to 
use the and and or operators, or fail to consider a boundary 
case in a condition, using > instead of >= [17]. These bugs 
are rarely directly addressed by the compiler or test-case 
feedback employed in most computer science (CS) courses,
which are generally limited to flagging syntax errors or
reporting which correct input-output pairs the program fails
to replicate. Historically, tutoring systems in a variety of
learning domains have detected these bugs automatically (e.g.
through a bug library [9, 4]). The detection can be used
to offer tailored formative feedback [34] that addresses bugs
directly [22], and can also help instructors to be more
informed about the student learning process [25]. The detection
of bugs often requires experts’ manual definitions, with dis- 
tinct rules for detecting the bug on different problems [4]. 
This can make it impractical to use bug detection in practice. 
Most current automatic grading systems for student code are 
mainly based on test cases, which provide a score and failed 
test case information to students [15, 16, 37]. Nevertheless, 
the relationship between code’s output and the presence of 
specific bugs in student code is not clear, since a given er- 
roneous output could be caused by various errors in student 
code. An automatic bug detection system for student code 
could be useful to fill in the gaps for students. 


Machine learning (ML) algorithms are powerful tools for 
data analysis, which have been commonly used for auto- 
matic programming code analysis [10]. Classical machine 
learning methods, such as support vector machines [13] and 
XGBoost [11], are capable of classifying program code [12, 
21, 18]. Recent advances in machine learning have lever- 
aged structural information in code to accurately classify 
and label it [2, 3, 41, 28]. For example, Alon et al. explored 
path representations on code represented as trees [2], and 
designed the code2vec model to learn the representations 
using deep neural networks [3]. Abstract Syntax Tree based 
Neural Network (ASTNN) by Zhang et al. applied recursive 
neural networks in the structure, outperforming Tree-based 
Convolutional Neural Networks [28] and other state-of-the- 
art models [41]. 


However, to apply the models to detect student program
bugs, two challenges need to be addressed. First, these deep
models were originally designed for professional programs,
which are fundamentally different from code written by
students [39]. Some recent work has applied these techniques to
educational domains [33, 19, 26, 30, 6], but it either used
base models from years earlier [28, 19] or was not specifically
aimed at bug detection [33, 26, 30, 6]. Second, deep learning
models are traditionally “data hungry” [1], using large, labeled
training datasets (e.g. [19] was trained on 270k samples).
However, in most educational settings, datasets can be much 
smaller (e.g. ~100 students), and labeling (e.g. to identify 
bugs) can take extensive expert effort [14]. This suggests the 
potential of leveraging a semi-supervised learning strategy, 
using a mixture of labeled and unlabeled data [42]. Semi- 
supervised learning, such as the Expectation-Maximization 
(EM) method, uses unlabeled data for model improvement 
[42]. However, studies show that the usage of unlabeled 
data may not always help [35]. Thus, an empirical evalua- 
tion, suggested by recent studies [29], to investigate whether 
semi-supervised learning with unlabeled data actually helps 
is needed. 


To address these challenges, in this paper, we evaluate two 
state-of-the-art deep learning methods: code2vec [3] and 
ASTNN [41], on the task of automatically detecting pro- 
gramming bugs in student code. We manually labeled three 
bugs in ~1800 code submissions from 410 students in a Java 
programming course, where each bug occurred in 4-6 distinct 
problems. Our results show that, when using all available 
training data, the ASTNN model performs best at detect- 
ing all three bugs, outperforming code2vec and two classical 
baseline models (support vector machines and XGBoost). 


Furthermore, we investigate whether a semi-supervised learn- 
ing approach can improve the code2vec and ASTNN per- 
formance without requiring additional labeled data. More 
specifically, we investigated how the deep and baseline mod- 
els performed with different amounts of labeled training data 
through a “cold start” analysis [32]. We found that all mod- 
els benefited from more data. However, despite deep models’ 
reputation as “data hungry,” we found the top-performing 
model was generally a deep model, regardless of training 
data size. However, which model performed best depended 
on the data size, with code2vec outperforming ASTNN when 
less labeled data was available. We also found that semi- 
supervised learning generally improves both code2vec and 
ASTNN by using unlabeled data. This effect was most con- 
sistent for ASTNN, where semi-supervised learning consis- 
tently improved the model performance by 5% to 20% on 
all splits. For code2vec, we also found that it required very 
little data (5%) to achieve 80% of its peak F1 score. 


The major contributions of this paper are addressing three 
research questions (RQs): 


• RQ1: How well do state-of-the-art deep learning models
for programming code perform in a student bug
detection task?


• RQ2: How is deep learning models’ performance
impacted by the amount of available training data?


• RQ3: To what extent does semi-supervised learning
improve the performance of the deep learning models?


2. APPROACHES 


In this section, we introduce how we built code2vec and
ASTNN for program classification, and how we applied the
semi-supervised learning strategy to them to leverage
unlabeled data.
Figure 1: Code2vec model structure: the model takes a set
of paths as input, passes them through embedding layers and an
attention layer, then detects whether the input code has a bug
(1) or not (0).


2.1 Code2vec 


One primary technical challenge in applying machine learning
to program code lies in code representation. Code is
often represented using an abstract syntax tree (AST) [7],
while most learning algorithms expect a fixed-length vector.
To solve this issue, sub-components of the ASTs are
used as inputs for deep learning models. The code2vec model
learns a code embedding through leaf-to-leaf paths,
represented as strings. Strings of nodes and paths are mapped
into numbers by tokenizers, where different strings are mapped
into different numbers. These numbers are used as the input
of a code2vec model, shown in Figure 1. Assume we have a code
snippet that produces $R$ paths $(p_0, \dots, p_R)$ to be fed
into the code2vec model. These numbers are embedded into
$e$-dimensional vectors through node and path embedding layers
($W_{enode}$ and $W_{epath}$) respectively, and these node and
path vectors are concatenated together into one vector for each
of the paths $(e_0, \dots, e_R)$. These vectors form a matrix $E$,
where $E \in \mathbb{R}^{e \times R}$. Then these path vectors pass
through a soft attention layer $W_a$ [40], which calculates the
soft attention weight $a$ for each of the paths:
$a = \mathrm{SoftMax}(W_a^{\top} E)$, $W_a \in \mathbb{R}^{e \times 1}$,
and thus $a$ has a scalar weight $a_i$ for each of the paths,
normalized by a SoftMax operator. The embedded path vectors $E$
are then combined with the calculated attention weights via a
dot product, indicating which paths are more important in a
code snippet. The resulting weighted average vector passes
through two fully-connected layers to make the bug classification.
In the training process, all the $W$ weights are updated using
the Adam [23] optimization algorithm, while in the evaluation
and validation processes, the weights in the model are not
changed.
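

The attention step described above can be sketched in plain Python. This is
a minimal illustration rather than the model's actual implementation: path
vectors are stored as rows, w_a stands in for the learned attention vector,
and the function names and toy numbers are assumptions made for the example.

```python
import math

def softmax(scores):
    """Normalize raw scores into weights that sum to 1."""
    m = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [x / total for x in exps]

def attend(path_vectors, w_a):
    """Return (attention weights, attention-weighted sum of the path vectors)."""
    # One scalar score per path: dot product of the path vector with w_a.
    scores = [sum(w * x for w, x in zip(w_a, vec)) for vec in path_vectors]
    alpha = softmax(scores)
    dim = len(path_vectors[0])
    code_vector = [
        sum(a * vec[j] for a, vec in zip(alpha, path_vectors)) for j in range(dim)
    ]
    return alpha, code_vector

# Toy input: three 2-dimensional path vectors and a made-up attention vector.
paths = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
alpha, code_vector = attend(paths, w_a=[2.0, 0.0])
```

In this toy input the first and third paths score equally and receive most of
the weight, so they dominate the resulting code vector.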


2.2 ASTNN 

Different from the path-based inputs for code2vec, ASTNN
utilizes statement-level ASTs to learn a vector for the
code. Specifically, we split the large AST of a code fragment
at the granularity of statements and extract the sequence
of statement trees (ST-trees) with a pre-order traversal, and
feed them as the raw input of ASTNN. Suppose that we
have a set of ST-trees $(s_1, s_2, \dots, s_T)$; our goal is to
learn a vector representation $z$ for the original code. The
detailed architecture of ASTNN is shown in Figure 2.


Figure 2: ASTNN model structure: the model takes a set of
statement trees as input, passes them through an encoder layer,
a Bi-GRU layer, and a max-pooling layer, then detects whether
the input code has a bug (1) or not (0).


Statement Encoder: Each ST-tree is composed of a root
node and its child indices from a limited vocabulary of up to
$V$ symbols. For an ST-tree $s_i$, we first represent all nodes with
the pre-trained embedding matrix $W_{embed} \in \mathbb{R}^{V \times d}$,
where $V$ is the vocabulary size and $d$ is the embedding dimension.
Thus the initial vector of a node $n$ can be obtained by:


$v_n = W_{embed} x_n$ (1)


where $x_n$ is the one-hot encoding of node $n$. Next the ST-tree
goes through a Recursive Neural Network [36] based
encoder layer that updates the vector for each node using the
encoding matrix $W_{encode} \in \mathbb{R}^{d \times k}$, where $k$ is the
encoding dimension, the node vector $v_n$ obtained from Equation 1,
a bias term $b_n$, and an activation function $\sigma$. In this work
we followed the original paper in setting $\sigma$ to the identity function.
After recursively computing the vectors of all nodes in
the ST-tree, we sample the final representation $e_i$ for $s_i$ via
a max-pooling layer.
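

A sketch of the recursive update performed by the statement encoder, assuming
it matches the original ASTNN formulation [41] with $W_{encode}$ as the
encoding matrix. The symbols $h_n$ (hidden state of node $n$), $C$ (number of
children of $n$), and $h_i$ (hidden state of the $i$-th child) are borrowed
from that formulation and are not defined in this section.

```latex
h_n = \sigma\Big( W_{encode}^{\top} v_n + \sum_{i \in [1, C]} h_i + b_n \Big)
```

With $\sigma$ set to the identity, each node's state is its encoded embedding
plus the sum of its children's states and the bias.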


Code Representation: Based on the sequence of ST-tree
vectors, a bidirectional GRU [5] is applied to track the
naturalness of the statement sequence $(e_1, e_2, \dots, e_T)$,
where $T$ is the number of ST-trees in the AST:


$h_i = [\overrightarrow{\mathrm{GRU}}(e_i), \overleftarrow{\mathrm{GRU}}(e_i)], \quad i \in [1, T]$ (3)


The statement representation is $h \in \mathbb{R}^{T \times 2m}$, where $m$
is the embedding dimension of the GRU. Finally, similar to the Statement
Encoder, a max-pooling layer is used to sample the most
important features on each of the embedding dimensions. Thus
we get $z \in \mathbb{R}^{2m}$, which is treated as the final vector
representation of the original code fragment. The $z$ vectors
then pass through a linear layer to make the final classification of
the bugs.
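

The pooling step above can be illustrated with a few lines of Python. The
hidden states here are made-up stand-ins for the forward and backward GRU
outputs, and the function names are hypothetical; the sketch only shows how
concatenation and max-pooling turn T statement states into one code vector.

```python
def bidirectional_states(forward_states, backward_states):
    """Concatenate forward and backward hidden states per statement (2m values)."""
    return [f + b for f, b in zip(forward_states, backward_states)]

def max_pool(states):
    """Element-wise maximum over the statement sequence, one value per dimension."""
    return [max(column) for column in zip(*states)]

# Toy example: T = 3 statements, m = 2 hidden units per direction.
forward = [[0.1, 0.9], [0.4, 0.2], [0.3, 0.5]]
backward = [[0.7, 0.0], [0.2, 0.6], [0.8, 0.1]]
h = bidirectional_states(forward, backward)  # T rows of 2m values
z = max_pool(h)                              # single vector of length 2m
```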


2.3 Semi-supervised Learning Strategy 

While we explore the potential of machine learning mod- 
els using insufficient labeled data as training inputs, unla- 
beled data can also serve as an important resource for the 
models to learn the structure of code. We applied a semi- 
supervised learning strategy to utilize these unlabeled data 
to help the model update. Specifically, in our experiments,
we used the Expectation-Maximization (EM) method [42] as an
exploratory attempt.


The EM method is iterative, and every iteration contains two
steps: 1) In the expectation step, the model runs inference on
the unlabeled dataset, producing a probability score for each of
the unlabeled code snippets, which serves as the pseudo-label
in the next step; 2) In the maximization step, the
model is retrained using all of the labeled training data and
the unlabeled set with the pseudo-labels from the expectation
step. After retraining, the model is used for the next round
of the expectation step. In our case, deep learning models are
designed to output probability scores, but SVM and XGBoost
models make classifications without clear scores or
probabilities. We therefore implemented the regression versions
of these models, treating their continuous regression output as
a probability. We then used 0.5 as the probability threshold to
binarize the output into classification results. Every model
uses the same 10 iterations of EM steps, assuming the models
are able to converge after a
certain number of iterations and retraining.
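

The loop described above can be sketched as follows. To keep the example
self-contained, a toy one-dimensional classifier with soft labels stands in
for code2vec or ASTNN; the model, the function names, and the data are
assumptions made for illustration, and only the alternation of the two steps
and the 0.5 threshold mirror the text.

```python
import math

def fit_centroids(xs, ys):
    """Toy 1-D model: weighted class means; ys are soft labels in [0, 1]."""
    w1 = sum(ys)
    w0 = len(ys) - w1
    mu1 = sum(x * y for x, y in zip(xs, ys)) / w1
    mu0 = sum(x * (1 - y) for x, y in zip(xs, ys)) / w0
    return mu0, mu1

def predict_proba(model, xs):
    """Probability of class 1, higher when x is closer to the class-1 mean."""
    mu0, mu1 = model
    return [1 / (1 + math.exp((x - mu1) ** 2 - (x - mu0) ** 2)) for x in xs]

def em_train(x_lab, y_lab, x_unlab, n_iter=10):
    """Alternate pseudo-labeling (E-step) and retraining (M-step)."""
    model = fit_centroids(x_lab, y_lab)  # start from labeled data only
    for _ in range(n_iter):
        pseudo = predict_proba(model, x_unlab)                  # expectation step
        model = fit_centroids(x_lab + x_unlab, y_lab + pseudo)  # maximization step
    return model

model = em_train([0.0, 4.0], [0, 1], [0.5, 1.0, 3.0, 3.5])
# Binarize with the 0.5 threshold to obtain class predictions.
predictions = [int(p >= 0.5) for p in predict_proba(model, [0.2, 3.8])]
```

The toy model needs both classes in the labeled set; a real implementation
would also monitor whether the pseudo-labels stop changing between iterations.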


3. EXPERIMENT SETTINGS 
3.1 Dataset and Bug Labeling 


We performed bug classification on a publicly available dataset
collected from an entry-level Java programming class in Spring
2019. It was collected from the CodeWorkout [15] platform
and stored in ProgSnap2 [31] format. Since the Java compiler
can already detect bugs in code that fails to compile (due
to syntax errors), and this code cannot be converted into
an AST, we excluded uncompilable code from our analysis.
We also did not use code that passed all test cases, as this
code is correct and therefore is very unlikely to have bugs.
There are 410 students, who attempted a total of 50 problems
from 5 assignments. Typical solutions for these assignments
range from 10 to 20 lines of code. In order to determine the
common set of bugs across different problems, two authors
examined student code from six distinct programming problems
from the first assignment and identified common bugs
that arose. They then selected 3 prevalent ones after calculating
the coverage of bugs in each problem, choosing bugs that were
also identified in prior CS education literature [17, 20]. These
included 2 logical bugs and 1 syntax bug: comparison-off-by-one
(logical), assign-in-conditional (syntax), and and-vs-or
(logical), defined below:


comparison-off-by-one: This bug occurs if, in a condi- 
tional expression (e.g. in an if or while), the student’s 
code uses a greater/less-than comparison operator (<=, >=, 
<, >) incorrectly, and this error can be resolved by adding 
or removing the ‘=’, (e.g. < becomes <=). The direction of 
comparison (i.e. <= vs >=) should already be correct. This 
often indicates an “off by one” error, and it is contextual, de- 
pendent on the number of literals being compared. If there 
are multiple bugs, including this bug, we still count it. 
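

A small illustration of the comparison-off-by-one bug. The study's data is
Java; this Python analogue, including the function names and the cutoff of
60, is hypothetical and only shows the pattern in which adding the '=' fixes
the boundary case.

```python
def is_passing_buggy(score, cutoff=60):
    # Bug: '>' excludes the boundary value, so a score of exactly 60 fails.
    return score > cutoff

def is_passing_fixed(score, cutoff=60):
    # Fix: adding '=' includes the boundary; the direction was already right.
    return score >= cutoff
```

The two versions disagree only at the boundary value, which is why this bug
depends on the problem requirements and is hard to spot from the code alone.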


assign-in-conditional: This bug occurs if, in a condi- 
tional expression, a student uses the = assignment operator 
in their code when trying to compare a variable with another 
value, rather than the correct == comparison operator. This 
is a syntax-based bug, but it is not detected by the compiler, 
since the assignment is logically a valid operation. 


Table 1: Detection performance for four classifiers on three
bugs.


Bug | Method | Accuracy | AUC | Precision | Recall | F1 Score (Std)
comparison-off-by-one | SVM | 0.753 | 0.658 | 0.731 | 0.100 | 0.173 (0.045)
comparison-off-by-one | XGBoost | 0.505 | 0.54 | 0.384 | 0.547 | 0.334 (0.088)
comparison-off-by-one | Code2Vec | 0.736 | 0.746 | 0.500 | 0.556 | 0.522 (0.058)
comparison-off-by-one | ASTNN | 0.785 | 0.704 | 0.606 | 0.533 | 0.560 (0.090)
assign-in-conditional | SVM | 0.943 | 0.959 | 0.918 | 0.627 | 0.733 (0.099)
assign-in-conditional | XGBoost | 0.847 | 0.877 | 0.494 | 0.726 | 0.563 (0.112)
assign-in-conditional | Code2Vec | 0.917 | 0.907 | 0.725 | 0.688 | 0.672 (0.119)
assign-in-conditional | ASTNN | 0.970 | 0.90 | 0.961 | 0.807 | 0.868 (0.094)
and-vs-or | SVM | 0.722 | 0.674 | 0.534 | 0.173 | 0.256 (0.078)
and-vs-or | XGBoost | 0.503 | 0.669 | 0.350 | 0.784 | 0.470 (0.045)
and-vs-or | Code2Vec | 0.758 | 0.82 | 0.570 | 0.663 | 0.609 (0.078)
and-vs-or | ASTNN | 0.880 | 0.837 | 0.820 | 0.739 | 0.773 (0.064)


and-vs-or: This bug occurs if a student uses the logical
operator and instead of or in their code, or vice-versa, such
that the opposite operator would produce correct code. This
is also a logical bug that requires contextual information but
is easier to detect than comparison-off-by-one. It requires
the literals, but does not depend much on problem
requirements.
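

A small illustration of the and-vs-or bug, again as a hypothetical Python
analogue of the Java submissions in the dataset. The function names and the
range-check task are made up for the example.

```python
def in_range_buggy(x, low, high):
    # Bug: with 'or', every number satisfies at least one side when low <= high.
    return x >= low or x <= high

def in_range_fixed(x, low, high):
    # Fix: swapping in the opposite operator yields the intended check.
    return x >= low and x <= high
```

Here the buggy version returns True for any input, so the mistake can often
be recognized from the literals and operators without knowing the full problem
statement.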


Two authors started by labeling 20% of the data, following
the same set of initial bug definitions. The labeling process
was iterative: the two authors first labeled 20% of the data
independently and then calculated Cohen’s Kappa scores (κ).
If, on any of the three bugs, the two authors did not achieve
a score higher than 0.8 [24], the authors discussed
and resolved the disagreements, refined the definitions, and
continued to another round of independently labeling 10% of
the data. This process continued until the authors reached
high agreement (κ > 0.8) on all three categories of bugs,
which occurred after labeling 40% of the data. The first
two rounds of labeling did not achieve a high κ score, both
due to low scores on the comparison-off-by-one bug,
suggesting that this bug may be more difficult for humans to
detect consistently. In the third round of labeling, the two
authors achieved κ scores of 0.81, 0.97 and 0.84 on the three
bugs. The authors then divided the remaining 60% of the data,
with each person labeling 35% and the two overlapping on 10% of
the data for verification. This 10% of the data achieved κ
scores of 0.78, 0.98 and 0.95, indicating moderate to near-perfect
agreement [27]. The finalized labeled dataset has an imbalanced
distribution, as only 30% of the submissions have the
comparison-off-by-one bug, 28% have the and-vs-or bug, and
13% have the assign-in-conditional bug. In total, we spent
around 20 hours and labeled 1867 code snippets from 296
students.
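

For reference, Cohen's Kappa for two raters can be computed as in the sketch
below, which follows the standard definition (observed agreement corrected by
chance agreement). The function name and the toy label lists are assumptions
for illustration and do not reproduce the study's labels.

```python
from collections import Counter

def cohens_kappa(rater_a, rater_b):
    """Cohen's Kappa for two equal-length lists of categorical labels."""
    n = len(rater_a)
    # Observed agreement: fraction of items where both raters gave the same label.
    observed = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    # Chance agreement: product of the raters' marginal label frequencies.
    freq_a, freq_b = Counter(rater_a), Counter(rater_b)
    expected = sum(freq_a[label] * freq_b[label] for label in freq_a) / (n * n)
    # Undefined when expected == 1 (both raters always use one identical label).
    return (observed - expected) / (1 - expected)

# Toy labels for one bug: 1 = bug present, 0 = bug absent.
author_1 = [1, 1, 0, 0, 1, 0, 0, 0, 1, 0]
author_2 = [1, 1, 0, 0, 1, 0, 0, 1, 1, 0]
kappa = cohens_kappa(author_1, author_2)
```

In this toy case the raters agree on 9 of 10 items and chance agreement is
0.5, giving a Kappa of 0.8, which sits exactly at the threshold used above.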


3.2 Splits in Experiments 

Since our dataset included multiple attempts from a given
student, we split our data into training and testing sets by
student. This ensured that a given student’s code showed
up in either the training or testing set, but not both. In our
experiment, we hold out 20% of the data as the test set, and the
remaining 80% is used for model generation. To check the
performance of models with limited labeled data, we further split
the 80% of data into labeled data and unlabeled data. We
use only labeled data for supervised learning, and use both
labeled and unlabeled data for the semi-supervised setting. All
these splits were stratified according to the class label and
number of submissions, ensuring that a similar proportion
of buggy/non-buggy programs were in each split. This is
necessary, since splitting by students can create very biased
distributions, especially when we only have small labeled
training sets. The stratification uses thresholds for 1) the
ratio of bugs and 2) averaged submission numbers for students
in respective bug groups. We argue that in practice,
we should be able to select a similarly representative sample
by manually checking several submissions to see if the
distribution is fundamentally different. To ensure we evaluate our
model performance with fair comparisons, we created 10
different splits, generated randomly. All models use the same
training/testing splits, and average performance metrics are
reported as the results. For the semi-supervised setting, we
varied the size of labeled/unlabeled data to evaluate the
performance of models. In order to perform fair comparisons,
all semi-supervised models have the same labeled/unlabeled
splits. Also, all models are tested on the same test sets,
regardless of the model, the amount of training data, or
supervised vs. semi-supervised training. These settings ensured
fair comparisons across different models.
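

A minimal sketch of a student-level split, assuming each submission is a
(student_id, code, label) tuple. The function name and data layout are
hypothetical, and the stratification thresholds described above are omitted;
the sketch only shows how splitting on student IDs keeps a student's
submissions on one side.

```python
import random

def split_by_student(submissions, test_fraction=0.2, seed=0):
    """Split (student_id, code, label) rows so no student appears in both sets."""
    students = sorted({student for student, _, _ in submissions})
    rng = random.Random(seed)  # a different seed gives a different random split
    rng.shuffle(students)
    n_test = max(1, round(len(students) * test_fraction))
    test_students = set(students[:n_test])
    train = [row for row in submissions if row[0] not in test_students]
    test = [row for row in submissions if row[0] in test_students]
    return train, test
```

Calling the function with ten different seeds would produce ten random splits;
a stratified version would additionally reject or rebalance splits whose bug
ratio or submission counts deviate too far from the full dataset.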


3.3 Model Settings

SVM and XGBoost Parameters: We performed grid search 
on hyperparameters for SVM and XGBoost models using 
cross-validation on the training sets. In the SVM setting, 
we searched linear and Radial Basis Function (RBF) kernels, 
with C parameters in a range of (0.1 - 1), stepping by 0.1. 
In the XGBoost setting, we searched sub-sample ratios
from 0.1 to 1, stepping by 0.1, using 5
to 100 estimators in the model. To prepare numerical input,
we used TF-IDF feature extraction on the code submissions 
for both models. 
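

The search spaces described above can be enumerated as in this sketch. The
parameter names kernel, C, and subsample follow common library conventions
and are an assumption about the implementation; the step between 5 and 100
estimators is not stated in the text, so that axis is left out.

```python
from itertools import product

# SVM: two kernels crossed with C from 0.1 to 1.0 in steps of 0.1.
kernels = ["linear", "rbf"]
c_values = [round(0.1 * i, 1) for i in range(1, 11)]
svm_grid = [{"kernel": k, "C": c} for k, c in product(kernels, c_values)]

# XGBoost: sub-sample ratios from 0.1 to 1.0 in steps of 0.1.
subsample_values = [round(0.1 * i, 1) for i in range(1, 11)]
xgb_grid = [{"subsample": s} for s in subsample_values]
```

Each configuration in these lists would then be scored by cross-validation on
the training set, and the best-scoring one kept.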


Code2vec and ASTNN Parameters: Since deep learning models
are more time- and resource-consuming, and our cold
start experiments required many repeated runs (~100 runs),
we did not perform automatic grid search; rather, we used
default settings of the hyper-parameters and made manual
changes. In code2vec, after observing the training and
validation loss, we set the maximum training epochs to 200,
with the patience of early stopping set to 100, and set the
learning rate to 0.0002. Linear layer and embedding
dimensions were kept at the default value of 100. To ensure
the highest efficiency of the model, we set the batch size
to the full batch. We tried other values for these parameters
but observed little change in validation accuracy.
We also manually padded the number of paths to
100 over all code submissions. In ASTNN, we padded the
statement sequences to the maximum length to accommodate
the longest sequence before feeding them to the Bi-GRU. During
training, we used 32 as the batch size, 0.001 as the learning
rate, and kept the max training epochs at 50. The encoding dimension
for the statement encoder was 128, and the Bi-GRU had 100
hidden neurons. The weights were learned during training
using the Adam optimizer for both the code2vec and ASTNN models.
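

The interaction between the epoch limit and the early-stopping patience can
be sketched as below. The sketch assumes early stopping monitors validation
loss, which the text does not state explicitly, and it replays a list of
precomputed losses instead of training a model; the function name is
hypothetical.

```python
def early_stopping_run(val_losses, max_epochs=200, patience=100):
    """Return (best_epoch, epochs_run) for a sequence of validation losses."""
    best_loss, best_epoch, epochs_run = float("inf"), None, 0
    for epoch, loss in enumerate(val_losses[:max_epochs]):
        epochs_run = epoch + 1
        if loss < best_loss:
            best_loss, best_epoch = loss, epoch  # new best: reset the wait
        elif epoch - best_epoch >= patience:
            break  # no improvement for `patience` epochs
    return best_epoch, epochs_run
```

With a patience of 100 and at most 200 epochs, training stops early only if
the best loss occurs before epoch 100 and is never beaten afterwards.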


4. RESULTS 


4.1 Bug Detection Model Performance 

In this subsection we address RQ1: How well do state-of-the-art
deep learning models for programming code perform
in a student bug detection task? Table 1 shows the results of
the classifiers in the task of detecting bugs across problems.
We use accuracy, area under the curve (AUC), precision (P),
recall (R), and F1 score as the evaluation metrics for the
detection. Across the three bugs, we observe
that the top-performing model (ASTNN) achieved F1 scores of
0.560, 0.868, and 0.773 on detecting comparison-off-by-one,
assign-in-conditional, and and-vs-or respectively.
The F1 scores indicate that the models achieved a higher
performance on detecting assign-in-conditional compared
to comparison-off-by-one and and-vs-or. ASTNN achieved the
best F1 score on all three bugs.
On the two logical bugs, ASTNN achieved at least 0.226
higher F1 score than SVM and XGBoost, while code2vec
also achieved at least 0.139 higher, showing that deep learning
models are preferable for detecting the two logical bugs
across the six problems. On the detection of the
assign-in-conditional bug, ASTNN achieved a high F1 score of
0.868, while a simple SVM model was able to achieve a 0.733
F1 score, which is not much lower than the ASTNN model.
However, the recall of SVM is low (0.627), which indicates
a limited capability of detecting the bug among submissions
that contain it. The code2vec model did not achieve better F1
or AUC scores than the SVM or ASTNN model in this case,
suggesting that for the detection of syntax issues, path features
might be overly complicated. SVM might perform well
when the real rule to learn is just “If it has =
instead of == in the code, it has the bug,” since there is little
contextual information to learn. Generally speaking, when
using all 80% of the labeled dataset (~1493 programs on average),
deep learning models perform better than traditional
machine learning models in detecting logical bugs,
showing the advantage of leveraging structural information
in the feature extraction step.


Figure 3: F1 score of models using supervised strategy with different portions of labeled data in detecting bugs.
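

The precision, recall, and F1 metrics used in this subsection follow their
standard definitions, sketched below for binary labels (1 = bug present).
The function name and toy labels are assumptions for illustration. Because
the reported values are averaged over 10 splits, an averaged F1 score need
not equal the F1 computed from the averaged precision and recall.

```python
def precision_recall_f1(y_true, y_pred):
    """Precision, recall, and F1 for binary labels where 1 is the positive class."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    # F1 is the harmonic mean of precision and recall.
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1

# Toy predictions: 3 true positives, 1 false positive, 1 false negative.
p, r, f1 = precision_recall_f1([1, 1, 1, 0, 0, 0, 0, 1], [1, 0, 1, 0, 1, 0, 0, 1])
```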


4.2 Bug Detection with Limited Labels 

We address RQ2 in this subsection: How is deep learning
models’ performance impacted by the amount of available
training data? Figure 3 shows the F1 scores of the
four models under the supervised strategy using subsets of the
labeled data. The x-axis is the log-scaled labeled data size, and the
y-axis is the F1 score that models achieved across the 10
different splits. The lowest portion of labeled data we use
is 1%, which contains around 15 students, while the highest
portion is 80%. The general trend of the supervised models
shows that when more data is used, better F1 scores can
be achieved by models, especially ASTNN. We also observe
some interruptions in the increase in performance as
more data becomes available, meaning that more data is not
guaranteed to generate better models. For the baseline
models, this data-performance relationship is weaker, but
more data still generally produces better models.


While the models are expected to perform better given more
data, we would like to note that among all supervised models,
code2vec achieved better results than other models using
a small subset of labeled data, showing a warm-start
property. With 10 percent of labeled data, code2vec has at
least 7.5% higher F1 scores than any other model on all
three detected bugs. When more data is used, ASTNN
outperforms other models, showing that there is generally at
least one deep learning model preferable to the baseline
models. When comparing code2vec with ASTNN, we find
that deep learning models are not always “data-hungry”:
although both models are sensitive to data size, code2vec
starts higher than baseline models in classifying all three
bugs. To achieve a good detection result, using 30%-40%
(560-747) less labeled data would create models achieving 
80% of the F1 score. 


With these results we are able to conclude the answer for 
RQ2: For code2vec and ASTNN, more data would produce 
models with better performance. However, the relation is 
not linear: ASTNN is more “data-hungry” than code2vec, 
but these deep learning models do not require lots of data 
points to perform better than baselines. 


4.3 Application of Semi-supervised learning 
Figure 4: F1 score of models using semi-supervised strategy with different portions of labeled data in detecting bugs. Red circles 
mark places where the semi-supervised strategy outperformed supervised training with full data. 


This subsection addresses RQ3: To what extent does 
semi-supervised learning improve the performance of the deep 
learning models? Figure 4 shows the semi-supervised learning 
results for all four models and the comparisons to 
supervised ASTNN and code2vec models. The labeled training 
data for each split is exactly the same as that used in the 
supervised settings. While the results give a mixed signal about 
whether semi-supervised learning is beneficial for all 
models, we have two observations. 1) Semi-supervised learning 
enhanced the learning of deep models, especially ASTNN, on 
all three bugs. Comparing the black lines, we found that 
solid lines are always higher than dashed ones. This 
suggests that ASTNN, as the more “data-hungry” model, benefits 
from the semi-supervised strategy more than the other models. 
Typically, an ASTNN model trained with a semi-supervised 
learning strategy achieves 0.05 to 0.2 higher F1 scores than 


those trained with a supervised learning strategy, using the same 
training dataset. For code2vec, which is also a deep 
learning model, semi-supervised learning does not always help. 
It helps code2vec achieve better F1 scores when using a 
lower portion of data, but when given more data, a 
supervised learning strategy provides better performance. 
Semi-supervised learning does not help much for the other two 
classical models, compared with deep models. 2) In the 
semi-supervised learning scenario, ASTNN achieved better 
performance when using 70% labeled data than when using all 80% 
as training data in detecting two bugs, assign-in-conditional 
and and-vs-or, by 2.8% and 7.1% respectively (red-circled 
in Figure 4). While this may reflect fluctuation in model 
performance, we did run the models 10 times. This suggests 
that the model may be harnessing the semi-supervised 
learning strategy to infer labels for unlabeled sets, and achieving 
more consistent labels than the authors, or that some outliers 
are present in the unlabeled set. We assume that the model then 
learned on these automatically inferred labels and achieved 
better results than learning from all expert labeled data. 


Our conclusion for RQ3 is that semi-supervised learning of- 
ten improves performance, especially when little training 
data is available. It enables the models to achieve an ex- 
pected performance with less labeled data than the super- 
vised scenario. Specifically, semi-supervised learning helped 
all cases in the learning of ASTNN models, and helped 
code2vec overall as well, especially when data size is low. 


5. DISCUSSION AND CONCLUSION 


Our results suggest three primary conclusions: 1) The two 
deep learning models generally outperformed baselines, and 
ASTNN had the best performance. Our results from Subsec- 
tion 4.1 show that deep learning models can detect simpler 
bugs, but still have a limited effectiveness on more com- 
plicated bugs (detailed in Subsection 3.1). The complex- 
ity of the comparison-off-by-one bug may be due to the 
difficulty of the labeling process, or its dependency on the 
problem context. 2) Deep learning models may still be suc- 
cessful when labeled data is limited. From the results in Sub- 
section 4.2, we learn that even if training with small data 
size such as < 100 data points in complicated programming 
data, the code2vec model is still able to outperform base- 
line models. 3) Semi-supervised learning has the potential to 
help deep learning models perform better. Semi-supervised 
learning helped code2vec to achieve a higher performance, 


but only when a small number of data points are labeled. 
One may assume the difference between the two deep 
learning models comes from their structures, but it may also come 
from the feature extraction process. Code2vec uses path-based 
features, whereas ASTNN uses node-based features that are 
recursively processed by neural networks. 


Our results can have other potential applications in edu- 
cational program analysis tasks as well. For example, as 
features are automatically extracted from student code dur- 
ing code2vec or ASTNN training, these features can be used 
to help instructors discover new bugs, as suggested by [33], 
which can help shape instruction. If more features such as 
problem requirements and test case inputs are available, we 
can apply these features to the model introduced by [30] 
to propagate instructor feedback to all students who would 
benefit from it. 


This work also has several caveats or limitations at 
the current stage. 1) We only performed extensive 
experiments on three bugs and used them to draw general 
conclusions. This is because the dataset labeling is time-consuming, 
requiring the authors to label ~ 1800 data points. 
The conclusions here may not generalize to other bugs or 
code classification tasks. 2) Similarly, these bugs also come 
from one programming assignment near the beginning of the 
course, focused on if conditions, and thus may be biased to 
this specific type of problem. 3) In the splitting process, 
we performed stratified sampling, requiring that test, 
labeled, and unlabeled data have a similar distribution of class 
labels and number of attempts. 4) Since we only 
compared our models with two classical model baselines, there 
may be other models with better performance. 
We made our best effort to select representative models that 
achieve state-of-the-art performance, but there might be 
better models available for the task as well. This work’s 
primary goal is to lay the foundation for using deep models 
in this task by exploring whether the “data-hungry” property also 
applies here, and the potential applications of semi-supervised 
learning. It serves as a step towards future model designs 
specific to automatic student bug detection, and provides 
guidelines for situations when labeled data is limited. 


ABSTRACT 


The quality of exams drives test-taking behavior of exam- 
inees and is a proxy for the quality of teaching. As most 
university exams have strict time limits, and speededness is 
an important measure of the cognitive state of examinees, 
this might be used to assess the connection between exams’ 
quality and examinees’ performance. The practice of ran- 
domization within university exams enables the analysis of 
item position effects within individual exams as a measure 
of speededness, and as such it enables the creation of a mea- 
sure of the quality of an exam. In this research, we use 
generalized linear mixed models to evaluate item position 
effects on response accuracy and response time in a large 
dataset of randomized exams from Utrecht University. We 
find that there is an effect of item position on response time 
for most exams, but the same is not true for response accu- 
racy, which might be a starting point for identifying factors 
that influence speededness and can affect the mental state 
of examinees. 


Keywords 
exam quality, computerized testing, item response time, item 
position effect, speededness, speed-accuracy trade-off 


1. INTRODUCTION 


The quality of standardized high-stakes tests can be seen as 
a driver of test-taking behaviors and mental states of test 
takers. The structured format of these tests, with strict 
time limits and high consequences attached to the test result, 
leads test takers to a situation in which they feel more 
or less comfortable [19]. With the introduction and spread 
of computerized testing in high stakes tests, more data can 
be collected on high-stakes tests than in the past. These 
data can be used to monitor the quality of measurement in- 
struments of individual items and exams as a whole. Among 
the collected data, an important source of information is 
response times. Response times can be used to gain more 
information on the test-taking behavior of the examinees [17, 
1] and the functioning of the exam and exam questions. As 
a proxy for exam quality, higher education institutions com- 
monly use reliability measures such as the Cronbach’s alpha 
[5], although literature showed that this indicator, if speed- 
edness is present in the exam, might be underestimated lead- 
ing to reliability concerns [2]. However, reliability is not the 
only measure of exam quality: test takers’ behavior, in 
particular speededness, might be relevant to lecturers and test 
creators. Thus, investigating the presence of speededness in 
a test is not only important to know whether the commonly 
used reliability measure can be trusted, but it can also be 
used to propose a new indicator of exam quality that takes 
into consideration the cognitive state of examinees, relating 
tests’ quality and examinees’ performance. 


Traditional measures of speededness only take into account 
whether examinees provide responses to all exam questions 
and are not missing a large proportion of items at the end 
of the exam [13]. However, fully missing responses at the 
end of the test is not the only way in which speededness 
manifests itself [16]. An important way in which time pres- 
sure can be observed is the increase of speed and decrease 
of accuracy close to the end of the test [11]. This behavior 
can be operationalized as the effect of item position on re- 
sponse time and response accuracy. When exam items are 
administered to all students in the same order, as is often 
the case in traditional high-stakes achievement tests, the 
effect of item position cannot be separated from the effect 
of item properties. However, since for test security reasons 
exams are now more often administered with a randomized 
item order, it becomes possible to study item position effects 
separately from item effects. 


We have a large data set of computerized exams adminis- 
tered at Utrecht University between January 2015 and June 
2020. Using these data, we want to study the overall effect 
of item position on response time and response accuracy. 
Furthermore, for each exam we want to quantify the effect 
of item position on test performance which can be used as 
an indicator of test quality and of the mental state of test 
takers. To answer these questions, we focus on three key 
points. First, we uncover that responses to later items in 
exams have an increased speed, in conformity with 
previous studies on anxiety and test strategies within high-stakes 
tests [19, 8]. Second, we notice the lack of a relationship 
between item position and accuracy that, if analyzed in parallel 
with our first finding, might give us indirect evidence that 
increased response speed does not seem to have the expected 
negative effect on accuracy. This might show that successful 
test-taking strategies might include increased response speed 
towards the end of a test, as previous qualitative analyses al- 
ready seem to show [18, 10], and that performance-reducing 
mental states do not appear to influence response speed. 


1.1 Problem Statement 


Currently, most tests used in European higher education in- 
stitutions are non-adaptive computerized tests, that often 
have a large number of multiple-choice items, no penalty for 
incorrectly responded items and use a test-based time limit 
instead of a section-based time limit, as is usual in adaptive 
assessments. High-stakes non-adaptive computerized tests 
have not been researched and investigated as frequently as 
their adaptive counterparts and datasets on these tests are 
not widely available. Thanks to the advance of computer- 
ized testing within Dutch higher education institutions, we 
now have a multitude of data available that were not avail- 
able before concerning high-stakes tests. Among these data, 
response times for each examinee on each item and the (ran- 
domized) item positions are saved. As test developers, when 
developing their tests, must find a balance between testing 
time requirements and difficulty and given that this balance 
depends on the type of the test and the needs of lecturers 
and students, we believe that exam-specific effects of item 
position on response time and response accuracy can provide 
them with useful information. Therefore, using the dataset 
at hand we set out to answer the following questions: 


1. What is the effect of item position on response time 
and accuracy? 


2. What can be inferred concerning test-taking behavior 
by analyzing the influence of item position on response 
time and accuracy? 


1.2 Contribution 


Answering our research questions, we make several 
contributions to the field: (1) using generalized linear mixed models 
that make use of item position in a dataset of university 
exams, we analyze the effect of item position on response 
time and accuracy; (2) we provide indications concerning 
the relation between examinees’ response time and response 
accuracy within randomized high-stakes tests. 


This paper is structured as follows: in Section 2, we 
provide the background from which this work stems. Section 3 
describes the data and the models used in the analysis. 
Section 4 compares the results of the fitted models. 
Section 5 discusses the results of the models and provides 
the ground for the conclusion in Section 6. 


2. BACKGROUND 


This research revolves around data collected at a higher ed- 
ucational institution concerning the results of tests adminis- 
tered using computers. The computerized collection of stu- 
dents’ answers enables the creation of a dataset containing, 
among other data, the response time data of each student 


on each test item. We want to make use of this information 
to help us better understand response processes and, as a 
consequence, improve measurement instruments. 


Interest in response time as a method of revealing 
information about mental activity has a long history [15] and other 
research aims to be relatively comprehensive on the domain 
[6]. Here, we focus on the specific features of the approach 
relevant to our data and findings. Recently the role of 
response time modeling rose to a central position with novel 
works on the interplay between accuracy and response time 
[21, 9] and on item position [7]. Traditionally, two main 
effects of item position have been distinguished: a learning 
effect when items in later position become easier and a fa- 
tigue effect when items become instead more difficult [7]. In 
both cases, item position effects refer to the impact of the 
position of an item within an exam on the response time 
and on the response accuracy. Research commonly assumes 
that an increase in the speed of response will result in a 
decrease in accuracy [9]. This relationship, called speed- 
accuracy trade-off (SAT), is understood as a within-person 
phenomenon in which the accuracy of response varies with 
the time taken to produce it [11]. Our empirical findings 
provide some evidence that might enhance our 
understanding of the SAT and specify cases in which this relationship 
is less clear than previously thought. 


On the other hand, the psychology and education literature 
has long been interested in developing test designs that gen- 
erate fair results and thus studied examinees’ test-taking be- 
havior to investigate the effect of test designs. Among many 
domains, this literature also focuses on the effects of anxiety, 
motivation and test-taking strategies on performance when 
taking a test [19, 8, 3], finding that high achievers are more 
likely to engage in effective test-taking strategies compared 
to low achievers and identifying differences between genders 
in risk-taking behaviors and anxiety levels. Studies in this 
area identify risk-taking as an important strategy when tak- 
ing a test, in particular in multiple-choice tests under strict 
limits [3]. These guessing strategies are found to potentially 
lead to better results regardless of ability level and compared 
to students at the same ability level not using these types 
of strategies [8]. Our empirical findings provide some evi- 
dence also in this aspect, not finding a negative relationship 
between speeded behavior and response accuracy. 


3. METHODS 
3.1 Data 


We use data from Utrecht University that comprises all 
exams carried out using the online platform Remindo Toets¹ 
between 2015/01/01 and 2020/06/01. Given our goal of 
investigating the effect of item position, we select exams in 
which response randomization was applied (i.e., the position 
of the questions given to examinees changes from examinee 
to examinee). Therefore, the starting dataset of exams is 
filtered on the following conditions: (1) Duration of the exam: 
less than 240 minutes. (2) Number of examinees: at least 
100. (3) Number of items per examinee: at least 10. (4) 
Types of questions in the exam: “choice”, “inline-choice”, 
“order”, “match”. (5) Maximum response time: less than 
600 seconds, to reduce outliers in the dataset. (6) Finally, 
we only analyze exams in which the item order is fully 
randomized. After filtering, the dataset contains 599,519 item 
responses. In the final dataset, tests have on average 
204 students and 34 items, and the average duration 
is 106 minutes. 


¹Remindo Toets is a software product developed by Paragin, 
a Dutch education company, which provides educational 
institutions with a platform to create, administer, review and 
grade exams. 
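

The filtering conditions above can be sketched as a pair of predicates. This is only an illustration under stated assumptions: the `Exam` record and its field names are hypothetical, condition (4) is read as "every question type in the exam belongs to the allowed set", and condition (5) is read as a per-response cutoff. The thresholds themselves are the ones quoted in the text.

```python
from dataclasses import dataclass

# Thresholds quoted in the text; everything else here is hypothetical.
MAX_DURATION_MIN = 240
MIN_EXAMINEES = 100
MIN_ITEMS = 10
ALLOWED_TYPES = {"choice", "inline-choice", "order", "match"}
MAX_RESPONSE_TIME_SEC = 600


@dataclass
class Exam:
    """Hypothetical exam-level record; field names are illustrative."""
    duration_min: float
    n_examinees: int
    items_per_examinee: int
    question_types: set
    fully_randomized: bool


def exam_is_eligible(exam: Exam) -> bool:
    """Exam-level conditions (1)-(4) and (6)."""
    return (
        exam.duration_min < MAX_DURATION_MIN
        and exam.n_examinees >= MIN_EXAMINEES
        and exam.items_per_examinee >= MIN_ITEMS
        # Assumption: all question types must be in the allowed set.
        and exam.question_types <= ALLOWED_TYPES
        and exam.fully_randomized
    )


def response_is_kept(response_time_sec: float) -> bool:
    """Response-level condition (5): drop very long response times."""
    return response_time_sec < MAX_RESPONSE_TIME_SEC
```

An exam is retained only if `exam_is_eligible` holds, and within retained exams only responses passing `response_is_kept` enter the analysis.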


For each question, lecturers are provided with the so-called 
p-value, as a measure of the “difficulty” of the item, and 
estimates of two commonly used metrics, the item-test correlation 
(RIT) and the item-rest correlation (RIR) [20]. These 
variables are available along with item responses and response 
times. In the dataset at hand, the mean response time is 
91.18 seconds while the average accuracy rate is 63.19%. 


For privacy reasons, this dataset was anonymized. 
Because of this, we are only able to provide a general overview 
of the exams used in this analysis and not a complete overview 
of the underlying student population. The dataset at hand 
consists of 90 unique exams across 6 faculties within Utrecht 
University. In order, the largest faculties are Science, 
Veterinary Science and Social Sciences. Finally, as we selected 
courses with more than 100 examinees and due to the 
difference in the average class size between bachelor and 
master courses at Utrecht University, the vast majority of 
selected exams were from bachelor programs, which are 
typically taught in Dutch. The predominance of exams in 
Dutch (90%) indicates that the majority of students 
attending these courses are Dutch, implying that the 
results of these analyses might be culture-specific. 


3.2 Selected variables 


In the context of our analysis, we make use of the following 
variables: 


• Student: factor variable identifying each individual 
student. Total number of levels: 18,476. 


• Test: factor variable identifying each individual test. 
Total number of levels: 90. 


• Item: factor variable identifying each individual item. 
Total number of levels: 5,089. 


• Response time: continuous variable referring to the 
total time, in seconds, spent by an examinee on an 
item. The response time is the summed response time 
across all attempts made in answering that item. Clear 
extreme outliers in item response times were eliminated by 
setting a cutoff in the filtering process of the data. 


• Accuracy: binary variable referring to a right or wrong 
answer by an examinee on an item. 


Additionally, we create the following two variables: item 
position and available time per item. The first variable, item 
position, is used to identify the location in which the item 
appears within a test and it is divided into 10 blocks, each 
representing 10% of the exam. For each response of person $i$ to 
item $j$ in exam $k$, $z_{ijk} \in [0 : 9]$ denotes the block in which 
the item was presented to the person. We also create a set 
of dummy variables $z_{1ijk}, \ldots, z_{9ijk}$, where $z_{sijk} = 1$ if $z_{ijk} = s$ 
and $0$ otherwise, in order to model a nonlinear relationship between 
item position and exam performance. The second variable, available 
time per item, is created dividing the total allotted time for 
a specific exam by the number of items in that exam. This 
variable is created as the exams available in the dataset are 
heterogeneous and do not have a common time limit. As 
we cannot compare exams having different time limits, we 
create a variable that represents the time limit at an item 
level. Across all exams, the available time per item has a 
mean allotted time of almost 3 minutes (177 seconds). 
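

The construction of the two derived variables can be sketched as follows. This is a minimal illustration, not the authors' code: the function names are hypothetical, and the rule used to map a 1-based item position to one of the 10 blocks (integer division of the relative position) is an assumption, since the text only states that each block covers 10% of the exam.

```python
N_BLOCKS = 10


def position_block(position: int, n_items: int) -> int:
    """Map a 1-based item position to a block index z in 0..9.

    Assumption: block boundaries are obtained by integer division,
    so each block covers roughly 10% of the items in the exam.
    """
    if not 1 <= position <= n_items:
        raise ValueError("position must lie in 1..n_items")
    return (position - 1) * N_BLOCKS // n_items


def block_dummies(block: int) -> list:
    """Dummies z_1..z_9; block 0 is the baseline (all zeros)."""
    return [1 if block == s else 0 for s in range(1, N_BLOCKS)]


def available_time_per_item(total_time_sec: float, n_items: int) -> float:
    """Total allotted exam time divided by the number of items."""
    return total_time_sec / n_items
```

With this mapping, the first item of any exam falls in block 0 (the baseline) and the last item falls in block 9.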


3.3 Models 


Before discussing the results of our models, we make a brief 
note of the reason underlying their creation. A key necessity 
in our models is the ability to quantify the effect of item posi- 
tion on response time and on response accuracy. Therefore, 
to build models we turn to generalized linear mixed models 
(GLMMs). We choose GLMMs as we aim to develop mod- 
els that create a reliable and easily repeatable analysis to 
increase the reach and applicability of the model results to 
datasets from other educational institutions. 


In both models, to study the effects on response accuracy 
and response time, we consider three predictors. We allow 
for random variability across students, incorporating this 
effect as a random effect ($\theta_{1ik}$ for the effect on response 
accuracy, and $\theta_{2ik}$ for the effect on response time). We consider 
fixed item effects ($\beta_{1jk}$ and $\beta_{2jk}$ for the effects of item $j$ from 
exam $k$ on response accuracy and response time, 
respectively). Finally, for each of the item position dummy 
variables we consider fixed effects on response accuracy ($\gamma_{1sk}$) 
and on response time ($\gamma_{2sk}$), which are estimated for each 
exam separately. We include fixed item position effects only 
in the second variation of both models in order to enable us 
to evaluate whether their addition is significant. 


3.4 Modeling response time 
We construct the models concerning response time using 
the logarithm transformation of response time. For response 
time, we focus on the following linear mixed effect regression 
(LMER) models: 
$$y_{ijk} = \beta_{2jk} + \theta_{2ik}, \qquad \theta_{2ik} \sim N(0, \sigma^2_{2k}) \tag{1}$$

$$y_{ijk} = \beta_{2jk} + \sum_{s=1}^{9} \gamma_{2sk} z_{sijk} + \theta_{2ik}, \qquad \theta_{2ik} \sim N(0, \sigma^2_{2k}) \tag{2}$$

where $y_{ijk}$ is the log-transformed response time of person $i$ 
on item $j$ in exam $k$, and $\sigma^2_{2k}$ is the variance of the person 
random effect on response time in exam $k$. Model 1 does 
not contain the fixed effects of item position. 


3.5 Modeling response accuracy 

For response accuracy, we focus on the following generalized 
linear mixed effect regression on a binary variable (GLMER): 

$$\mathrm{logit}(x_{ijk}) = \beta_{1jk} + \theta_{1ik}, \qquad \theta_{1ik} \sim N(0, \sigma^2_{1k}) \tag{3}$$

where $x_{ijk}$ denotes response accuracy of person $i$ on item 
$j$ in exam $k$, and $\sigma^2_{1k}$ is the variance of the random effect. 
The model is extended with the effect of the item position 
dummy variables: 

$$\mathrm{logit}(x_{ijk}) = \beta_{1jk} + \sum_{s=1}^{9} \gamma_{1sk} z_{sijk} + \theta_{1ik}, \qquad \theta_{1ik} \sim N(0, \sigma^2_{1k}) \tag{4}$$


4. RESULTS 


We first analyze the results of the response time models from 
equations 1 and 2, before moving to the results of the 
response accuracy models from equations 3 and 4. Because of 
limitations due to the size of the data at hand, in particular due 
to the number of fixed effects, all models are fit on each exam 
individually and their effects are shown as the gray lines in 
Figure 1 and Figure 2. After fitting each individual model, 
the item position effect estimates are pooled together with 
a random-effects meta-analysis in which the exam-specific 
effects are assumed to come from a distribution with a 
common mean and variance [4]. The mean of the effect and the 
boundaries at ±1.96 times the standard deviation of the effect 
are shown as the blue and light blue lines in Figure 1 and 
Figure 2. The estimates of each of the item position effects 
of the pooled model of LMER 2 and GLMER 4 are given 
in Table 1 and Table 2 in the Appendix. 
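

The pooling step can be illustrated with a small sketch. The text does not state which estimator was used for the random-effects meta-analysis, so the DerSimonian-Laird estimator below is an assumption chosen because it is a common default; the function names and inputs (per-exam estimates of one item position effect and their squared standard errors) are hypothetical.

```python
import math


def pool_random_effects(estimates, variances):
    """DerSimonian-Laird random-effects pooling (assumed estimator).

    estimates: per-exam estimates of one item position effect
    variances: squared standard errors of those estimates
    Returns (pooled mean, between-exam variance tau^2, SE of the mean).
    """
    k = len(estimates)
    if k < 2 or k != len(variances):
        raise ValueError("need at least two estimates with matching variances")
    w = [1.0 / v for v in variances]
    sw = sum(w)
    fixed_mean = sum(wi * y for wi, y in zip(w, estimates)) / sw
    # Cochran's Q: heterogeneity around the fixed-effect mean.
    q = sum(wi * (y - fixed_mean) ** 2 for wi, y in zip(w, estimates))
    c = sw - sum(wi ** 2 for wi in w) / sw
    tau2 = max(0.0, (q - (k - 1)) / c)
    # Re-weight with the between-exam variance added to each variance.
    w_star = [1.0 / (v + tau2) for v in variances]
    sw_star = sum(w_star)
    mean = sum(wi * y for wi, y in zip(w_star, estimates)) / sw_star
    return mean, tau2, math.sqrt(1.0 / sw_star)


def effect_band(mean, tau2, z=1.96):
    """Mean +/- z times the standard deviation of the exam-specific effects."""
    sd = math.sqrt(tau2)
    return mean - z * sd, mean + z * sd
```

Under this reading, the blue line in the figures corresponds to `mean` for each block and the light blue lines to the two values returned by `effect_band`.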


4.1 LMER on log response time 

Concerning LMER models on response time, we see that 
across exams the ANOVAs between model 2 and model 1 are 
significant in 79 out of the 90 cases: 72 exams 
at the 0.1% significance level, 5 at the 1% significance level 
and 2 at the 5% significance level. The average additional 
proportion of explained variance of model 2 over model 1 is 
0.0084. 
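

Assuming the ANOVA comparison above is the likelihood-ratio test referred to in Section 4.3, the per-exam test can be written as follows. The symbols $\ell_{1k}$ and $\ell_{2k}$ (maximized log-likelihoods of models 1 and 2 for exam $k$) and $\Lambda_k$ are notation introduced here for illustration and do not appear in the original text.

```latex
\Lambda_k = 2\,(\ell_{2k} - \ell_{1k}), \qquad
\Lambda_k \;\overset{H_0}{\sim}\; \chi^2_{9} \ \text{(asymptotically)}, \qquad
H_0:\ \gamma_{21k} = \gamma_{22k} = \cdots = \gamma_{29k} = 0
```

The 9 degrees of freedom correspond to the nine item position dummy effects that model 2 adds to model 1.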


Analyzing response time behavior and the effect of item po- 
sition, we observe a significant negative effect of item posi- 
tion on response time. In particular, we observe an increase 
in the effect over the course of the exam, which is typically 
defined as response acceleration. This can be seen in Figure 
1 as each item position relates to the effect size compared 
to the baseline of item position 0, which represents the first 
10% of the exam. The blue line represents the pooled esti- 
mates of the item position effects across all exams, while the 
gray lines represent individual exams. 


4.2 GLMER on response accuracy 

With respect to GLMER models on response accuracy, we 
see that model 4 does not significantly outperform model 3. 
The ANOVAs between model 4 and model 3 are only 
significant 5 times at the 1% significance level and 11 times at the 5% 
significance level. This shows that the response acceleration 
found previously and shown in Figure 1 either is not strong 
enough to influence response accuracy or does not have any 
influence on it. Hypotheses on the reasons underlying this 
finding are discussed in Section 5. 


4.3 Proportion of explained variance and available time per item 
Figure 1: Item position effects on response time 
Figure 2: Item position effects on accuracy 


After running a likelihood-ratio test on each exam 
individually, we select model 2 as significantly better than model 
1 and further investigate the additional proportion of 


explained variance of model 2 by regressing it on the available 
time per item for each exam. Figure 3 shows that there 
is a relationship between the available time per item in an 
exam and the additional explained variance from the model 
with item position effects (linear correlation -0.37). This is 
an indicator that on exams that have less time available, 
response acceleration is indeed happening and the inclusion 
of a variable that takes into account the position of the item 
helps us better explain the behavior of students. However, 
as is visible in Figure 3, the additional explained variance 
of model 2 is relatively low. This result is important as it 
provides us with a tool to support the results of response 
acceleration. 
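

As a small illustration of the linear correlation reported above, the sketch below computes a Pearson coefficient between available time per item and additional explained variance. The function name and the four data points are hypothetical and made up for illustration; they are not values from the dataset.

```python
import math


def pearson_r(xs, ys):
    """Pearson linear correlation between two equal-length sequences."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    if sxx == 0 or syy == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return sxy / math.sqrt(sxx * syy)


# Made-up illustration values, NOT taken from the paper's dataset.
time_per_item_sec = [90.0, 120.0, 180.0, 240.0]
extra_explained_variance = [0.020, 0.012, 0.009, 0.003]
r = pearson_r(time_per_item_sec, extra_explained_variance)
```

A negative `r`, as in this toy example, means that exams with less time per item tend to show a larger gain in explained variance from the item position effects.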


5. DISCUSSION 


5.1 Discussion of findings 

Using a collection of item responses and response times from 
higher education tests during a five-year time span, we 
analyze the relationship between item position and response 
time and between item position and response accuracy. We 
show that item position is associated with response 
acceleration while we find that the connection between item position 
and response accuracy is unclear. Finally, we also find that 
the available time per item is negatively correlated with the 
additional explained variance when comparing the models 
relating item position and response time. 


Figure 3: Regression between mean duration of item and 
additional explained variance for model 2 


Concerning the significant effect of item position on 
response times, we see that the effect is not as strong as we 
initially expected. The presence of this effect might stem 
from increased respondent fatigue or decreased interest 
in the exam, as highlighted by previous work on this topic 
[22, 9]. However, previous literature focuses on adaptive 
tests in which items become harder or easier in relation 
to the respondent’s performance, while in our dataset this is 
not the case. Comparing the differences between these two 
modalities of testing and understanding the difference in 
response behavior might be an attractive opportunity for 
future research. 


With regard to the interaction between response accuracy 
and item position, literature tends to show that later items 
are more difficult [14, 22] and therefore might decrease the 
rate of correct responses. We find no significant relationship 
between response accuracy and item position. The lack of 
this relationship might be explained by a few hypotheses. 
First, we might hypothesize that the response acceleration 
shown is the representation of students reaching their 
natural speed on the exam. Students might take some time 
to enter the mental state that allows them to take an 
exam and they might be initially slowed down by the need 
to understand how the exam is structured. This might 
explain the increased acceleration of response at the same 
level of accuracy between earlier and later items in the test. 
Secondly, we might hypothesize that there is a negative 
effect of response acceleration on response accuracy, but it is 
not shown because of the relatively heterogeneous dataset 
of exams at hand and because of the need to access more 
data. Concerning the second hypothesis, more work on a 
larger dataset of similar exams is needed before drawing any 
conclusion. Finally, the lack of this relationship might also 
be explained by the presence of increased response speed 
within effective test-taking strategies among high achievers, 
as found by previous qualitative literature on this topic [18, 
12]. This result might be caused by an increased 
willingness to guess using effective elimination processes, leading to 
an effective guessing strategy on high-stakes multiple choice 
tests, when compared to the choice of picking an answer 
option at random. 


Finally, we demonstrate that there is a relationship between 
the additional explained variance and the available time per 
item. When the model with item position effects is sig- 
nificantly more informative than the model without these 
effects, this correlation might also provide backing to a po- 
tential quality indicator to be provided to lecturers and test 
creators to inform them about their tests. In the presence of 
high additional explained variance, an indicator that would 
take this relationship into account might be used to provide 
lecturers with information about the quality of their tests 
and to identify exam-specific factors that influence speeded- 
ness and the test-taking strategies of examinees. 


5.2 Limitations 

When carrying out the modeling part of our research, due 
to the size of data and factors at hand, we realized that the 
current statistical methods available to analyze this quantity 
of data create computational problems. As a matter of fact, 
we were interested in analyzing the effect of item position 
on response time and accuracy across the entire dataset but, 
due to computational limitations, we decided to fit the mod- 
els on individual exams and later pool the effects estimates 
using a meta-analysis. To avoid this obstacle, two parallel
paths might be taken: (1) extending the current statistical
libraries to include the possibility of using sparse matrices 
in computing fixed effects estimates and (2) adding more 
computational power to the tools used in the analysis. 
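
The per-exam fitting followed by pooling can be sketched as
follows. This is a minimal illustration that assumes fixed-effect
inverse-variance weighting, a standard pooling scheme. The class
and method names and the input numbers are hypothetical, and the
pooling actually used in this study may differ (for example, a
random-effects model).

```java
final class PoolingSketch {

    // Inverse-variance weighted mean of per-exam effect estimates
    // (fixed-effect pooling; an assumption made for illustration).
    static double pooledEstimate(double[] estimates, double[] stdErrors) {
        double weightedSum = 0.0;
        double weightSum = 0.0;
        for (int k = 0; k < estimates.length; k++) {
            double weight = 1.0 / (stdErrors[k] * stdErrors[k]);
            weightedSum += weight * estimates[k];
            weightSum += weight;
        }
        return weightedSum / weightSum;
    }

    // Standard error of the pooled estimate: sqrt(1 / sum of weights).
    static double pooledStdError(double[] stdErrors) {
        double weightSum = 0.0;
        for (double se : stdErrors) {
            weightSum += 1.0 / (se * se);
        }
        return Math.sqrt(1.0 / weightSum);
    }

    public static void main(String[] args) {
        // Hypothetical per-exam estimates of one item-position effect.
        double[] estimates = {-0.05, -0.07};
        double[] stdErrors = {0.01, 0.01};
        System.out.println(pooledEstimate(estimates, stdErrors));
        System.out.println(pooledStdError(stdErrors));
    }
}
```

Because each exam contributes only an estimate and a standard
error, the pooling step itself needs almost no memory, which is
what makes the per-exam route feasible on large datasets.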


Further, we also need to take into consideration both the 
dataset used in this research and the filtering actions taken 
on it. First, the dataset stems from a higher education
institution (wetenschappelijk onderwijs) and therefore the results
of our analyses might be dependent on the educational level
of the student population. Secondly, because of the
filtering actions carried out on the dataset (2), we can assume
that Dutch students are more represented in the dataset 
than international students. This might imply that the re- 
sults stemming from our analyses are highly dependent on 
the Dutch test-taking “culture”.


An important distinction between our work and previous 
studies on speededness, such as [9], is that we attempted to 
remove the effects of very long answers, but not of answers 
given during rapid guessing behavior [15]. As a matter of
fact, the validity of the scores of such tests is threatened by
what [23] call noneffort, which is associated with the
guessing behavior of an examinee who does not try to solve items.
The effect of such a mechanism is an underestimation of the
examinee's actual level of proficiency, threatening the validity
of test scores by adding a source of construct-irrelevant
variance [23]. An analysis of the presence of noneffort in this
dataset, and of methods to identify it, might reveal pathways
to either take it into account or eliminate it from the dataset,
paving the way for an analysis that compares the estimates
of ability accounting only for truthful response acceleration.


5.3 Future research 

Expanding dataset. As noted in section 5.2, we believe that
expanding the dataset at hand, by including the most recent data
and other universities, would help provide more
information on the relationship between response time and
accuracy and thus support the creation of more accurate
indicators for lecturers. Moreover, as there might be differences in
the “culture” of examination between countries and between
educational levels, a larger dataset comprising multiple
countries and different educational levels might help clarify
these questions.


Creating a metric to provide recommendations. As we did 
not see a clear effect of item position on accuracy, a future 
path for research might assume that, in the presence of sim- 
ilar results, students are given too few items for the time 
allotted on a specific exam. This would open a path for 
creating experiments to increase the total number of items 
on an exam and analyze the cutoff value at which effects 
on accuracy start appearing. An experiment increasing the
number of test items would not only clarify the values at
which effects appear, but would also improve domain
coverage for that specific exam, thereby improving its reliability
metric, and would allow us to study changes in test-taking
strategies within the test.


6. CONCLUSION 


This research evaluates methods to investigate the effects 
of item position on response time and accuracy. We find 
that, thanks to the advancement in the technologies used in 
exam settings and the wide application of these technologies 
in high-stakes tests, these analyses are not only feasible but 
are also promising if applied on larger datasets. We believe 
the overall results of the models can be of use within the 
educational sector, in particular thanks to the creation of an 
additional and reliable indicator of the quality of an exam. 
Using the results presented in section 4, we can now answer 
our research questions: 


1. We find a small but significant effect of item position 
on response time, while we fail to find any effect of item 
position on accuracy. Section 5 provides a discussion 
on the findings. 


2. We believe that our results, stemming from modeling item
position and response time and from the subsequent
regression of the additional explained variance of this model
and the available time per item, could evolve into a
practical indicator providing more information
concerning the quality of a test and the test-taking
behavior of students.


7. REFERENCES 

[1] A. P. Association. Standards for educational and
psychological testing, 2014.

[2] Y. Attali. Reliability of speeded number-right
multiple-choice tests. Applied Psychological
Measurement, 29(5):357-368, 2005.

[3] K. Baldiga. Gender differences in willingness to guess.
Management Science, 60(2):434-448, 2014.

[4] M. Borenstein, L. V. Hedges, J. P. T. Higgins, and
H. R. Rothstein. Introduction to Meta-Analysis,
volume 8. John Wiley & Sons, Bridgewater, NJ, USA,
Jan. 2009.

[5] L. J. Cronbach. Coefficient alpha and the internal
structure of tests. Psychometrika, 16(3):297-334, 1951.

[6] P. De Boeck and M. Jeon. An overview of models for
response times and processes in cognitive tests.
Frontiers in Psychology, 10:102, 2019.

[7] D. Debeer and R. Janssen. Modeling item-position
effects within an IRT framework. Journal of
Educational Measurement, 50(2):164-185, 2013.

[8] H. Dodeen. Assessing test-taking strategies of
university students: developing a scale and estimating
its psychometric indices. Assessment & Evaluation in
Higher Education, 33(4):409-419, 2008.

[9] B. Domingue, K. Kanopka, B. Stenhaug, J. Soland,
M. Kuhfeld, S. Wise, and C. Piech. Interplay between
speed and accuracy: Novel empirical insights based on
1/4 billion item responses, Mar. 2020.

[10] A. P. Ellis and A. M. Ryan. Race and cognitive-ability
test performance: The mediating effects of test
preparation, test-taking strategy use and self-efficacy.
Journal of Applied Social Psychology,
33(12):2607-2629, 2003.

[11] R. P. Heitz. The speed-accuracy tradeoff: history,
physiology, methodology, and behavior. Frontiers in
Neuroscience, 8:150, 2014.

[12] E. Hong, M. Sas, and J. C. Sas. Test-taking strategies
of high and low mathematics achievers. The Journal of
Educational Research, 99(3):144-155, 2006.

[13] Y. Lu and S. G. Sireci. Validity issues in test
speededness. Educational Measurement: Issues and
Practice, 26(4):29-37, Nov. 2007.

[14] G. Nagy, B. Nagengast, M. Becker, N. Rose, and
A. Frey. Item position effects in a reading
comprehension test: an IRT study of individual
differences and individual correlates. Psychological
Test and Assessment Modeling, 60(2):165-187, 2018.

[15] T. C. Oshima. The effect of speededness on parameter
estimation in item response theory. Journal of
Educational Measurement, 31(3):200-219, 1994.

[16] D. L. Schnipke and D. J. Scrams. Modeling item
response times with a two-state mixture model: A
new method of measuring speededness. Journal of
Educational Measurement, 34(3):213-232, 1997.

[17] D. L. Schnipke and D. J. Scrams. Exploring issues of
examinee behavior: Insights gained from response-time
analyses. Computer-based testing: Building the
foundation for future assessments, 34:237-266, 2002.

[18] T. Stenlund, H. Eklöf, and P.-E. Lyrén. Group
differences in test-taking behaviour: An example from
a high-stakes testing program. Assessment in
Education: Principles, Policy & Practice, 24(1):4-20,
2017.

[19] T. Stenlund, P.-E. Lyrén, and H. Eklöf. The successful
test taker: exploring test-taking behavior profiles
through cluster analysis. European Journal of
Psychology of Education, 33(2):403-417, 2018.

[20] U. University. Toetsanalyse in remindo v2.1.
https://remindo-support.sites.uu.nl/wp-content/uploads/sites/79/2019/11/Toetsanalyse-in-REMINDO-v2.1-003.pdf,
Nov. 2019. Accessed: 2020-10-30.

[21] P. W. van Rijn and U. S. Ali. A generalized
speed-accuracy response model for dichotomous
items. Psychometrika, 83(1):109-131, 2018.

[22] S. Weirich, M. Hecht, C. Penk, A. Roppelt, and
K. Böhme. Item position effects are moderated by
changes in test-taking effort. Applied Psychological
Measurement, 41(2):115-129, 2017.

Proceedings of The 14th International Conference on Educational Data Mining (EDM 2021) 459 


[23] S. Wise and C. DeMars. Examinee noneffort and the
validity of program assessment results. Educational
Assessment, 15(1):27-41, 2010.


APPENDIX 
A. ESTIMATES OF POOLED MODELS 


Table 1: Coefficient estimates of pooled model LMER 2

Item Position | Estimate | Std. Error | Estimate Rand. | Std. Error Rand. | tau
1             | -0.0539  | 0.0036     | -0.0628        | -0.0628          | 0.0658
2             | -0.0625  | 0.0035     | -0.0679        | -0.0679          | 0.0621
3             | -0.0775  | 0.0036     | -0.0775        | -0.0775          | 0.0619
4             | -0.0949  | 0.0035     | -0.0983        | -0.0983          | 0.0655
5             | -0.1169  | 0.0035     | -0.1203        | -0.1203          | 0.0663
6             | -0.1309  | 0.0035     | -0.1343        | -0.1343          | 0.0729
7             | -0.1547  | 0.0036     | -0.1605        | -0.1605          | 0.0794
8             | -0.1750  | 0.0035     | -0.1854        | -0.1854          | 0.0835
9             | -0.1879  | 0.0034     | -0.1989        | -0.1989          | 0.1062


Table 2: Coefficient estimates of pooled model GLMER 4

Item Position | Estimate | Std. Error | Estimate Rand. | Std. Error Rand. | tau
1             | -0.0412  | 0.0156     | -0.0393        | -0.0393          | 0.1148
2             | -0.0412  | 0.0151     | -0.0480        | -0.0480          | 0.1193
3             | -0.0163  | 0.0155     | -0.0185        | -0.0185          | 0.1279
4             | -0.0283  | 0.0152     | -0.0308        | -0.0308          | 0.1127
5             | -0.0264  | 0.0154     | -0.0335        | -0.0335          | 0.1161
6             | -0.0106  | 0.0152     | -0.0144        | -0.0144          | 0.1046
7             |  0.0007  | 0.0154     | -0.0009        | -0.0009          | 0.1203
8             |  0.0141  | 0.0152     |  0.0141        |  0.0141          | 0.1282
9             |  0.0157  | 0.0147     |  0.0103        |  0.0103          | 0.1268
ABSTRACT


Various similarity measures for source code have been
proposed; many rely on edit or tree distance. To support a
lecturer in quickly assessing live or online exercises with
respect to the approaches taken by the students, we compare
source code on a more abstract, semantic level. Even if
novice students' solutions follow the same idea, their code
length may vary considerably, which greatly misleads edit
and tree distance approaches. We propose an alternative
similarity measure based on variable usage paths (VUP),
that is, we use the way variables are used in the code
to assess code similarity. The final stage of the measure
involves a matching of variables in functions based on
how each variable is used by the instructions. A preliminary
evaluation on real data is presented.


Keywords 


source code distance, semantic analysis, variable usage paths 


1. INTRODUCTION 


Learning a programming language requires a lot of practice.
Students gain knowledge and experience from both
homework and in-classroom exercises. It is, however, very
important to give students feedback, e.g. to discuss different
solution approaches with them. Both those who have succeeded
and those who have not yet succeeded will benefit from such
discussions: they either widen the students' view (because
they may not have thought about alternative approaches yet)
or encourage them to try the exercise once more at a later
point in time (once they have got a glimpse of how to solve it).
Sharing different solution approaches and discussing their
pros and cons improves the students' algorithmic thinking
skills. However, it is quite common in many programming
courses that the only (automatic) feedback consists of the
number of passed or failed unit tests. Achieving positive
feedback then requires an already well-developed solution,
and no credit is given for, say, getting the code structure
right, which may frustrate novices noticeably.
We thus address the following research question: Can we
support the lecturer in providing meaningful feedback about
the solution approaches taken by the students? This requires
automation, as a lecturer does not have the time to review
all solutions manually, especially not in online teaching
situations. Providing meaningful feedback requires some kind of
insight into the solution approach a particular student was
following (semantics), how common the approach is, how many
different approaches have been followed in the course, etc.
We intend to achieve this by providing a similarity measure
for source code that does not focus on results (as unit tests
do) but reaches a higher semantic level by assessing the
more abstract code structure: different solution approaches
manifest themselves in different code structures.


Such a measure would be useful in many ways. For in- 
stance, during an in-classroom teaching situation it would 
enable the lecturer to pick two (or more) submissions fol- 
lowing different approaches to start a discussion about their 
pros and cons. It could also support a lecturer in browsing 
the spectrum of approaches that were followed by the course 
members (how many different approaches were chosen how 
often) without having to check every solution manually. It 
may enable the lecturer to pick a solution that follows the 
same approach as a solution that has been discussed already, 
but did not pass all unit tests, thereby representing a live 
challenge to the course members (“spot the error”). Note 
that we are not aiming at grading the source with respect 
to its correctness, but leave this to the unit tests. As tests 
do not provide any useful feedback to students that do not 
yet have an appropriate code structure, the desired measure 
may close this gap, as it would allow us to differentiate code 
submissions that are far from working from those that follow
a reasonable solution approach and get the code structure
right, with only tiny details preventing them from passing
the tests. Such a more detailed inspection would enable
much more sensitive (automatic) feedback.


2. RELATED WORK 


Similarity or distance measures for source code have been 
investigated for a long time. Many approaches have in com- 
mon that they start from an abstract syntax tree (AST) that 
represents the code. Code is then compared by calculating 
some kind of tree distance on the corresponding ASTs: the 
minimal number of steps (node deletion, insertion, and re- 
labelling) to transform one AST to the other. A survey on 
tree edit distance can be found in [2]. A tree, however, has
no notion of variable identity: the very same variable
occurs over and over again as a new node in an AST.
To reflect which variable change affects other code, program
dependency graphs (PDG) may be used. But comparing
graphs is even more complicated than comparing trees; a
survey on graph edit distance is given in [6]. To simplify the
comparison, the AST may also be linearized such that much
simpler string comparison measures (edit distance) may be
applied (cf. [5, 7, 9]).


An extremely fast approach is to define a hash function on
(possibly pre-processed) trees such that a similarity is
detected when hash codes are identical [1]. The flexibility of
such an approach is, however, very limited, as the similarity
measure is a binary one. It may nevertheless be useful for
clone detection (plagiarism) or may be enhanced by calcu- 
lating multiple hashes [4]. Tree edit distance has been used 
in, e.g., [10] to detect similar source codes. In [3] a distance 
measure for source code has been proposed, that transforms 
the AST into a sequence of tokens on which Levenshtein or 
edit distance is applied. 
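
To make the token-based comparison concrete, the following
minimal sketch computes the Levenshtein distance between two
token sequences. It is a textbook dynamic-programming version,
not the implementation of [3]; the class name and the token
labels used in the example are hypothetical.

```java
import java.util.Arrays;
import java.util.List;

final class TokenEditDistanceSketch {

    // Standard Levenshtein distance (insert, delete, substitute, cost 1 each)
    // over token lists, using two rolling rows of the DP table.
    static int levenshtein(List<String> s, List<String> t) {
        int[] prev = new int[t.size() + 1];
        int[] curr = new int[t.size() + 1];
        for (int j = 0; j <= t.size(); j++) {
            prev[j] = j;
        }
        for (int i = 1; i <= s.size(); i++) {
            curr[0] = i;
            for (int j = 1; j <= t.size(); j++) {
                int substitute = prev[j - 1] + (s.get(i - 1).equals(t.get(j - 1)) ? 0 : 1);
                curr[j] = Math.min(substitute, Math.min(prev[j] + 1, curr[j - 1] + 1));
            }
            int[] tmp = prev;
            prev = curr;
            curr = tmp;
        }
        return prev[t.size()];
    }

    public static void main(String[] args) {
        // Hypothetical token sequences: one loop versus two loops.
        List<String> oneLoop = Arrays.asList("loop", "if", "assign", "return");
        List<String> twoLoops = Arrays.asList("loop", "if", "assign", "loop", "if", "assign", "return");
        System.out.println(levenshtein(oneLoop, twoLoops)); // prints 3
    }
}
```

The example shows how a second loop with the same body already
costs three edit operations, although the idea behind the code
has not changed.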


In many related papers the goal is to compare (student’s) 
code to a (lecturer’s) reference code; the higher the simi- 
larity, the closer is the student’s submission to the correct 
solution. In this work, however, we want to identify source 
codes that are semantically similar, that is, they follow the 
same solution approach. Especially when novices start to 
code, their solutions are often more lengthy and redundant 
than those of an experienced programmer. This affects mea- 
sures such as edit distance or tree distance dramatically. The 
approach in [3] did not work out as well as expected, and
the authors hypothesized that one of the reasons is that the
length of the code dramatically influences the edit distance.
While longer code may be less readable and more redundant,
it is neither more likely to be wrong nor does it necessarily
follow a different solution approach. As an example, consider codes
(a) and (b) from Fig. 1. The task was to calculate the mean 
of all positive array elements. Both codes follow the same 
approach, but (b) uses a single loop instead of two and is 
thus more compact. But semantically, we consider both so- 
lutions as being identical. We are not aware of any source 
code similarity measure that tries to reflect that. 


It is also common in the literature to replace variable names 
by a constant string (possibly depending on the variable’s 
type), which eases matching sources that use different vari- 
able names. In plagiarism detection (see [5, 9] and references 
therein) this is considered as a countermeasure against dis- 
guising plagiarism by variable renaming. However, the way 
a single variable is used throughout the code is crucial for 
a solution approach. Fig. 1(d) utilizes the same statements 
(e.g., array-access in conditional statement of a for-loop), 
but does not calculate anything meaningful and thus does 
not represent the same solution approach as codes (a-c). 


3. COMPARING SOURCE CODE BY 
VARIABLE USAGE PATHS 


We argue that two solutions follow the same approach if they 
use variables in the same way. In the subsequent sections 
we show how we grasp variable usage in the code and then 
define similarity measures on this representation. 


public double avg(double a[]) {
  if (a==null) return NaN;
  int n = 0;
  for (int j=0;j<a.length;++j) {
    if (a[j]>0) ++n;
  }
  double sum = 0;
  for (int j=0;j<a.length;++j) {
    if (a[j]>0) sum+=a[j];
  }
  return sum/n;
} // (a) [TwoLoopQuickExit]


public double avg(double a[]) {
  if (a==null) return NaN;
  int n = 0;
  double sum = 0;
  for (int i=0;i<a.length;++i) {
    if (a[i]>0) {
      ++n;
      sum+=a[i];
    }
  }
  return sum/n;
} // (b) [OneLoopQuickExit]


public double avg(double a[]) {
  if (a!=null) {
    int n = 0;
    double sum = 0;
    for (int i=0;i<a.length;++i) {
      if (a[i]>0) {
        ++n;
        sum+=a[i];
      }
    }
    return sum/n;
  } else {
    return NaN;
  }
} // (c) [OneLoopGuardedIf]


public double avg(double a[]) {
  if (a==null)
    return NaN;
  int n = 0;
  double sum = 0;
  for (int i=0;i<a.length;++n) {
    if (a[i]>0) {
      ++i;
      a[i]+=sum;
    }
  }
  return sum;
} // (d) [Disordered]


Figure 1: Responses to a simple programming exercise:
Write a function avg that yields the mean of all
positive values in the array. Code (a) uses two
loops, while code (b) uses only one. While the main code
is organized after an if-statement in (a) and (b), it is
embedded in the conditional statement in (c). Code
(d) uses similar instructions, but the variable usage
is mixed up and the code does not solve the problem.


3.1 Variable Usage Paths 


Solving a programming exercise requires combining
programming instructions such that a handful of variables
jointly build up the final result (and return it to the caller).
The key to a solution is thus the variables and how they
are embedded in the (possibly nested) instructions. Code
analysis often starts with an abstract syntax tree (AST); 
there, every variable usage is represented by a new node in 
the tree. This occludes important information: Where in 
the source code is the same variable used? We therefore 
rearrange the AST to a graph, where a node is unique for 
each variable and subsequent usages of the variable link to 
the same node. Fig. 2 shows an example for the code of 
Fig. 1(b). All variables are marked in red color. From the 
paths between the variable sum and function avg, we can 
see that the variable sum is declared, occurs in a conditional 
statement that is embedded in a loop, and finally occurs in 
the return value. We consider these paths (shown in blue in 
Fig. 2) as a kind of fingerprint for the role of this variable: 
code following the same approach requires variables with the 
same roles. 


We thus do not operate directly on the graph, but on paths
from variable nodes to the enclosing function node. After
applying some transformations to simplify the graph somewhat
(e.g. removing body or replacing for, while, etc. by a
subsuming label loop), we end up with string representations
of the three blue paths like sum/expression/return/avg,
sum/declaration/declare/avg and sum/assignment/if/loop/avg.


Figure 2: Graph representation of code in Fig. 1(b) (red: variable nodes; blue: VUPs of variable 'sum').


Definition 1. Let $I$ be a set of code instruction labels
(such as if, loop, ...). For a given source code $C$ (single
class), let $V_C$ be a set of variable identifiers¹ and $F_C$ the set
of methods declared in $C$. A path
$p = (v, i_1, \ldots, i_n, f) \in V_C \times I^* \times F_C =: \mathcal{P}$
reflects that a variable $v$ is used by
instruction $i_1$, which is itself used by instruction $i_2$, etc.,
in function $f$. The VUP representation (variable usage
path) of code $C$ is a set $P_C \subseteq \mathcal{P}$ of all paths occurring in $C$.
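
A path in the sense of Definition 1 can be held as a plain
string. The sketch below is only an illustration: the class
name, the method names, and the exact set of raw labels that
are dropped or merged are assumptions (the text above only
names body, for and while as examples).

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

final class VupPathSketch {

    // Hypothetical label normalization: loop constructs collapse to "loop",
    // pure grouping nodes such as "body" are dropped (signalled by null).
    static String normalize(String label) {
        switch (label) {
            case "for":
            case "while":
                return "loop";
            case "body":
                return null;
            default:
                return label;
        }
    }

    // Builds the string form v/i1/.../in/f of a path (v, i1, ..., in, f).
    static String path(String variable, List<String> instructions, String function) {
        List<String> parts = new ArrayList<>();
        parts.add(variable);
        for (String raw : instructions) {
            String label = normalize(raw);
            if (label != null) {
                parts.add(label);
            }
        }
        parts.add(function);
        return String.join("/", parts);
    }

    public static void main(String[] args) {
        // Usage of sum inside "if" inside a "for" loop of function avg.
        List<String> raw = Arrays.asList("assignment", "if", "body", "for", "body");
        System.out.println(path("sum", raw, "avg")); // prints sum/assignment/if/loop/avg
    }
}
```

The VUP representation of a whole source file is then simply
the set of all such strings.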


We thus intentionally drop many details from the original
source code, e.g. numerical constants or full expressions.
From the fact that two codes are structurally identical
(same VUP representation) we cannot conclude anything
about the number of unit tests both codes may pass. But
as we have mentioned earlier, we seek a common code
skeleton, which may indicate that the programmers were
guided by the same underlying idea. The skeleton includes
the information which variables need to be used in which
instructions, but there are many different ways of coding
expressions equivalently, so we simply stick with unit tests
to check their correctness.


3.2 Simplified Set Similarity 


Consider the two source codes (a) and (b) from Fig. 1 and the
corresponding VUP representations P, Q. Code (a) determines
the number of elements first before it sums the relevant
values, while code (b) does both in a single loop. While the
choice of one or two loops may affect the efficiency, they are
semantically equivalent and thus similar. At first glance the
variable usage paths (of, say, the loop counter) in all loops
seem identical such that a set comparison (where duplicate
elements are not counted) appears just right. However, the
loop counter is named j in code (a) and i in code (b).
Furthermore, (a) defines two variables with the same name j
(one for each loop). Let us ignore these details by
introducing a simplification (and revisit the problem later).


¹ Note that the variable names themselves are not valid
identifiers, as the same variable name may occur more than once
in the same function, e.g. variable j in code (a) of Fig. 1.


Definition 2. From a VUP representation $P$ we obtain an
SVUP representation (simplified VUP) $P'$
by replacing all variable identifiers and all method identifiers
by a constant identifier vn (generic variable name) and fn
(generic function name), resp. ($V = \{vn\}$, $F = \{fn\}$).


Standard methods to measure set similarity may then be
used to compare source code. For given SVUPs $P$ and $Q$ we
use the $F_1$ measure²:

$$F_1 = 2 \cdot \frac{p \cdot r}{p + r} \quad \text{where} \quad p = \frac{|P \cap Q|}{|P|}, \quad r = \frac{|P \cap Q|}{|Q|}$$


² Although based on asymmetric precision $p$ and recall $r$, $F_1$
itself is symmetric and thus a similarity measure.


Figure 3: Dendrogram for source codes of Fig. 1
based on $F_1$-measure on VUP representation.


Table 1: Example of comparing P and Q (see text).

P  1  vn/expr/if/loop/fn      | d        m(d)
   2  vn/assign/if/loop/fn    | if       (1,6), (2,7)
   3  vn/declare/fn           | assign   (3,8)
   4  vn/assign/fn            | declare  (4,8)
   5  vn/return/fn            |
Q  6  vn/expr/loop/fn         | step  unassigned / assigned
   7  vn/assign/loop/fn       | 0     1,2,3,4,6,7,8 / 5,9
   8  vn/assign/declare/fn    | 1     3,4,8 / 1,2,5,6,7,9
   9  vn/return/fn            | 2     4 / 1,2,3,5,6,7,8,9


Fig. 3 shows a dendrogram for an average-linkage clustering
using this measure (actually $1 - F_1$, to obtain a distance
measure). Codes (a)-(d) correspond to TwoLoopQuickExit,
OneLoopQuickExit, OneLoopGuardedIf and Disordered.
Additional examples include: Nonsense (similar instruction
set, but variables are mixed randomly),
OneLoopQuickExitTempVar (the mean value is assigned to a
temporary variable before it is returned) and
OneLoopGuardedElse (same as OneLoopGuardedIf (Fig. 1(c)) but
with inverted if-condition).
From the dendrogram we learn that codes TwoLoopQuickExit
and OneLoopQuickExit become identical as desired. But
we can also see that code OneLoopGuardedIf, where a
conditional statement encloses the main code, is recognized as
very dissimilar. The additional if-statement occurs in all
paths and the simple Jaccard similarity finds this code to
be completely different. We address this problem next.
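
The simplification of Definition 2 and the set-based $F_1$
score of this section can be sketched in a few lines. This is
an illustration under stated assumptions: paths are strings of
the form v/i1/.../in/f, and the class name, method names and
example paths are hypothetical.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;

final class SvupF1Sketch {

    // Definition 2: replace the variable by "vn" and the function by "fn".
    // Assumes a well-formed path with at least one '/'.
    static String simplify(String path) {
        int first = path.indexOf('/');
        int last = path.lastIndexOf('/');
        return "vn" + path.substring(first, last) + "/fn";
    }

    // SVUP representation: the set of simplified paths (duplicates collapse).
    static Set<String> svup(Set<String> vup) {
        Set<String> result = new HashSet<>();
        for (String path : vup) {
            result.add(simplify(path));
        }
        return result;
    }

    // F1 of two sets from precision |P∩Q|/|P| and recall |P∩Q|/|Q|.
    static double f1(Set<String> p, Set<String> q) {
        Set<String> common = new HashSet<>(p);
        common.retainAll(q);
        if (common.isEmpty()) {
            return 0.0;
        }
        double precision = (double) common.size() / p.size();
        double recall = (double) common.size() / q.size();
        return 2 * precision * recall / (precision + recall);
    }

    public static void main(String[] args) {
        // Hypothetical excerpts: two loop counters (j1, j2) versus one (i).
        Set<String> a = new HashSet<>(Arrays.asList(
                "j1/expr/loop/avg", "j2/expr/loop/avg", "n/expr/if/loop/avg"));
        Set<String> b = new HashSet<>(Arrays.asList(
                "i/expr/loop/avg", "n/expr/if/loop/avg"));
        System.out.println(f1(svup(a), svup(b))); // prints 1.0
    }
}
```

Because the simplified paths of both loop counters collapse
into a single set element, the one-loop and two-loop excerpts
become indistinguishable, which is the intended effect.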


3.3 Reflecting Instruction Embedding 

A single conditional statement may introduce a new element
in many paths, making them all different (as in codes
(a) and (c) in Fig. 1). To compensate, we could measure a
partial similarity between paths (e.g., 80% of path p is contained
in q), but a single instruction (such as the conditional
statement in (c)) would then still weaken all path similarities.
We therefore propose a different approach in the fashion of
edit distance, where we pay a constant cost for a missing
path element, which may then be used in many paths.


Given two SVUP representations $P, Q$, we compare all paths
$p \in P$, $q \in Q$ to identify missing path elements $\delta(p,q) \in I^*$.
(For example, $\delta$(vn/loop/if/fn, vn/loop/fn) = if.) Many
different path combinations may lead to the same $\delta(p,q)$, so
by $m(d)$ we denote the set of all pairs $(p,q) \in P \times Q$ with
$\delta(p,q) = d$. The matching of paths in P to paths in Q is done
iteratively: In the first iteration, we match all paths from P
and Q that are identical ($\delta(p,q) = ()$). To match the SVUP
representations at minimal cost, in subsequent iterations we
identify the missing path element $d$ that unifies the largest
number of paths (that is, choose $d = \arg\max_x |m(x)|$).
Before entering the next iteration, all pairs containing an
already assigned path are removed from $m(\cdot)$. We reflect the
cost of adding a missing path element by adding it as a virtual
path to the SVUP P or Q (depending on where it was missing).


Figure 4: Dendrogram for source codes of Fig. 1
based on $F_1$-measure on SVUP representation.


Table 1 shows a detailed example. The table on the left
shows the paths belonging to the SVUPs P and Q. All paths
have been numbered for easier reference. The initial map
$m(\cdot)$ is shown on the top right; for instance, the path element
if unifies path #1 of P with #6 of Q, as well as #2 of P
with #7 of Q. At the bottom right each line corresponds
to an iteration of the matching process. In step 0, paths
#5 and #9 are already identical. In the second iteration
the path element if is chosen from map $m(\cdot)$, because it
unifies two pairs of paths (all others only one). We have thus
matched 6 paths in total (step 1 of the bottom right table),
and only #3, #4, #8 remain unassigned. The map $m(\cdot)$ now
offers two alternatives (assign and declare, both with
$|m(\cdot)| = 1$); we arbitrarily choose assign as the second
missing path element for the third iteration. This leaves only
path #4 unassigned. The choice of m(assign) has covered paths
#3 and #8, so we remove all pairs from $m(\cdot)$ containing any
of these (already covered) paths. This ends the matching
phase (all $|m(\cdot)| = 0$). We had to add the first missing
path element if to paths in Q; likewise we had to add the
second path element assign to P to match #3 against #8.
As a penalty for the missing path elements we add them to
the respective SVUP, that is, P becomes {1, 2, 3, 4, 5, assign}
and Q = {6, 7, 8, 9, if}. Taking the established identity of
paths into account, this gives us an $F_1$ value of

$$p = \frac{|P \cap Q|}{|P|} = \frac{4}{6}, \quad r = \frac{|P \cap Q|}{|Q|} = \frac{4}{5} \quad \Rightarrow \quad F_1 = 2 \cdot \frac{16/30}{44/30} = \frac{8}{11}$$
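
The iterative matching can be sketched as follows. This is a
simplified reading, not the authors' implementation: it only
considers a single missing path element per pair, it rebuilds
the map m from the still unassigned paths in every round
(which has the same effect as removing covered pairs), and it
breaks ties alphabetically. All class and method names are
hypothetical.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

final class EmbeddingMatchSketch {

    // Returns the single inner element by which the longer path exceeds
    // the shorter one, or null if the paths do not differ in exactly that way.
    static String missingElement(String p, String q) {
        List<String> a = Arrays.asList(p.split("/"));
        List<String> b = Arrays.asList(q.split("/"));
        List<String> longer = a.size() > b.size() ? a : b;
        List<String> shorter = a.size() > b.size() ? b : a;
        if (longer.size() != shorter.size() + 1) {
            return null;
        }
        for (int k = 1; k < longer.size() - 1; k++) {
            List<String> reduced = new ArrayList<>(longer);
            reduced.remove(k); // removes by index
            if (reduced.equals(shorter)) {
                return longer.get(k);
            }
        }
        return null;
    }

    // F1 after greedy matching; each chosen element is charged as one
    // virtual path on the side where it was missing.
    static double similarity(List<String> p, List<String> q) {
        boolean[] usedP = new boolean[p.size()];
        boolean[] usedQ = new boolean[q.size()];
        int matched = 0, virtualP = 0, virtualQ = 0;
        // Step 0: identical paths.
        for (int i = 0; i < p.size(); i++) {
            for (int j = 0; j < q.size(); j++) {
                if (!usedP[i] && !usedQ[j] && p.get(i).equals(q.get(j))) {
                    usedP[i] = usedQ[j] = true;
                    matched++;
                }
            }
        }
        while (true) {
            // Map m: missing element -> pairs of unassigned paths it unifies.
            Map<String, List<int[]>> m = new TreeMap<>();
            for (int i = 0; i < p.size(); i++) {
                for (int j = 0; j < q.size(); j++) {
                    if (usedP[i] || usedQ[j]) continue;
                    String d = missingElement(p.get(i), q.get(j));
                    if (d != null) {
                        m.computeIfAbsent(d, key -> new ArrayList<>()).add(new int[] {i, j});
                    }
                }
            }
            if (m.isEmpty()) break;
            String best = null;
            for (String d : m.keySet()) {
                if (best == null || m.get(d).size() > m.get(best).size()) best = d;
            }
            boolean missingInP = false, missingInQ = false;
            for (int[] pair : m.get(best)) {
                if (usedP[pair[0]] || usedQ[pair[1]]) continue;
                usedP[pair[0]] = usedQ[pair[1]] = true;
                matched++;
                // The shorter path is the one lacking the element.
                if (p.get(pair[0]).length() < q.get(pair[1]).length()) missingInP = true;
                else missingInQ = true;
            }
            if (missingInP) virtualP++;
            if (missingInQ) virtualQ++;
        }
        if (matched == 0) return 0.0;
        double precision = (double) matched / (p.size() + virtualP);
        double recall = (double) matched / (q.size() + virtualQ);
        return 2 * precision * recall / (precision + recall);
    }
}
```

Applied to the nine paths of Table 1, this sketch first matches
#5 with #9, then selects if (two pairs), then assign, and
arrives at precision 4/6, recall 4/5 and thus 8/11.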


Fig. 4 shows the resulting dendrogram. Now OneLoopGuardedIf
(Fig. 1(c)) became much more similar to
One/TwoLoopQuickExit (Fig. 1(a-b)). But we are still dissatisfied
with the high similarities towards Disordered: it uses the 
same set of embedded instructions (causing the high simi- 
larity), but the variable usage is mixed up. The instructions 
may look identical, but the author did not get the role of 
variables right and that should degrade the similarity. 


3.4 Matching Variables 


So we finally revisit the simplification of definition 2. We 
have hypothesized that similar solution approaches use vari- 
ables at specific places in the code skeleton. Up to now, 
we have mixed the usage paths of different variables in the 
SVUP representation. This will be sorted out next. 
Definition 3. Let
$\pi_{(v,f)} : P \mapsto \{p \mid p = (v, i_1, \ldots, i_n, f) \in P\}$
be a filtering function that returns
only those paths that refer to variable identifier $v$ and
function identifier $f$. From a VUP representation $P$ with
a set $V$ of variable identifiers and $F$ of function
identifiers, we obtain a GVUP representation (grouped VUP)
$P' = \{\pi_{(v,f)}(P) \mid v \in V, f \in F\}$. That is, a GVUP is a set
of VUPs, one VUP set for each variable in each function.
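
Grouping in the sense of Definition 3 amounts to partitioning
the path set by variable and function. A minimal sketch,
assuming paths are strings of the form v/i1/.../in/f; the class
name and the key format "v@f" are hypothetical choices:

```java
import java.util.LinkedHashMap;
import java.util.LinkedHashSet;
import java.util.Map;
import java.util.Set;

final class GvupSketch {

    // Groups the paths of a VUP representation by (variable, function).
    // The map key "v@f" is only a convenient label for the pair (v, f).
    static Map<String, Set<String>> group(Set<String> vup) {
        Map<String, Set<String>> groups = new LinkedHashMap<>();
        for (String path : vup) {
            String variable = path.substring(0, path.indexOf('/'));
            String function = path.substring(path.lastIndexOf('/') + 1);
            groups.computeIfAbsent(variable + "@" + function, key -> new LinkedHashSet<>())
                  .add(path);
        }
        return groups;
    }
}
```

Only non-empty groups are created here, so combinations of a
variable and a function that never occur together do not
appear in the result.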


For instance, the code in Fig. 1(b) consists of 4 variables in 
one function, so the GVUP representation is a set of four 
VUP representations (one set for each variable). Now, as 
the GVUP representations P and Q of two source codes
are sets of sets, how do we generalize the $F_1$ calculation?
Consider the following example, where we use abbreviated
instructions $I = \{a, b, c\}$:


P = {{x/a/c/f, x/b/c/f}, {y/a/f, y/c/f}, {z/b/f}},
Q = {{x/a/f, x/c/f}, {y/b/f}}


Formally, a direct calculation of $|P \cap Q|$ yields 0 as none of
the sets in P is contained in Q. But this does not meet our
intention. Note that variable y in P plays the role of x in
Q (same for z in P and y in Q). Variables may be renamed
without changing the semantics; in fact $|P \cap Q|$ should be
2. The desired semantics is to match variable and function
names appropriately, rename them accordingly and calculate
the $F_1$ measure on the obtained $\bigcup_{X \in P} X$ and $\bigcup_{Y \in Q} Y$.


How do we match the sets X € P against Y € Q? All paths 
in X or Y refer to the same variable and function name, so 
we can safely transform the VUP to a SVUP representation 
and use the measure from section 3.3 to construct a pair- 
wise cost matrix. We employ the Munkres algorithm [8] to 
find the optimal assignment based on this cost matrix. As 
the Munkres algorithm performs a 1:1 least-cost assignment, 
some variables may not get assigned. We match them after- 
wards in a second pass to the least costly counterpart. (This 
allows us to match multiple identifiers in one program to the 
same variable in another, as required when comparing codes 
(a) and (b) of Fig. 1.) 


As an example, for the above-mentioned P and Q we would
create a cost matrix of all pairwise $F_1$-values (P-paths in
rows, Q-paths in columns):


0.67 0.50
1.00 0.50
0.50 1.00


The Munkres algorithm optimally assigns two of the P-paths
to two of the Q-paths (the two entries of 1.00, set in bold
face in the original). The third P-path is then
associated with the first Q-path, which matches best
according to the higher $F_1$-value. We have thus assigned the
variables x and y of P to variable x in Q and may think
of renaming all these variables in both codes to, say, u. In
the same fashion we may rename the variables of the second
assignment to v; reunifying all SVUPs leads us to:


P' = {u/a/c/f, u/b/c/f, u/a/f, u/c/f, v/b/f},
Q' = {u/a/f, u/c/f, v/b/f}


The final similarity is obtained by applying the calculations
of Sect. 3.3 to both sets P' and Q' (this time with different
variable names rather than just generic variable names
vn). Although the example did not include different function
names, it works in the same way. Fig. 5 shows the resulting
dendrogram. As desired, the similarity of code Disordered
(Fig. 1(d)) is now worst among all solutions that follow the
same solution approach.


Figure 5: Dendrogram for source codes of Fig. 1
based on $F_1$-measure on GVUP representation.
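
The two-pass matching described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the similarity matrix is copied from the example, the function names are hypothetical, and the optimal 1:1 assignment is found by exhaustive search instead of the Munkres algorithm (both return the same optimum on such a small matrix, but exhaustive search does not scale).

```python
from itertools import permutations

# Pairwise F1 similarities from the example
# (rows: variable sets of P, columns: variable sets of Q).
SIMILARITY = [
    [0.67, 0.50],
    [1.00, 0.50],
    [0.50, 1.00],
]


def optimal_one_to_one(sim):
    """Exhaustive stand-in for Munkres: best-scoring 1:1 assignment."""
    n_rows, n_cols = len(sim), len(sim[0])
    if n_rows >= n_cols:
        # Pick one distinct row for every column.
        candidates = ([(r, c) for c, r in enumerate(rows)]
                      for rows in permutations(range(n_rows), n_cols))
    else:
        # Pick one distinct column for every row.
        candidates = ([(r, c) for r, c in enumerate(cols)]
                      for cols in permutations(range(n_cols), n_rows))
    best_total, best_pairs = float("-inf"), []
    for pairs in candidates:
        total = sum(sim[r][c] for r, c in pairs)
        if total > best_total:
            best_total, best_pairs = total, pairs
    return dict(best_pairs)


def two_pass_matching(sim):
    """Pass 1: optimal 1:1 assignment. Pass 2: attach leftover rows."""
    row_to_col = optimal_one_to_one(sim)
    for r, row in enumerate(sim):
        if r not in row_to_col:
            # Unassigned row goes to its most similar column.
            row_to_col[r] = max(range(len(row)), key=lambda c: row[c])
    return dict(sorted(row_to_col.items()))


matching = two_pass_matching(SIMILARITY)
print(matching)
```

Rows 0 and 1 (variables x and y of P) both end up on column 0 (variable x of Q), and row 2 (z) on column 1 (y), which is the assignment used for the renaming to u and v. Leftover columns, which do not occur in this example, would have to be handled symmetrically.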


4. EXPERIMENTAL EVALUATION 


We started to evaluate the proposed approach on real stu- 
dent code submissions and demonstrate its performance on 
some examples. The submissions were inspected manually 
and grouped by approach (defining ground truth). While the 
author of an exercise may have a specific solution in mind, 
a group of students usually finds multiple ways to solve the 
exercise. The real data contains nearly identical submis- 
sions (potential plagiarism) as well as submissions that ap- 
pear somewhat chaotic, have superfluous declarations and 
calculations, or contain artefacts indicating a change in the 
solution approach over time. Most codes can nevertheless 
be assigned to solution approaches, but the strong variation
in novices' code length misleads edit and tree distances such
that the resulting dendrograms do not match the approaches.


We show results for two exercises. The first exercise asks for
the most frequently occurring element in an integer array.
To inspire the students to elaborate on different solutions, 
an additional restriction was given that all values v in the 
array satisfy 0 < v < 10000. The two main solution ap- 
proaches were: (1) iterate over all elements in an outer loop, 
count the frequency of the current element in an inner loop, 
remember the element that occurs most often; (2) instanti- 
ate an array of size 10000 (associating a counter with each 
possible element in the original array), increment the re- 
spective counter while looping over the array and identify 
the largest entry in the counter array. Apart from these
two dominating solutions, three solutions (3) sorted the
array first, such that identical values are grouped together,
which simplifies frequency counting in a single loop over the
array.³ Finally, (4) there are some exotic solutions which
may be considered as a mixture of the discussed solutions.
Possibly students became aware of the other approaches by
discussing them with each other, but had difficulties
in solving the task and switched back and forth between
them. Usually, elements of all other solutions can be found
in them. Fig. 6 shows average-linkage hierarchical clustering
results for the similarities from Sect. 3.2, 3.3, and 3.4 from
top to bottom. The ground truth is shown by means of the
node colors: 1: black, 2: blue, 3: red, 4: gray, incomplete
/ unfinished: cyan. While the top clustering
mixes even the two large approaches (1) and (2), the
clustering in the middle is much better but does not separate
solution (3). This is only achieved by the GVUP-clustering
(bottom), which corresponds best to the ground truth. The
bubblesort used in (3) uses loops very similar to those of the
main task, so it was crucial to distinguish the role of variables
as done in the GVUP-approach.


³ However, when this exercise was handed out, sorting
algorithms were not yet discussed (so this solution required
background knowledge or a student's own initiative).


Figure 6: Dendrogram for exercise 'most frequent
element' based on $F_1$-measure (measure from Sect.
3.2, 3.3, and 3.4 from top to bottom).
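
To make the two dominating solution approaches concrete, here is a minimal Python sketch of (1) the nested-loop count and (2) the counter array. The function names are hypothetical and the code is not taken from any student submission; the counter version assumes non-negative integer values below the bound of 10000 stated in the exercise.

```python
def most_frequent_nested(values):
    """Approach (1): count each element in an inner loop, keep the best."""
    best_value, best_count = None, 0
    for candidate in values:
        count = 0
        for other in values:
            if other == candidate:
                count += 1
        if count > best_count:
            best_value, best_count = candidate, count
    return best_value


def most_frequent_counters(values, upper_bound=10000):
    """Approach (2): one counter per possible value, then find the largest."""
    counters = [0] * upper_bound
    for v in values:
        counters[v] += 1
    # Index of the largest counter is the most frequent value.
    return max(range(upper_bound), key=lambda v: counters[v])


data = [3, 7, 3, 9, 7, 3]
print(most_frequent_nested(data), most_frequent_counters(data))
```

Both functions compute the same result when the most frequent value is unique, yet they couple variables and instructions very differently (two nested loops versus an auxiliary array), which is the kind of difference the clustering is meant to pick up.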


A second exercise deals with the identification of happy
numbers.⁴ The calculations require the sum of squares of each
digit and the submissions differ mainly in how this
is solved. The intended solution (black) was a loop that
successively splits off the last digit (using integer division and
modulo). Another popular solution is based on string conversion
(blue); a few (somewhat restricted) approaches (red) dealt
with different numbers of digits individually (avoiding a
loop). Again, the average-linkage clustering for all three
proposals is shown in Fig. 7 and the GVUP-approach
separates them best. The modulo-approaches (black) subdivide
into two major branches, which differ in the way
intermediate variables are used. Approaches that use the same kind of
utility variables (e.g. to store digits, square of a digit, etc.)
are closer matches than approaches that use different sets of
utility variables.


⁴ https://en.wikipedia.org/wiki/Happy_number


Figure 7: Dendrogram for exercise 'happy number'
based on $F_1$-measure (measure from Sect. 3.2, 3.3,
and 3.4 from top to bottom).
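
The two popular ways of computing the digit square sum can be sketched as follows. The Python code is illustrative only (hypothetical names, not student code) and assumes positive integers as input.

```python
def digit_square_sum_modulo(n):
    """Split off the last digit with integer division and modulo."""
    total = 0
    while n > 0:
        n, digit = divmod(n, 10)
        total += digit * digit
    return total


def digit_square_sum_string(n):
    """Same quantity via string conversion of the number."""
    return sum(int(ch) ** 2 for ch in str(n))


def is_happy(n, digit_square_sum=digit_square_sum_modulo):
    """A number is happy if repeated digit square sums reach 1."""
    seen = set()
    while n != 1 and n not in seen:
        seen.add(n)  # remember visited values to detect a cycle
        n = digit_square_sum(n)
    return n == 1


print(is_happy(19), is_happy(4))
```

Swapping `digit_square_sum_string` in for the default leaves the result unchanged, while the set of instructions applied to the loop variable differs, which is why the two variants land in different clusters.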


5. CONCLUSIONS 


Assessing the variety in students' solutions to a
programming exercise without having to inspect all codes
manually can help a lecturer in many ways. We have proposed
a measure that captures how variables and instructions are
coupled by means of variable usage paths, and use this
fingerprint to match code from different solutions while at the
same time being tolerant to code repetitions. The approach
needs to be evaluated further, but the first results appear
promising. The nature of the comparison is set-based, which
allows us not only to assess similarity (using $F_1$), but also to
use recall and precision. This enables further applications:
for instance, we may assess partial solutions by how many
elements of a complete solution they contain (using
recall only) or assess the student's degree of programming
maturity by investigating the amount of superfluous
statements (using precision only).
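
As a rough illustration of how precision and recall separate the two use cases, the following Python sketch computes them on plain sets of path strings. It assumes exact set overlap, which is simpler than the path-based measure of Sect. 3.3, and all names are hypothetical.

```python
def precision_recall_f1(submission, reference):
    """Set-based precision, recall and F1 of a submission vs. a reference."""
    common = len(submission & reference)
    precision = common / len(submission) if submission else 0.0
    recall = common / len(reference) if reference else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# Renamed path sets from the matching example (exact overlap assumed).
submission = {"u/a/c/f", "u/b/c/f", "u/a/f", "u/c/f", "v/b/f"}
reference = {"u/a/f", "u/c/f", "v/b/f"}

precision, recall, f1 = precision_recall_f1(submission, reference)
print(round(precision, 2), round(recall, 2), round(f1, 2))
```

Here recall is 1 because every element of the reference occurs in the submission, while precision is lower because of the two extra paths, matching the reading of precision as a signal for superfluous statements.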
ABSTRACT


Bayesian Knowledge Tracing, a model used for cognitive 
mastery estimation, has been a hallmark of adaptive learn- 
ing research and an integral component of deployed intelli- 
gent tutoring systems (ITS). In this paper, we provide a brief 
history of knowledge tracing model research and introduce 
pyBKT, an accessible and computationally efficient library 
of model extensions from the literature. The library provides 
data generation, fitting, prediction, and cross-validation rou- 
tines, as well as a simple to use data helper interface to ingest 
typical tutor log dataset formats. We evaluate the runtime 
with various dataset sizes and compare to past implementa- 
tions. Additionally, we conduct sanity checks of the model 
using experiments with simulated data to evaluate the accu- 
racy of its EM parameter learning and use real-world data 
to validate its predictions, comparing pyBKT’s supported 
model variants with results from the papers in which they 
were originally introduced. The library is open source and 
open license for the purpose of making knowledge tracing 
more accessible to communities of research and practice and 
to facilitate progress in the field through easier replication 
of past approaches. 


Keywords 
Bayesian Knowledge Tracing; Intelligent Tutoring Systems; 
Educational Software; Python Library 


1. INTRODUCTION 


Knowledge Tracing [6] has been a well-researched approach
to estimating students' cognitive mastery in the context of
computer tutoring systems [23]. Tutoring systems take a
problem-solving, or active, approach to learning [2, 1] that
often resembles the personalized mastery learning approach
researched by Bloom [4]. The model was not originally
described using a particular statistical framework; however,
the mathematical expressions in the original work are
consistent with Bayes' Theorem [26], and the canonical model was
subsequently coined Bayesian Knowledge Tracing (BKT).


In spite of its growing popularity in the research community,
accessible and easy to use implementations of the model and
its many variants from the literature have remained elusive.
In this paper, we introduce pyBKT,¹ a modernized
Python-based library, making BKT models and respective adaptive
learning research more accessible to the community. The
library's interface and underlying data representations are
expressive enough to replicate past BKT variants and allow
for new models to be proposed. The library is designed with
data helpers and model definition functions, allowing for
convenient replication and comparison to BKT model
variants and subsequently better scientific progression and
evaluation of new state-of-the-art knowledge tracing approaches.


The Bayesian Knowledge Tracing model can be described
as a Hidden Markov Model (HMM) with observable nodes
representing students' known binary problem response
sequences $obs_t$ and hidden nodes representing students' latent
knowledge state at a particular time step t. Using
expectation maximization, pyBKT fits the learn (transition)
parameter and the guess and slip (emission) parameters from
historical response logs, with the parameters defined below.


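$$
\begin{aligned}
\text{prior} &= P(L_0) \\
\text{learn} &= P(T) = P(L_{t+1} = 1 \mid L_t = 0) \\
\text{guess} &= P(G) = P(obs_t = 1 \mid L_t = 0) \\
\text{slip} &= P(S) = P(obs_t = 0 \mid L_t = 1)
\end{aligned}
$$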


Note that while $P(L_0)$ denotes the prior parameter, we also
define $P(L_t)$ as the probability that the student has
mastered the skill at time step t. Bayesian Knowledge Tracing
updates $P(L_t)$ given an observed correct or incorrect
response to calculate the posterior with:

$$
P(L_t \mid obs_t = 1) = \frac{P(L_t)\,(1 - P(S))}{P(L_t)\,(1 - P(S)) + (1 - P(L_t))\,P(G)}
$$

$$
P(L_t \mid obs_t = 0) = \frac{P(L_t)\,P(S)}{P(L_t)\,P(S) + (1 - P(L_t))\,(1 - P(G))}
$$

The updated prior for the following time step, which
incorporates the probability of learning from immediate feedback
and any other instructional support, is defined by:

$$
P(L_{t+1}) = P(L_t \mid obs_t) + (1 - P(L_t \mid obs_t))\,P(T)
$$

The standard BKT model assumes no forgetting:

$$
P(F) = P(L_{t+1} = 0 \mid L_t = 1) = 0
$$
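
The update equations above translate directly into a few lines of Python. The sketch below is a stand-alone illustration with a hypothetical function name; it is not pyBKT's internal code and assumes the standard model without forgetting.

```python
def bkt_update(p_known, correct, learn, guess, slip):
    """One BKT step: condition P(L_t) on the response, then apply learning."""
    if correct:
        numerator = p_known * (1 - slip)
        denominator = numerator + (1 - p_known) * guess
    else:
        numerator = p_known * slip
        denominator = numerator + (1 - p_known) * (1 - guess)
    posterior = numerator / denominator  # P(L_t | obs_t)
    return posterior + (1 - posterior) * learn  # P(L_{t+1})


# Example values chosen for illustration only.
p_next = bkt_update(p_known=0.4, correct=True, learn=0.3, guess=0.2, slip=0.1)
print(round(p_next, 3))
```

With these values the posterior after a correct response is 0.36 / 0.48 = 0.75, and applying the learn rate gives 0.75 + 0.25 · 0.3 = 0.825. Degenerate parameter choices that make the denominator zero are not handled in this sketch.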


¹ https://github.com/CAHLR/pyBKT
2. RELATED WORK

Since the introduction of the standard BKT model by
Corbett and Anderson [6], many variants of BKT have been
proposed which multiplex, or condition, the four parameters
of prior, learn, guess, and slip on different factors such as
question type or student. KT-IDEM (Item Difficulty
Effect) [20] captured item performance variance within a skill
by allowing different guess and slip values to be fit per item.
Baker et al. proposed a hybrid model using regression to
help determine if a student's response was a guess or a slip
based on context.


Other modifications focused on conditioning the learn rate
and prior knowledge parameter [19]. Yudelson et al.
explored student-level parameter individualization, finding
that the learn rate provided better predictive performance
than prior when individualized to each student. Learn rate
has also been conditioned based on the item [17], educational
resource (e.g., a video in an online course) [21], or type of
instructional intervention seen by the student at each
opportunity or order in which items or resources were seen
[18]. The assumption of no forgetting was relaxed in Qiu
et al. [25], who found that response correctness decreased
with the time elapsed since the last response. This decrease was
better modeled in BKT with conditional forget rates than
with increased slip rates. González-Brenes et al. [9] created
a hybrid BKT model that factored in a variety of features,
including response time. An overview of other variations of
BKT and logistic approaches to learner response prediction
can be found in Pelánek [23].


Neural network approaches to knowledge tracing have gained 
in momentum since the introduction of Deep Knowledge 
Tracing (DKT; [24]). Some have explored the reasons for 
DKT’s apparent accuracy improvement as compared to BKT 
and have attributed its success to its high dimensional hid- 
den space and ability to observe interleaved skills in a single 
model [14]. However, most papers using neural approaches 
compare only to the standard BKT model proposed in the 
1990s and not to the more modern variants. A noteworthy
cautionary result was reported in Khajah et al. [11], in
which it was found that simply enabling the forget parame- 
ter of standard BKT led to performance on par with DKT 
on several datasets. 


There has been a brief history of BKT implementation
frameworks. Several BKT variants have used Kevin Murphy's
Bayes Net Toolbox (BNT) for MATLAB [15], with a
subsequent wrapper for that toolbox released to cater to
knowledge tracing [5]. Yudelson et al. produced a C++
implementation² of BKT with a command line interface that
included support for individualized parameters. Finally, Xu
et al. [29] created a C++ implementation of BKT with a
MATLAB interface and support for parallelization and
conditioning of all parameters based on both problems and
passive resources (e.g., a learning rate for a video).


3. PYBKT LIBRARY DETAILS 

The pyBKT library builds off of xBKT developed by Xu
et al. [29] and is released under an MIT license. The library
is compatible with all platforms (Linux, Windows, macOS),
primarily utilizing NumPy [28] for computation. It is
available on the Python Package Index (PyPI), with installation
accomplished through a pip one-liner: pip install pyBKT. To
increase performance, we additionally supply routines which
utilize C++ libraries and Eigen/LAPACK, which is an
optimized linear algebra package with OpenMP support. This
accelerated version requires a C++ compiler and is currently
only tested on Linux and macOS.


² https://github.com/myudelson/hmm-scalable


We created a Scikit-learn style [22] Model class abstraction 
and accompanying data helpers that further facilitate the 
accessibility and expressive power of pyBKT. With one-line 
fit, predict, evaluate, parameter initialization, and cross- 
validate methods, pyBKT offers ease of use in ingesting re- 
sponse data and applying BKT models and supported vari- 
ants. We explore the interface to these methods in the next 
subsection. We then detail the internal data structures, 
computations, and motivations behind the development of 
the two implementations of pyBKT, in pure Python and the 
accelerated C++/Python, along with runtime evaluation. 


3.1 Interface 

pyBKT’s interface is modeled after Scikit-learn’s accessible 
frontend interface for machine learning models [22]. The 
ease-of-use in the pyBKT Model class abstraction allows for 
increasingly expressive BKT code without the usability sac- 
rifices of past BKT libraries. We aim for the library to 
be easy to learn for beginners while still useful for experi- 
enced users conducting knowledge tracing research. Further, 
it provides a gateway into exploring multiple model
extensions from the literature, which have been shown to be
capable competitors to DKT and able to address inequities
in unmodeled differences in learning and prior ability
between students [7]. Supported BKT extensions include:
KT-IDEM [20], KT-PPS (Prior Per Student) [19], BKT+Forget
[11], Item Order Effect [18] and Item Learning Effect [7,
21]. These model extensions are referred to as multigs,
multiprior, forgets, multipair, and multilearn,
respectively, in the model interface.


The Model class abstraction supports creating, fitting, pre- 
dicting, cross-validating and evaluating BKT models using 
any combination of supported extensions. Additional fea- 
tures include specifying model parameter initialization be- 
fore fitting, custom cross-validation fold assignment, and 
multiple accuracy and error metrics - including support for 
generic user-defined or Scikit-learn imported metrics. Com- 
mon dataset formats are made easier to ingest through auto- 
matic detection of familiar column headers seen in Cognitive 
Tutor and ASSISTments datasets [10]. Defaults for all 
customizable parameters such as random seed, paralleliza- 
tion, model variants, and evaluation metric(s) are provided 
when they are not specified. 


We demonstrate a few of the library's basic capabilities in
parameter initialization, fitting, and parameter output in
the code snippet below using the learned parameters of the
"Polynomial Factors" skill from the 2009-2010 ASSISTments
dataset.³ Note that all parameters of all skills found in the
dataset are fit unless otherwise specified.


³ .google.com/site/assistmentsdata/home/assistment-2009-2010-data
>>> from pyBKT.models import Model
>>> model = Model()
>>> # Fit all skills found in the dataset
>>> model.fit(data_path = 'assistments.csv')
>>> # Show the learned parameters of a single skill
>>> model.params().loc[('Polynomial Factors')]
param    class    value
prior    default  0.17452
learns   default  0.13378
guesses  default  0.25502


The internal data helper functionality converts any response- 
per-row comma or tab separated file into the internal py- 
BKT data format. It is designed to convert input columns 
using default column mappings for skill name, student iden- 
tifiers, correctness, etc. automatically for Cognitive Tu- 
tor/ASSISTments data and configurable with one line for 
any other dataset. It provides increased flexibility to the 
user, allowing for consistency across fit and predict/evalu- 
ate phases. 


The included evaluation metrics are root-mean squared error 
(RMSE), accuracy with a threshold of 0.5, and area under 
the ROC curve (AUC). Custom metrics in the format of two- 
parameter Python functions are supported for evaluation 
and cross-validation, such as the regression or classification 
metrics from sklearn.metrics or keras.metrics. 
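
The "two-parameter Python function" format can be illustrated without pyBKT itself. The sketch below assumes that the first argument holds the true responses and the second the predicted probabilities, mirroring the argument order of sklearn.metrics; the function names are hypothetical, and treating a prediction of exactly 0.5 as correct is an assumption.

```python
import math


def rmse(true_vals, pred_vals):
    """Root-mean-squared error between responses and predictions."""
    squared = [(t - p) ** 2 for t, p in zip(true_vals, pred_vals)]
    return math.sqrt(sum(squared) / len(squared))


def accuracy_at_half(true_vals, pred_vals):
    """Accuracy after thresholding predictions at 0.5 (>= 0.5 counts as 1)."""
    hits = sum(int(p >= 0.5) == t for t, p in zip(true_vals, pred_vals))
    return hits / len(true_vals)


def mean_absolute_error(true_vals, pred_vals):
    """Average absolute difference, a typical custom metric."""
    return sum(abs(t - p) for t, p in zip(true_vals, pred_vals)) / len(true_vals)


responses = [1, 0, 1, 1]
predictions = [0.9, 0.2, 0.4, 0.7]
print(rmse(responses, predictions),
      accuracy_at_half(responses, predictions),
      mean_absolute_error(responses, predictions))
```

A function of this shape would be handed to the library through the metric argument in the same way as the sklearn function in the cross-validation example; how pyBKT names the resulting column is not covered by this sketch.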


The cross-validate function provides a one-line interface for 
fitting and evaluating any combination of model variants 
with one or more error metrics. For the following exam- 
ple, we specify a particular column to use for the multilearn 
(answer type) along with a multigs model trained on the de- 
fault template ID for ASSISTments. A skill or combination 
of skills can be specified along with the seed and number of 
folds (optional). 


>>> import sklearn.metrics
>>> model = Model(seed = 42, parallel = True)
>>> model.crossvalidate(data_path = 'assistments.csv',
...     skills = ['Circle Graph', 'Box and Whisker'],
...     multigs = True, multilearn = 'answer_type',
...     metric = ['auc', sklearn.metrics.mean_absolute_error])
skill            mean_absolute_error  auc
Circle Graph     0.41565              0.72782
Box and Whisker  0.33906              0.67991


3.2 Internal Implementation Details 

The pure Python and C++/Python implementations of py- 
BKT both make use of optimized programmatic methods, ef- 
ficient internal data and model representation, multithreaded 
model fitting, and optimized linear algebra libraries. The 
model fitting consists of a typical Expectation Maximiza- 
tion (EM) function for a Hidden Markov Model performing 
forward and backward passes over the sequential data to 
continuously update BKT parameters. We implement these 
passes using a parallelized iterative dynamic programming 
approach on the input data. 


3.2.1 Model Representation 

In the context of model variants with multiple learn, guess,
slip, or forget rates, a subscript $P(T_i)$, $P(G_i)$, $P(S_i)$, $P(F_i)$
denotes the corresponding probability, or rate, for class
$i = 1 \ldots m$, respectively.


Initial and fit BKT model parameters are represented using
a Python dictionary. Inside of this dictionary, we store A,
which is a collection of matrices with each 2x2 matrix
corresponding to the learning and forgetting probabilities for
each learn class in order to aid in efficient matrix
multiplication during fitting. A has the following format, where m is
the total number of learn rates:

$$
A = \left[
\begin{bmatrix} P(\neg T_1) & P(F_1) \\ P(T_1) & P(\neg F_1) \end{bmatrix},
\ \ldots,\
\begin{bmatrix} P(\neg T_m) & P(F_m) \\ P(T_m) & P(\neg F_m) \end{bmatrix}
\right]
$$

We define $\alpha$ as a set of 2-length vectors each corresponding
to $[P(\neg L_t), P(L_t)]$ for all time steps for a specific student.


Similarly, $\pi_0$ stores information about the prior, in the
format of $[P(\neg L_0), P(L_0)]$.

3.2.2 EM and BKT Algorithm 


We use Expectation Maximization to fit model parameters,
shown to provide desirable convergence properties, given
plausible initial parameter values [16]. Inside the EM and
inference algorithms, we use several intermediate data
structures and vectorization to improve computational efficiency
in fitting the models. To calculate $\alpha[t+1]$ given $\alpha[t]$, we
multiply it by the part of the learn/forget transition
matrix A corresponding to the learn class of time step t. We
element-wise multiply by the likelihood vector, which
consists of $[P(G), P(\neg S)]$ or $[P(\neg G), P(S)]$ for the
corresponding guess class of time step t, depending on whether the
student answers correctly or not, respectively. Finally,
normalizing this vector results in $\alpha[t+1]$. We demonstrate the
algorithm for an example iteration of the BKT algorithm
with learn class 1, guess class 1, and an incorrect response
observed (obs = 0) at time step t.

$$
\begin{aligned}
\alpha[t+1]
&= \begin{bmatrix} P(\neg T_1) & P(F_1) \\ P(T_1) & P(\neg F_1) \end{bmatrix}
   \frac{\alpha[t] \odot \begin{bmatrix} P(\neg G_1) \\ P(S_1) \end{bmatrix}}
        {\left\lVert \alpha[t] \odot \begin{bmatrix} P(\neg G_1) \\ P(S_1) \end{bmatrix} \right\rVert_1} \\
&= \begin{bmatrix}
   P(\neg T_1)\,P(\neg L_t \mid obs_t) + P(F_1)\,P(L_t \mid obs_t) \\
   P(T_1)\,P(\neg L_t \mid obs_t) + P(\neg F_1)\,P(L_t \mid obs_t)
   \end{bmatrix}
\end{aligned}
$$

At the end of this $\alpha$ calculation, we perform the E-step of
the EM algorithm by recursively calculating an expectation
$\gamma$ for each time step by backtracking through the learned
latent states. We can then take the global average of the
expectations of the learn/forget transition matrices, guess/slip
vectors, and priors during the M-step and use these as the
parameters for the next iteration of EM. In terms of the
number of students S and the typical sequence length for
each student T, the model fitting algorithm's asymptotic
time complexity for standard BKT is $\Theta(TS)$.
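
The worked example for a single $\alpha$ update can be sketched in plain Python. The code follows the order of the example (condition on the observation, then apply the transition matrix); it uses lists instead of NumPy arrays, hypothetical names, and illustrative parameter values, so it should be read as a sketch and not as pyBKT's implementation.

```python
def forward_step(alpha, transition, likelihood):
    """Compute alpha[t+1] = A * normalize(alpha[t] (element-wise) likelihood)."""
    weighted = [a * l for a, l in zip(alpha, likelihood)]
    norm = sum(weighted)
    posterior = [w / norm for w in weighted]  # [P(not L_t|obs), P(L_t|obs)]
    return [
        sum(transition[i][j] * posterior[j] for j in range(2))
        for i in range(2)
    ]


# Illustrative values: learn class 1 and guess class 1, no forgetting.
learn, forget, guess, slip = 0.3, 0.0, 0.2, 0.1
A1 = [[1 - learn, forget],
      [learn, 1 - forget]]

alpha_t = [0.6, 0.4]              # [P(not L_t), P(L_t)]
incorrect = [1 - guess, slip]     # likelihoods for obs = 0
alpha_next = forward_step(alpha_t, A1, incorrect)
print([round(x, 4) for x in alpha_next])
```

Because each column of the transition matrix sums to one, the result is again a probability vector, and its second entry agrees with the scalar update $P(L_{t+1}) = P(L_t \mid obs_t) + (1 - P(L_t \mid obs_t))P(T)$ for the same values.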


3.2.3. C++/Python Implementation Details 

We use a C++ extension to perform the EM iterative
updates and matrix multiplication for the model fit and
prediction process. This allows us to use efficient linear algebra
libraries in C++ and benefit from greater support for
multithreading through OpenMP.


We use Eigen to perform the matrix operations. There are
many technical advantages of using Eigen with a linear
algebra heavy model such as pyBKT. Eigen provides efficient,
thread-safe matrices and arrays, while being a relatively
portable package distributed along with pyBKT. It allows
for lazy evaluation, expression templates and compiler
optimizations.


⁴ Boost was previously used as a connector between Python
and the C++ extension, but it has been deprecated since
pyBKT 1.2.2, resulting in a 3-5x performance increase.


We use OpenMP for parallelizing the demanding model fit- 
ting process. OpenMP is a universally accepted multithread- 
ing library for C/C++ that exploits multicore processors 
with low overhead. With a shared memory space for forked 
threads, OpenMP avoids overhead for inter-process com- 
munication (IPC) unlike Python’s multiprocessing. With 
Eigen’s explicit support for OpenMP-based multithreading, 
heavy matrix operations and iterative processes are further 
optimized in pyBKT. 


3.2.4 Pure Python Implementation Details 

We wish to maintain the accessibility of pyBKT across all 
platforms while maintaining as much efficiency as possible. 
To do this, a pure Python implementation, without C++ 
extensions, is included. This implementation provides quick 
access to any user, including on Windows, wishing to run 
a BKT model without the hassles of compilation and com- 
plex dependencies. We relax this version of the library’s 
requirements to include mostly native modules along with 
the widely supported NumPy. 


NumPy is used for the matrix operations in the pure Python 
build of pyBKT as it is the most efficient and widely used 
numeric computational library in Python. Technically, it 
provides impressive single-threaded performance. Similarly 
to other optimized mathematical libraries such as Eigen, 
NumPy employs code vectorization, efficient memory map- 
ping techniques for sparse matrices, and compiler optimiza- 
tions. 


Since NumPy is primarily a single-threaded application with 
little support for multicore scaling, we use Python’s native 
multiprocessing library for parallelizing the model fitting. 
It provides native CPython support for multicore scaling to 
bypass the Global Interpreter Lock (GIL), which allows only 
one running thread of execution within a process. Although
different Python implementations (e.g., Jython) exist that do
without the GIL or remove the memory overhead, we use a
native module for simplicity and speed.


3.3 Runtime Evaluation 

We compare the runtime performance of the pure Python
and C++/Python implementations of pyBKT on five
typical model fitting and prediction tasks. We present two sets
of tasks, fitting and prediction on synthetic data, that
additionally showcase the way in which the runtimes scale with
the size of input data for both implementations of pyBKT.
Each of the tasks is averaged over several runs for both
implementations of pyBKT on a machine with 2 x Intel(R)
Xeon(R) CPU E5-2620 v3 CPUs at 2.4 GHz with 256 GB of
system RAM. The results are shown in Table 1.


We evaluate the runtimes using two metrics. The scaling
factor is defined as the ratio of the runtime on the larger
input to the runtime on the smaller input for a set of
prediction or fitting tasks. The speedup is defined as the ratio
of the runtime of our pure Python implementation to that of
our C++/Python implementation, so that values above 1
favor the C++/Python implementation.
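
Both metrics are simple ratios, and they can be recomputed from the runtimes reported in Table 1. The short Python sketch below does this for the synthetic tasks; the dictionary layout and function names are illustrative.

```python
# Runtimes in seconds from Table 1: (500 students, 5,000 students).
RUNTIMES = {
    "fit": {"python": (160.12, 1596.30), "cpp": (1.07, 2.62)},
    "predict": {"python": (8.02, 67.08), "cpp": (0.50, 2.42)},
}


def scaling_factor(small_runtime, large_runtime):
    """Runtime on the larger input divided by runtime on the smaller input."""
    return large_runtime / small_runtime


def speedup(python_runtime, cpp_runtime):
    """Pure Python runtime divided by C++/Python runtime."""
    return python_runtime / cpp_runtime


for task, impl in RUNTIMES.items():
    py_small, py_large = impl["python"]
    cpp_small, cpp_large = impl["cpp"]
    print(task,
          round(scaling_factor(py_small, py_large), 2),
          round(scaling_factor(cpp_small, cpp_large), 2),
          round(speedup(py_small, cpp_small), 2),
          round(speedup(py_large, cpp_large), 2))
```

Small deviations in the last digit relative to the table can occur because the published runtimes are themselves rounded.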


The first four tasks perform synthetic data generation for 
500 students and 5,000 students respectively with a sequence 
length fixed at 100 followed by prediction or fitting. These 
tasks illustrate a typical medium and large workload for 
model fitting and prediction tasks. The generated synthetic 
data is fit or predicted using a standard BKT model. It is 
clear that as the number of students scales, the pure Python 
implementation of pyBKT performs and scales more poorly 
with the number of students. The C++/Python implemen- 
tation shows a nearly 150-600x speedup for fitting and 15- 
30x speedup for prediction as compared to the pure Python 
version. In comparison to its predecessor xBKT (MAT- 
LAB), the C++/Python version of pyBKT gains a 3-4x 
speedup across all fitting and prediction tasks. In Xu et al. 
[29], it is noted that xBKT outperforms BNT by 10,000x, 
which suggests a 30,000-40,000x speedup of pyBKT as com- 
pared to BNT. 


The final task performs a cross-validated prediction task for
a selected skill in the Cognitive Tutor dataset. We use a
variety of models (standard, multigs, multiprior) to test
predictive accuracy and measure its runtime. This task is around
67x slower in the pure Python implementation (Table 1).


While the runtimes and the overall scaling of the pure Python
port with respect to the size of input data are significantly
poorer for each task, that is an expected trade-off with
regards to accessibility and portability. End users who
train, test, or evaluate moderately sized BKT models
benefit from a portable and universal BKT implementation
that can handle a moderate input size
with relative efficiency. For heavier research-oriented or
production-oriented tasks, the C++/Python implementation
is recommended since it generally performs much more
efficiently.


4. DATA SUFFICIENCY ANALYSIS 


We examine the data sufficiency requirements of the
standard BKT model by exploring trade-offs between input size
and parameter error and mastery estimation accuracy. We
define the input size in terms of the number of students
and the average sequence length. Through the first
analysis, practitioners may gain an intuition for the
minimum cohort size and minimum number of questions
answered per student per skill to effectively apply BKT. Our
second analysis in this section focuses on mastery estimation
accuracy, also using synthetic data. This analysis depicts
how the worst-case expected mastery estimation accuracy
changes as a function of sequence length for a given set of
prior, guess, and slip parameters.


For the following analyses, we generate synthetic data, both
responses and mastery states, from pyBKT using ground
truth parameter values set to common values seen for
Algebra skills.⁵ In doing so, we are able to calculate the error of
the fit parameters and accuracy of the mastery estimation.


⁵ prior=0.08, guess=0.15, slip=0.05, forget=0, and learn=0.3
for the first analysis and variable for the second
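
To illustrate what generating responses and mastery states from ground truth parameters involves, here is a minimal stand-alone simulator in Python. It is a sketch under the standard BKT assumptions (no forgetting) with hypothetical names, and it does not reproduce pyBKT's own data generation routine.

```python
import random


def simulate_student(prior, learn, guess, slip, length, rng):
    """Sample one student's responses and latent mastery states."""
    known = rng.random() < prior
    responses, states = [], []
    for _ in range(length):
        p_correct = (1 - slip) if known else guess
        responses.append(int(rng.random() < p_correct))
        states.append(int(known))
        # Learning opportunity after each response; no forgetting.
        if not known and rng.random() < learn:
            known = True
    return responses, states


rng = random.Random(0)  # fixed seed for reproducibility
cohort = [
    simulate_student(prior=0.08, learn=0.3, guess=0.15, slip=0.05,
                     length=10, rng=rng)
    for _ in range(50)
]
print(len(cohort), len(cohort[0][0]))
```

Since the latent states are kept alongside the responses, both the error of fit parameters and the accuracy of mastery estimates can be computed against known ground truth.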
Table 1: Comparison of runtimes, scaling factor, and speedup between pure Python and C++/Python implementations of pyBKT.


Test Description                                          | Pure Python Runtime | Pure Python Scaling | C++/Python Runtime | C++/Python Scaling | Speedup
Synthetic Data Generation, Model Fit (500 students)       | 160.12s             |                     | 1.07s              |                    | 149.64x
Synthetic Data Generation, Model Fit (5,000 students)     | 1,596.30s           | 9.97x               | 2.62s              | 2.45x              | 609.27x
Synthetic Data Generation, Model Predict (500 students)   | 8.02s               |                     | 0.50s              |                    | 16.04x
Synthetic Data Generation, Model Predict (5,000 students) | 67.08s              | 8.36x               | 2.42s              | 4.84x              | 27.71x
Cross-validation, Cognitive Tutor                         | 320.28s             | -                   | 4.79s              | -                  | 66.86x


4.1 Synthetic Model Fit Accuracy 

The synthetic generation of data is performed for input sizes 
from 10 to 200 students for all sequence lengths from 2 to 
35. While 200 is not an uncommon number of students to 
have in a cohort, more than 30 responses to a single skill 
would be unusual for a student. Each combination of input 
size is averaged over five fits, each of which includes the best 
model over 20 random EM fit initializations. 


The error of the model's fit parameters is analyzed using
the mean absolute percentage error (MAPE) in relation to 
predefined ground truth generating parameters. We plot the 
fitting error of the learn, slip, guess, and prior parameters as 
a function of the number of synthesized students (Figure 1, 
left) and length of synthesized response sequences for each 
student (Figure 1, right). While all data points cannot be 
visualized on a single plot, we show the data points for the 
prior parameter as an example. 
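
For reference, the MAPE of one parameter over several fits can be computed as in the Python sketch below. The ground truth values are those given for the first analysis; the fitted values are invented purely for illustration and are not results from the paper.

```python
def mape(true_value, fitted_values):
    """Mean absolute percentage error of fitted values vs. one true value."""
    errors = [abs(f - true_value) / abs(true_value) for f in fitted_values]
    return 100.0 * sum(errors) / len(errors)


GROUND_TRUTH = {"prior": 0.08, "learn": 0.3, "guess": 0.15, "slip": 0.05}

# Hypothetical fitted values from repeated fits (illustration only).
fitted = {
    "prior": [0.10, 0.06],
    "learn": [0.33, 0.27],
    "guess": [0.15, 0.18],
    "slip": [0.05, 0.04],
}

for name, true_value in GROUND_TRUTH.items():
    print(name, round(mape(true_value, fitted[name]), 1))
```

Because the error is relative to the true value, a parameter with a ground truth of 0 (such as forget in this setup) cannot be scored this way, and small ground truth values such as the prior magnify a given absolute deviation.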


For all parameters, the error decays approximately
exponentially as the number of students increases. This
is consistent with the expected asymptotic behaviour when
increasing the number of students in the fitting procedure.
Learn and slip parameters asymptote at around 50 students
while guess and prior do so after 100, given a sequence length
of 10.


There seems to be a slowly decreasing linear relationship 
between the typical sequence length and parameter fit er- 
ror, with the prior parameter showing the greatest improve- 
ment in MAPE. These analyses show that there is not nearly 
as much benefit to fitting error by increasing the sequence 
length (i.e., giving students more problems) as there is by 
increasing the number of students. 


Figure 1: Mean absolute percentage error (MAPE) of fit pa- 
rameters as a function of number of students (left) and se- 
quence length (right). 
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4.2 Synthetic Mastery Estimation Accuracy 
We look to evaluate the worst-case accuracy of the standard 
BKT model’s mastery estimation, its most common task 
within a computer tutoring system, using a mastery thresh- 
old of P(Lz) > 0.95 on simulated problem solving sequences. 
We use the same predetermined ground truth BKT param- 
eters and data generation methodology as in the previous 
subsection with the exception of the learn rate, which is set 
dynamically in this analysis. 
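For readers unfamiliar with the mastery rule, the following sketch shows the standard BKT update and the P(L_t) > 0.95 threshold. It is not pyBKT's implementation; the function names are hypothetical, and the parameters are assumed to lie strictly between 0 and 1.

```python
def bkt_update(p_mastery, correct, learn, guess, slip, forget=0.0):
    """One BKT step: condition on the response, then apply the transition."""
    if correct:
        evidence = p_mastery * (1 - slip)
        posterior = evidence / (evidence + (1 - p_mastery) * guess)
    else:
        evidence = p_mastery * slip
        posterior = evidence / (evidence + (1 - p_mastery) * (1 - guess))
    # Transition: a mastered skill may be forgotten, an unmastered one learned.
    return posterior * (1 - forget) + (1 - posterior) * learn


def first_mastery_step(responses, prior, learn, guess, slip, threshold=0.95):
    """Return the first time step at which P(L_t) exceeds the threshold, else None."""
    p = prior
    for t, correct in enumerate(responses, start=1):
        p = bkt_update(p, correct, learn, guess, slip)
        if p > threshold:
            return t
    return None
```

A tutoring system would typically stop assigning problems for a skill once `first_mastery_step` returns a value.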


It is known that the probability of mastery will converge to 
1 - forgets in a finite number of time steps given a learn 
rate > 0 [27]. This means that cognitive mastery estimation 
accuracy will increase with respect to sequence length, as 
student mastery state becomes more homogeneous. 


In order to model the worst-case mastery estimation of the 
model, we find the learn rate that will produce an average 
probability of mastery close to 0.5 across students at the 
final time step T for all our chosen sequence lengths, thus 
preventing a trivial estimation of a majority mastery state. 
We find the learn rate via grid search with a granularity of 
0.001 since this cannot be solved analytically [27]. 
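The grid search can be sketched as follows. This is an illustration under stated assumptions, not the authors' procedure: it targets the marginal probability of the latent mastered state at the final step, which may differ from the quantity searched over in the paper, and the function names and the convention that the prior applies at step 1 are hypothetical.

```python
def marginal_mastery(prior, learn, forget, num_steps):
    """Marginal P(mastered) at the final step of a BKT generating process."""
    p = prior
    for _ in range(num_steps - 1):  # assumption: the prior applies at step 1
        p = p * (1 - forget) + (1 - p) * learn
    return p


def search_learn_rate(prior, forget, num_steps, target=0.5, step=0.001):
    """Grid search for the learn rate whose final mastery is closest to target."""
    num_points = int(round(1 / step))
    grid = [i * step for i in range(1, num_points)]
    return min(
        grid,
        key=lambda learn: abs(marginal_mastery(prior, learn, forget, num_steps) - target),
    )


# Hypothetical setting: prior 0.2, no forgetting, sequence length 10.
best_learn = search_learn_rate(prior=0.2, forget=0.0, num_steps=10)
```

Repeating the search for each sequence length yields one learn rate per length, so that roughly half of the simulated students are in the mastered state at the final step.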


The mastery estimation accuracy rises with sequence length 
along a saturating exponential curve, with a Pearson correlation 
R_log(acc) = 0.73, as shown in Figure 2. The mastery 
estimation accuracy in our analysis can be observed to 
asymptote around a sequence length of 15. This suggests 
that the worst-case mastery estimation accuracy scenario 
can be mitigated, given our chosen predefined parameter 
values, with an average response sequence length per skill of 
15 or greater. 


Figure 2: Worst-case accuracy of mastery estimation as a 
function of sequence length for the standard BKT model. 
5. EVALUATION OF PYBKT MODELS 


To gauge the validity of the BKT models and extensions 
supported by pyBKT, we perform predictive model evalua- 
tions replicating those found in prior literature. Models were 
evaluated by performing five-fold cross-validation on the 
ASSISTments 2009-2010 Skill Builder dataset, the Cognitive 
Tutor 2006-2007 Bridge to Algebra dataset, and specialized 
datasets from Feng et al. [8] and Piech et al. [24] in order 
to validate more specific model extensions. While there 
are slight differences between our results and the results of 
other papers, we believe this is due to our random parameter 
initialization and small differences in fitting methods. Our 
complete model evaluations, data, and results are located in 
a pyBKT examples repository. 


5.1 Standard BKT and BKT+Forget 


In general, the predictive accuracy of the models generated 
by pyBKT is in line with what others have observed from 
BKT. For instance, Khajah et al. [1] found that fitting 
the basic BKT model for each skill on the train/test split 
provided by Piech et al. from the 2009-2010 ASSISTments 
dataset yielded an AUC of 0.73, whereas 
pyBKT predicts with an AUC of 0.76. Similarly, when Khajah 
et al. added the forgets parameter to their model, 
they achieved an AUC of 0.83, the same AUC we 
achieve using pyBKT. 
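For reference, the AUC values compared here can be computed from held-out predictions as in the sketch below. It uses the pairwise (Mann-Whitney) definition of AUC in plain Python and is not the evaluation code of either paper; the function name `auc` is chosen for illustration.

```python
def auc(labels, scores):
    """Area under the ROC curve via pairwise comparison of positives and negatives.

    labels : iterable of 0/1 response correctness
    scores : iterable of predicted probabilities of a correct response
    """
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("AUC needs at least one positive and one negative label")
    # A positive ranked above a negative counts 1, a tie counts 0.5.
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))
```

The pairwise form is quadratic in the number of responses, so a rank-based implementation would be preferable for datasets of the size discussed here.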


5.2 KT-IDEM 

When using the KT-IDEM model on the ASSISTments 2009- 
2010 Skill Builder dataset, pyBKT achieves an average AUC 
increase of 0.01932, which is very close to the 0.021 average 
increase reported in Pardos and Heffernan [20]. While the 
exact subset of data was not specified in that paper, 
we used the ten skills with the most responses, using differ- 
ent template ids as the guess/slip classes as prescribed in 
[20]. Since the ASSISTments data has a very high average 
response to template ratio (~1,000), the KT-IDEM model 
performs very well compared to the standard BKT model 
using RMSE as the metric of comparison, being lower or 
equal in nine of the ten skills selected. 


5.3 KT-PPS 

The Prior Per Student model was applied to the ASSISTments 
Groups of Learning Opportunities dataset, consisting 
of 42 problem sets. In Pardos and Heffernan [19], it 
was found that the KT-PPS model performed better than 
the standard BKT model on 30 out of the 42 problem sets, 
as evaluated on the predictions of the last response of each 
student’s response sequence. This was achieved using a vari- 
ant of KT-PPS that models a high and low prior and assigns 
students to the high prior if they answer correctly on the first 
problem of the problem set, and to the low prior otherwise. 
The high prior was set to an ad-hoc value of 0.90 and the 
low prior to 0.15 in that work. 


The pyBKT replication of this model is done without true 
multiple prior modeling. Instead, when the multiprior option 
is set to True, P(L_0) is set to 0 and a dummy time step 
is created at the beginning of the sequence. Three learn 
rates are created, the first corresponding to the high prior, 


https://pslcdatashop.web.cmu.edu/KDDCup 


https://github.com/CAHLR/pyBKT-examples 
the second to the low prior, and the third corresponding 
to the standard P(T) applied between all subsequent time 
steps. The initial values of these virtual priors were set to 
the ad-hoc values from [19]; however, since pyBKT does not 
support parameter fixing as of this writing, these parame- 
ters were learnable. With these settings, pyBKT’s KT-PPS 
performs better than standard BKT on 27 out of 42 of the 
problem sets. The small difference in prediction accuracy 
of this model may be attributable to the difference in the 
algorithm regarding fixed parameters, but the similarity in 
performance is promising. 
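The dummy time step works because, with P(L_0) = 0, the learning transition P(L_1) = P(L_0) + (1 - P(L_0)) P(T) reduces to P(L_1) = P(T), so a class-specific learn rate on the dummy step acts as a prior. The sketch below illustrates this arithmetic only. It is not pyBKT's code: the names are hypothetical, and assigning the class from the first response follows the variant described in [19], not a confirmed detail of pyBKT.

```python
def transition(p_mastery, learn):
    """Learning transition without forgetting."""
    return p_mastery + (1 - p_mastery) * learn


# Initial values of the two virtual priors, taken from the ad-hoc values in the text.
VIRTUAL_PRIOR = {"high": 0.90, "low": 0.15}


def effective_prior(first_response_correct):
    """Mastery probability after the dummy step, starting from P(L_0) = 0."""
    group = "high" if first_response_correct else "low"
    p_l0 = 0.0
    return transition(p_l0, VIRTUAL_PRIOR[group])
```

After the dummy step, the third learn rate, the standard P(T), governs every later transition.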


5.4 Item Order and Item Learning Effect 
Results from the Item Order Effect and Item Learning 
Effect papers were not focused on response prediction 
improvement. In fact, no prediction accuracy results were 
provided. Instead, the purpose of the models was to com- 
pare the learn rates of classes to flag effective and ineffective 
items and orders. The examples repo of pyBKT depicts 
such differences. Nevertheless, a modest 0.01 RMSE im- 
provement for both model variants was obtained compared 
to the standard BKT model. 


6. CONCLUSIONS 


We demonstrated pyBKT as a seamlessly installable, effi- 
cient, and portable Python library with model extensions 
such as KT-IDEM, KT-PPS, BKT+Forgets, Item Order Ef- 
fect and Item Learning Effect. The Model class abstraction 
in pyBKT provides an expressive way to interact with the 
BKT model extensions with ease, with one-line methods to 
create, initialize, fit, predict, evaluate, and cross-validate 
any combination of BKT model extensions. We measured 
the runtime of pyBKT to be nearly 3x-4x faster than its 
predecessor, xBKT, and nearly 30,000x faster than BNT, a 
standard BKT implementation. Through the analyses pre- 
sented, we established 50 as a reasonable number of students 
to achieve convergence to canonical parameter values with 
any average student sequence length and 15 as a reasonable 
sequence length to mitigate worst-case mastery estimation 
accuracy. Lastly, through real-world dataset analyses, we 
showed the validity of the model implementation through 
its agreement with past results using established software. 
ABSTRACT 


Scoring essays is generally an exhausting and time-consuming task 
for teachers. Automated Essay Scoring (AES) makes the 
scoring process faster and more consistent. The most logical 
way to assess the performance of an automated scorer is by 
measuring the score agreement with the human raters. However, we 
provide empirical evidence that an essay scorer that performs well 
from the quantitative evaluation point of view is still too risky to 
be deployed. We propose several input scenarios to evaluate the 
reliability and the validity of the system, such as off-topic essays, 
gibberish, and paraphrased answers. We demonstrate that 
automated scoring models with high human-computer agreement 
fail to perform well on two out of three test scenarios. We also 
discuss the strategies to improve the performance of the system. 


Keywords 


Automated Essay Scoring, Testing Scenarios, Reliability and 
Validity 


1. INTRODUCTION 


An Automated Essay Scoring (AES) system is computer software 
designed as a tool to facilitate the evaluation of student essays. 
Theoretically, AES systems work faster, reduce cost in terms of 
evaluators’ time, and eliminate concerns about rater consistency. 
The most logical way to assess the performance of an automated 
scorer is by measuring the score agreement with the human raters. 
The score agreement rate must exceed a specific threshold value to 
be considered as having a good performance. Consequently, most 
studies have focused on increasing the level of agreement between 
human and computer scoring. However, the process of establishing 
reliability should not stop with the calculation of inter-coder 
reliability, because automated scoring poses some distinctive 
validity challenges such as the potential to misrepresent the 
construct of interest, vulnerability to cheating, impact on examinee 
behavior, and users’ interpretation and use of scores [1]. 
Bennett and Bejar [2] have argued that reliability scores are limited 
in their reliance on human ratings for evaluating the performance 
of automated scoring, primarily because human graders are fallible. 
Human raters may experience fatigue and have problems with 
scoring consistency across time. Reliability calculations alone, 
although the current trend, are therefore not adequate for establishing 
validity [3]. A well-performing essay scorer from the quantitative 
evaluation perspective is too risky to be deployed before evaluating 
the system’s reliability and validity. 


The initial attempt to discuss validity issues regarding automated 
scoring in a larger context of a validity argument for the assessment 
was made by Clauser et al. [4]. They presented several outlines of 
the potential validity threats that automated scoring would 
introduce to the overall interpretation and use. Enright and Quinlan 
[5] discussed how the evidence for a scoring process that uses both 
human and e-rater scoring is relevant to validity argument. They 
described an e-rater model which was proposed to score one of the 
two writing tasks on the TOEFL-iBT writing section. The automated 
scorer was investigated as a tool to complement human 
judgement on essays written by English language learners. 


Several criticisms of Automated Essay Scoring (AES) systems 
were highlighted in [6]. The authors argued that there were limited studies 
on how effectively automated writing evaluation was used in writing 
classes as a pedagogical tool. In their study, the students reacted 
negatively to the automated assessment. One of the 
problems was that the scoring system favored lengthiness; higher 
scores were awarded to longer essays. It also overemphasized the 
use of transition words, which increased the score of an essay 
immediately. Moreover, it ignored coherence and content 
development as an essay could achieve a high score by having four 
or five paragraphs with relevant keywords, although it had major 
coherence problems and illogical ideas. Another concern is 
described in [7]. Specifically, knowing how the system evaluates 
an essay may be a reason why students can fool the system into 
assigning a higher score than is warranted. The authors concluded 
that the system was not yet ready to serve as the only scorer, especially for 
high-stakes testing, without the help of expert human raters. 


Most researchers agree that human-automated score agreement 
still serves as the standard baseline for measuring the quality of 
machine score prediction. However, there is an inherent limitation 
with this measurement because the agreement rate is usually 
derived only from the data used for training and testing the machine 
learning model. The aim of this paper is to highlight some 
limitations of the standard performance metrics used to evaluate 
automated essay scoring models, using several input scenarios to 
evaluate the reliability and the validity of the system, such as 
off-topic essays, gibberish, and paraphrased answers. We show 
empirical evidence that a well-performing automated essay scorer, 
with a high human-machine agreement rate, is not 
necessarily ready for operational deployment, since it fails 
to perform well on two out of three test scenarios. In addition, we 
also discuss some strategies to improve the performance of the 
system. This paper begins with the explanation of the quantitative 
performance acceptance criteria for an automated scoring model 
from [1]. Then, we present the experiment settings, including the 
training algorithm and the essay features, for creating the model. 
Afterwards, we discuss the experiment results, model performance 
analysis, reliability and validity evaluation and the strategies for 
improvement, and finally, we conclude our work. 
2. SCORE AGREEMENT EVALUATION 


According to Williamson et al. [1], the following are some 
of the acceptance criteria used for evaluation of automated scoring 
with respect to human scores when automated scoring is intended 
to be used in conjunction with human scoring: 


1. Agreement of scores between human raters and computer 


Agreement between human scores and automated scores has been 
a long-established measure of the performance of automated 
scoring. This is to measure whether the agreement satisfies a 
predefined threshold. The quadratic weighted kappa (QWK) 
between automated and human scoring must be at least .70 
(rounded normally). It is important to note that the performance of 
an automated scorer relies on the quality of the human scoring. 
Therefore, the interrater agreement among human raters must first 
be reliable. 


2. Degradation from human-human score agreement 


The human-automated scoring agreement cannot be more than .10 
lower than the human-human agreement. This standard prevents 
the case in which automated essay scoring may reach the .70 
threshold but still be notably deficient in comparison with human 
scoring. In addition, it does not rule out cases in which the 
automated-human agreement is slightly below the .70 
threshold but very close to a borderline performance for human 
scoring, for example, a human-computer QWK of .69 and a 
human-human QWK of .71. Such a model is approved for operational use 
based on being highly similar to human scoring. Moreover, 
Williamson et al. stated that it is relatively usual to observe 
automated-human agreements that are higher than the human- 
human agreements for tasks that predominantly target linguistic 
writing quality (e.g. GRE Issue and TOEFL Independent tasks). 


3. Standardized mean score difference between human and 
automated scores. 


Another measure for association of automated scores with human 
scores is that the standardized mean score difference (standardized 
on the distribution of human scores) between the human and 
computer cannot exceed .15. The standardized difference of the 
mean is formalized as follows: 


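$$Z = \frac{\left| \bar{X}_{AS} - \bar{X}_{H} \right|}{\sqrt{\left( SD_{AS}^{2} + SD_{H}^{2} \right) / 2}}$$


where $\bar{X}_{AS}$ is the mean of the automated score, $\bar{X}_{H}$ is the mean of 
the human score, $SD_{AS}^{2}$ is the variance of the automated score, and 
$SD_{H}^{2}$ is the variance of the human score. 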
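The three criteria above can be combined into a single check, sketched below. The function names are hypothetical, and two details are assumptions rather than statements from [1]: the standardized difference uses the pooled form, the absolute mean difference divided by the square root of the average of the two variances, and the variances are sample variances.

```python
from math import sqrt
from statistics import mean, variance


def standardized_mean_difference(automated, human):
    """Absolute mean difference divided by the pooled standard deviation (assumed form)."""
    pooled_sd = sqrt((variance(automated) + variance(human)) / 2)
    return abs(mean(automated) - mean(human)) / pooled_sd


def meets_criteria(qwk_human_computer, qwk_human_human, smd):
    """Apply the three acceptance thresholds described in the text."""
    agreement_ok = round(qwk_human_computer, 2) >= 0.70   # criterion 1
    degradation_ok = (qwk_human_human - qwk_human_computer) <= 0.10  # criterion 2
    difference_ok = smd <= 0.15                            # criterion 3
    return agreement_ok and degradation_ok and difference_ok
```

A model has to pass all three checks, so a scorer that reaches the .70 threshold can still be rejected for degrading too far from the human-human agreement.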


3. EXPERIMENTS 
3.1 DATASET 


We used the Automated Student Assessment Prize (ASAP) 
dataset¹, hosted on the Kaggle platform, as our experiment data. 
ASAP is the most widely used dataset to evaluate the performance 
of AES systems [8]. All the essays provided are already human 
graded. The ASAP dataset consists of eight prompts, with varying score 
ranges for each prompt. Table 1 highlights the topics of each 
prompt in the ASAP dataset. 


1 https://www.kaggle.com/c/asap-aes 


Table 1 Prompts in ASAP Dataset 


ASAP Dataset | Topics
Prompt 1     | The effects computers have on people
Prompt 2     | Censorship in the libraries
Prompt 3     | Respond to an extract about how the features of a setting affected a cyclist
Prompt 4     | Explain why an extract from Winter Hibiscus by Minfong Ho was concluded in the way the author did
Prompt 5     | Describe the mood created by the author in an extract from Narciso Rodriguez by Narciso Rodriguez
Prompt 6     | The difficulties faced by the builders of the Empire State Building in allowing dirigibles to dock there
Prompt 7     | Write a story about patience
Prompt 8     | The benefits of laughter


3.2 FEATURES EXTRACTION 


Each essay is transformed into a 780-dimensional feature vector. 
The essay features fall into two categories: 12 interpretable 
features and a 768-dimensional Sentence-BERT vector 
representation. Table 2 lists the essay features we used to train 
the scoring model. 


Table 2 Essay Features 


Type                         | Description
Interpretable essay features | Answer length (character count)
(12 features)                | Word count
                             | Average word length
                             | Count of "good" POS n-grams
                             | Number of overlapping tokens with the prompt
                             | Number of overlapping tokens (including synonyms) with the prompt
                             | Number of punctuations
                             | Spelling errors
                             | Unique words count
                             | Prompt-answer similarity score (SBERT representation)
                             | Prompt-answer similarity score (BOW representation)
                             | Language errors
Sentence-BERT features       | The encoding of the essay using the Sentence-BERT pretrained model
(768 dim)                    |


3.2.1 Interpretable Essay Features 

Six of the twelve interpretable essay features are extracted from the 
EASE (Enhanced AI Scoring Engine) library², written by one of the 
winners of the ASAP Kaggle competition. This feature set has been 
shown to be robust [9]. EASE generates a 414-length feature vector. 
However, we exclude most of the features generated by the EASE 
library, which are mostly Bag-of-Words vectors. The other six features 
extracted from the text are the number of punctuations, the number 
of spelling errors, unique words count, similarity scores between 


2 https://github.com/edx/ease 
answer and prompt using S-BERT and BOW (Bag-of-Words) 
vector representations, and the number of language errors. 


The grammar feature is measured by the number of good n-grams 
in the essay. EASE library extracts the essay text into its POS-tags 
and compares them with a list of valid POS-tag combinations in 
English. Good POS n-grams are defined as the ones that separate 
high- from low-scoring essays, determined using the Fisher test 
[10]. Moreover, we count the number of language errors in an 
answer using the Language Tool Python library³. Mechanics in a 
language include aspects such as the usage of punctuation and the 
number of spelling errors found in the answer. 


The average word length and long words count are used by Mahana 
et al. to estimate language fluency and dexterity [11]. Larkey also 
used the number of long words to indicate the complexity of term 
usage [12]. The unique words count feature is useful to estimate the 
richness of vocabulary in the answer. 


The relevance factor of an answer combines two features from the 
EASE library, which are related to the degree of token overlap 
between the prompt and the answer, including their synonyms. Two 
additional features are the cosine similarity measurement between 
the answer and the prompt, both using the Sentence-BERT and the 
BOW representation. 
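As an illustration of the relevance features, the sketch below computes a token-overlap count and a BOW cosine similarity between a prompt and an answer. It is a simplified stand-in, not the EASE implementation: tokenization is plain whitespace splitting, synonyms are ignored, and the function names are hypothetical.

```python
from collections import Counter
from math import sqrt


def token_overlap(prompt, answer):
    """Number of answer tokens that also occur in the prompt."""
    prompt_tokens = set(prompt.lower().split())
    return sum(1 for token in answer.lower().split() if token in prompt_tokens)


def bow_cosine(prompt, answer):
    """Cosine similarity between bag-of-words count vectors."""
    a = Counter(prompt.lower().split())
    b = Counter(answer.lower().split())
    dot = sum(a[token] * b[token] for token in a.keys() & b.keys())
    norm = sqrt(sum(v * v for v in a.values())) * sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0
```

The same cosine formula applies to the Sentence-BERT embeddings, with the count vectors replaced by the 768-dimensional encodings.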


3.2.2. Sentence-BERT representation 

Sentence-BERT, introduced by Reimers and Gurevych (2019), is a 
modification of the pretrained BERT network using Siamese and triplet 
network structures [13]. It converts a text into a 768-dimensional feature 
vector and produces semantically meaningful sentence embeddings. The 
resulting embeddings can then be compared using cosine similarity. 


3.3 MODEL TRAINING 


We train the regression models using Gradient Boosting 
algorithms, with 80% training data and 20% testing data (using 5- 
fold cross-validation). We use Quadratic Weighted Kappa (QWK) 
[14] score as the evaluation metric, which measures the agreement 
between system predicted scores and human-annotated scores. 
QWK is the standard evaluation metric to measure the performance 
of an AES system [1]. Although the ASAP dataset has 8 prompts, we 
trained 9 models in total. The reason is that prompt 2 was scored in 
two different domains (Writing Application and Language 
Conventions). Therefore, we must create two separate predictions 
for this essay prompt. We used different 
hyperparameters for each of the nine models. 
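A compact reference implementation of QWK is sketched below for integer scores. It follows the usual definition with quadratic disagreement weights and is not the evaluation script used in the experiments. Real-valued regression outputs are assumed to be rounded to valid scores before calling it.

```python
def quadratic_weighted_kappa(rater_a, rater_b, min_score, max_score):
    """Quadratic weighted kappa between two lists of integer scores."""
    n = max_score - min_score + 1
    if n < 2:
        raise ValueError("QWK needs at least two score levels")
    # Confusion matrix of observed score pairs.
    observed = [[0] * n for _ in range(n)]
    for a, b in zip(rater_a, rater_b):
        observed[a - min_score][b - min_score] += 1
    total = len(rater_a)
    hist_a = [sum(row) for row in observed]
    hist_b = [sum(observed[i][j] for i in range(n)) for j in range(n)]
    weighted_observed = 0.0
    weighted_expected = 0.0
    for i in range(n):
        for j in range(n):
            weight = (i - j) ** 2 / (n - 1) ** 2
            weighted_observed += weight * observed[i][j]
            # Expected count if the two raters scored independently.
            weighted_expected += weight * hist_a[i] * hist_b[j] / total
    if weighted_expected == 0:
        return 1.0
    return 1.0 - weighted_observed / weighted_expected
```

A value of 1 indicates perfect agreement, 0 indicates chance-level agreement, and negative values indicate agreement worse than chance.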


4. RESULTS 


4.1 Model Performance Evaluation 

In this subsection, we evaluated the model performance based on 
the quantitative evaluation criteria as discussed in Section 2. 
Furthermore, we also analyzed the distribution of overall holistic 
scores assigned by human raters and computer. 


4.1.1 Score Agreement Evaluation 

We conducted the quantitative evaluation of our models based on 
the acceptance criteria by Williamson et al. in [1], and the results 
are shown in Table 3. The table describes the performance 
measurements for the scoring model of each prompt. The first 
column is the QWK score, which measures the human-computer 
agreement. The human-human agreement score in the second 


3 https://github.com/jxmorris12/language_tool_python 


column is used to calculate the degradation value. The last column 
contains the standardized mean score difference between human 
and automated scores. 


Table 3 Model Performance Evaluation for ASAP Dataset 


Dataset | QWK (human-computer) | QWK (human-human) | Degradation | Z
1       | 0.7826               | 0.72095           | -0.06165    | 0.0056
2_dom1  | 0.6731               | 0.81413           | 0.14103     | 0.0007
2_dom2  | 0.6715               | 0.80175           | 0.13025     | 0.0394
3       | 0.6887               | 0.76923           | 0.08053     | 0.0272
4       | 0.7736               | 0.85113           | 0.07753     | 0.0094
5       | 0.8065               | 0.7527            | -0.0538     | 0.0229
6       | 0.7985               | 0.77649           | -0.02201    | 0.0102
7       | 0.7771               | 0.72148           | -0.05562    | 0.0023
8       | 0.6668               | 0.62911           | -0.03769    | 0.0147


Based on the above results, we concluded that five models (prompts 
1, 4, 5, 6, and 7) satisfy the quantitative evaluation criteria defined 
as a standard in [1]. Next, we continued our analysis and the 
reliability and validity tests on only these well-performing models 
and ignored the underperforming models (prompts 2_dom1, 
2_dom2, 3, and 8). 


4.1.2 Distribution of overall holistic scores 

We investigated the distribution of the overall holistic scores 
assigned by human raters and the automated scorer. It is important 
to understand the distribution of the scores, especially in relation to 
the outcome of an exam. For this purpose, we grouped the outcome 
into three categories: passing, borderline, and failing. Table 4 
shows the rubric score and resolved score range for the ASAP prompts 
on which our model passed the quantitative performance evaluation 
in the previous subsection. The resolved score is the final score 
after combining the rubric scores from two human raters. Each 
prompt has a different score resolution. In some prompts, if there 
was a difference between scorer 1 and scorer 2, the final score was 
always the higher of the two. In another prompt, the final score was 
the sum of scores from rater 1 and rater 2 if their scores were adjacent. 
If non-adjacent, an expert scorer determined the final score. 


Table 4 ASAP Rubric and Resolved Score Range 


Dataset  | Rubric Score | Resolved Score | Borderline Score
Prompt 1 | 1-6          | 2-12           | 6-8
Prompt 4 | 0-3          | 0-3            | 1
Prompt 5 | 0-4          | 0-4            | 2
Prompt 6 | 0-4          | 0-4            | 2
Prompt 7 | 0-12         | 0-24           | 12

To categorize the exam results, we created a borderline score from 
the resolved score. It is not necessarily the exact middle score 
because some score ranges do not have one. We considered scores below 
the borderline as failing and scores above the borderline as passing. 
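The categorization rule can be sketched as follows, using the borderline scores from Table 4. The function names are hypothetical, and the sketch assumes the scores passed in are resolved (or rounded predicted) integer scores.

```python
from collections import Counter

# Borderline score range (inclusive) per prompt, taken from Table 4.
BORDERLINE = {1: (6, 8), 4: (1, 1), 5: (2, 2), 6: (2, 2), 7: (12, 12)}


def categorize(prompt, score):
    """Map a score to 'failing', 'borderline', or 'passing'."""
    low, high = BORDERLINE[prompt]
    if score < low:
        return "failing"
    if score > high:
        return "passing"
    return "borderline"


def category_frequencies(prompt, scores):
    """Percentage of scores in each category."""
    counts = Counter(categorize(prompt, s) for s in scores)
    return {
        category: 100.0 * counts[category] / len(scores)
        for category in ("passing", "borderline", "failing")
    }
```

Applying `category_frequencies` separately to the scores of each human rater and of the AES model gives the kind of comparison reported in Table 5.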
Table 5 Frequency Comparison for Passing, Borderline, and Failing (in %) 


           |       ASAP 1       |       ASAP 4       |       ASAP 5       |       ASAP 6       |       ASAP 7
           | HR1  | HR2  | AES  | HR1  | HR2  | AES  | HR1  | HR2  | AES  | HR1  | HR2  | AES  | HR1  | HR2  | AES
Passing    | 35.2 | 35.5 | 50.1 | 39.6 | 40   | 45.2 | 37.8 | 39.2 | 45.9 | 59.4 | 59.1 | 65   | 74   | 72.8 | 82
Borderline | 62.7 | 62.5 | 48.8 | 42.7 | 41.9 | 42.3 | 38.5 | 36   | 38.7 | 25.7 | 25.9 | 25.6 | 8.9  | 9.8  | 4.9
Failing    | 2.1  | 2    | 1.1  | 17.7 | 18.1 | 12.5 | 23.7 | 24.8 | 15.4 | 14.9 | 15   | 9.4  | 17.1 | 17.4 | 13.1
Total      | 100  | 100  | 100  | 100  | 100  | 100  | 100  | 100  | 100  | 100  | 100  | 100  | 100  | 100  | 100


As can be seen from Table 5, the use of an automated scorer puts 
the students at an advantage compared to when the exam only 
employs human raters. The pattern is the same in all datasets. 
The AES models assigned scores with higher passing rates 
than both human raters, and the failing rates 
from AES are lower than those of the 
human raters. This finding supports the result in [15], 
which was obtained with a different dataset. 


4.2 Evaluating and Improving the Model 

We examined the performance of the model using three scenarios: 
gibberish answer, paraphrased answer, and off-topic answer. We 
discuss the evaluation results and strategies to improve the system 
in the following sections. 


4.2.1 Off-topic essay 

One way to validate the use of an automated essay scoring system 
is by checking its performance against off-topic answers. For this 
study, we use the ASAP dataset, which has eight sub-datasets (prompts). 
For each model, we used the answers 
to the other seven prompts as the off-topic essays. We randomly 
sampled 50 essays from each of those prompts, resulting in a total of 350 
off-topic essays. Using 5-fold cross-validation, we measured the 
accuracy of the model in predicting the score of the off-topic 
essays, which we assume should get 0 (zero) due to their complete 
irrelevance to the corresponding prompt. 


Table 6 Accuracy of off-topic detection 


Dataset         | Training Data            | Accuracy (%) | QWK
Prompt 1 (2-12) | Original                 | 0%           | 0.7826
                | Original + 350 off-topic | 55.4%        | 0.7031
Prompt 4 (0-3)  | Original                 | 4.3%         | 0.7736
                | Original + 350 off-topic | 91.4%        | 0.7697
Prompt 5 (0-4)  | Original                 | 0.6%         | 0.8065
                | Original + 350 off-topic | 88%          | 0.7951
Prompt 6 (0-4)  | Original                 | 5.80%        | 0.799
                | Original + 350 off-topic | 97%          | 0.787
Prompt 7 (0-24) | Original                 | 0%           | 0.7771
                | Original + 350 off-topic | 45.7%        | 0.7225


We also investigated the change in the QWK scores after 
retraining the model. The motivation is to ensure that the 
retraining process does not degrade the performance of the original model. 
The new model should still perform well in predicting the original 
essay set. 
Table 6 describes the experiment results of the retraining process, 
the improvement in accuracies, and the change in QWK scores. The 
models trained with only the original dataset performed with very 
low accuracies. The highest result is for prompt 6, where 5.8% of the 
off-topic essays are correctly given a score of 0. Prompt 1 and prompt 
7 are the worst, with no correct prediction at all: all 
off-topic essays are graded with scores greater than 0. 


We can observe the effect of including the off-topic essays in the 
training data. The model performance for prompts 4, 5, and 6 
increases drastically. For prompt 6, the accuracy in predicting the 
unseen off-topic essays (test set) reached 97%, with only 
a slight decrease in the QWK score. The improvements are more 
moderate for prompt 1 and especially prompt 7. We assume 
this is due to a larger score range, which for prompt 7 is 0-24. If 
we are less strict about the score 0 (zero) policy for off-topic 
essays, for example by categorizing scores between 0 and 3 as failing 
scores, we obtain a much better accuracy of 81.7%. 
Meanwhile, for prompt 1, the lowest resolved score is 2. However, 
we trained the model to give off-topic essays a score prediction of 0. If 
we treat the score range 0-2 as the failing category, the accuracy 
increases to 93%, because many of the score predictions for the 
off-topic answers in prompt 1 range from 0 to 2. 
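The strict and relaxed accuracy measures differ only in the highest score still counted as a correct rejection. A minimal sketch follows; the function name and the `fail_max` parameter are hypothetical, and rounding the regression output to the nearest integer is an assumption about how predictions are mapped to scores.

```python
def off_topic_accuracy(predicted_scores, fail_max=0):
    """Percentage of off-topic essays whose rounded prediction is at most fail_max.

    fail_max=0 reproduces the strict score-zero policy; a larger value
    treats a whole range of low scores as a failing prediction.
    """
    rejected = sum(1 for score in predicted_scores if round(score) <= fail_max)
    return 100.0 * rejected / len(predicted_scores)


# Hypothetical predictions for four off-topic essays.
predictions = [0.2, 1.4, 2.6, 5.0]
strict = off_topic_accuracy(predictions)               # only a score of 0 counts
relaxed = off_topic_accuracy(predictions, fail_max=3)  # scores 0 to 3 count
```

The relaxed variant matters most for prompts with wide score ranges, where a prediction of exactly zero is harder for a regression model to reach.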


Figure 1 SBERT representation for 8 prompts in ASAP 
dataset (ASAP dataset 1-8; panels: PCA and t-SNE) 


We conclude that the solution for detecting off-topic essays is 
relatively simple. Without using additional features for detecting 
off-topic essays, we found that the SBERT features are very 
helpful for training a scoring model. Using 
PCA and t-SNE, we plotted the SBERT vector representation of the 
essays in all eight ASAP dataset prompts. Figure 1 shows that 
using t-SNE, all prompts are almost perfectly separated, although 
we can see that dataset 7 and dataset 8 are close to each other in 
the middle of the plot. We assume that this is caused by their similar 
prompt topics. As described in Table 1, 
prompt 7 asks the students to write a story 
about patience, while prompt 8 discusses the benefits of 
laughter. Both topics are arguably more closely related to each 
other than to the topics of prompts 1 to 6. 
4.2.2 Gibberish 


For the next input scenario, we want to prevent the system from 
returning undeserved nonzero scores for invalid answers such as 
gibberish. Ideally, any gibberish answer must 
get a score of zero. However, when we tested several 
gibberish answers on our model, the scores were not zero. We provide 
some examples of the inputs in Table 7. In this table, we 
show examples of wrong predictions by the scoring model that 
was trained using ASAP dataset prompt 6. Nevertheless, we can 
also observe similar problems in the other models. 


Table 7 Examples of Wrong Predictions by Prompt 6 


Answer                                                   | Score (0-4)
asdafafdf adjhgladghad                                   | 1
Eyoqtuwrpituauoyego ngbambgagadhkq3124 31794613 hbfka df | 1
orkesbroh                                                | 1

To analyze the reason, we conducted local model interpretation, 
which means that we are interested in understanding which 
variable, or combination of variables, determines the specific 
prediction. We use SHAP (SHapley Additive exPlanations) values 
to help determine the most predictive variables in a single 
prediction [16]. In AES, the system output is a real number. Each 
variable contribution will either increase or decrease the output 
value. One implementation in the SHAP library is TreeExplainer, a 
fast method for obtaining SHAP values for tree ensemble models 
[17]. 


Figure 2 Local interpretation for a single prediction 


From Figure 2, it seems that some of the most important 
interpretable features (number of unique words, answer length, and 
prompt overlap) have the correct effect on the prediction: they play 
a role in decreasing the score. However, there is one feature from 
the SBERT vector representation, sbert_356, that helps the score of 
gibberish answers to increase. Most of the wrong gibberish 
predictions have a similar explanation to the one shown in Figure 
2. The SBERT vector is not interpretable; therefore, we cannot explain 
the reason for this peculiar model behavior. We propose 
two solutions to this problem as follows: 


1. Retrain the model using gibberish data labeled with a score of zero. 


We created 200 gibberish essays, transformed them into the 780- 
dimension vector representation, and then included them in 
the training and testing data. Table 8 shows the performance 
comparison of three model training scenarios. We can observe a 
large improvement in the accuracy of the model in detecting gibberish 
answers and penalizing them with a score of zero. For example, for 
prompt 1, the accuracy increases from 0% to 92.8% by adding only 
100 gibberish essays to the model training. Adding 100 more 
improves the accuracy by a little less than 2%. After changing 
the training data by adding gibberish, we 
want to make sure that the main performance metric (QWK score) 
on the original data is not sacrificed. The results show that in 

4 https://spinbot.com/ 


all prompts, the addition of gibberish data to the training phase did
not harm the performance of the models. The QWK scores changed
only by a very small margin, and the human-computer agreement
remained above the required threshold. In prompts 4 and 7, the final
QWK score even increased with the addition of gibberish to the
training set.


Table 8 Accuracy of Gibberish Detection 


Dataset    Training Data              Gibberish Accuracy (%)   QWK
Prompt 1   Original                   0                        0.7826
           Original + 100 gibberish   92.8                     0.7768
           Original + 200 gibberish   94.4                     0.7683
Prompt 4   Original                   18.6                     0.7736
           Original + 100 gibberish   97.8                     0.779
           Original + 200 gibberish   98.6                     0.7749
Prompt 5   Original                   3.5                      0.8065
           Original + 100 gibberish   98                       0.7986
           Original + 200 gibberish   98.6                     0.8049
Prompt 6   Original                   18.2                     0.799
           Original + 100 gibberish   96                       0.794
           Original + 200 gibberish   97.9                     0.782
Prompt 7   Original                   0                        0.7771
           Original + 100 gibberish   68                       0.7798
           Original + 200 gibberish   73.4                     0.781


2. Use a rule-based mechanism.


This is arguably a simpler solution: it requires no additional model
retraining and could generalize better. The system is configured to
automatically assign a score of zero to any answer that is likely
gibberish, for example by using a library that detects valid English
words. If none of the tokens in the answer is a valid English word,
we consider the answer gibberish. The main drawback is the added
processing time needed to validate each word in the answer, which
depends on the size of the vocabulary.
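
The rule described above can be sketched in a few lines. This is only an illustration under stated assumptions: the tiny `ENGLISH_VOCAB` set and the function names `is_gibberish` and `score_with_rule` are hypothetical stand-ins, and a real system would load a full English word list or use a dedicated word-detection library.

```python
import re

# Hypothetical toy vocabulary (assumption); a real system would load a
# complete English word list instead.
ENGLISH_VOCAB = {"the", "computer", "is", "a", "useful", "tool", "for", "students"}


def is_gibberish(answer, vocab=ENGLISH_VOCAB):
    """Return True when no token in the answer is a known English word.

    An empty answer has no valid token, so it is also treated as gibberish.
    """
    tokens = re.findall(r"[a-z']+", answer.lower())
    return not any(token in vocab for token in tokens)


def score_with_rule(answer, model_score):
    """Override the model score with zero when the rule flags the answer."""
    return 0.0 if is_gibberish(answer) else model_score
```

Membership tests against a Python set take constant time on average, so in this sketch the extra cost grows with the answer length rather than with the vocabulary size.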


4.2.3 Paraphrased Answer 

To further evaluate the performance of the system, we investigated
its reliability by testing whether the model consistently gives the
same score for the same answer. For this experiment, we generated
paraphrased versions of all answers in the dataset and examined
whether the model would predict the same score for each paraphrased
answer. We used an online paraphrasing tool* to generate the
paraphrased versions.


We use Quadratic Weighted Kappa (QWK) to compute the
agreement between the predictions on the original test set and the
predictions on the paraphrased test set. Based on the scores in
Table 9, the agreement is high for all datasets. Therefore, we
conclude that the models perform consistently in predicting the
scores of paraphrased answers.
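
As an illustration of how such an agreement score can be computed, the following is a minimal sketch of the standard QWK formula applied to two lists of integer scores. The function name and the rating bounds passed as arguments are assumptions made for this example, not the code used in the study.

```python
def quadratic_weighted_kappa(scores_a, scores_b, min_rating, max_rating):
    """Quadratic Weighted Kappa between two equally long lists of integer scores.

    Assumes at least two rating levels and that the two lists are not both
    constant at a single rating (otherwise the denominator is zero).
    """
    n = max_rating - min_rating + 1
    observed = [[0] * n for _ in range(n)]
    for a, b in zip(scores_a, scores_b):
        observed[a - min_rating][b - min_rating] += 1

    total = len(scores_a)
    hist_a = [sum(row) for row in observed]
    hist_b = [sum(observed[i][j] for i in range(n)) for j in range(n)]

    numerator = 0.0
    denominator = 0.0
    for i in range(n):
        for j in range(n):
            weight = (i - j) ** 2 / (n - 1) ** 2
            expected = hist_a[i] * hist_b[j] / total
            numerator += weight * observed[i][j]
            denominator += weight * expected
    return 1.0 - numerator / denominator
```

Here `scores_a` would hold the predictions for the original answers and `scores_b` the predictions for their paraphrases; a value of 1 means identical predictions.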


Table 9 Agreement of Prediction between Original and 
Paraphrased Answers 


Dataset    QWK
Prompt 1   0.8109
Prompt 4   0.8674
Prompt 5   0.8645
Prompt 6   0.846
Prompt 7   0.9411


The highest agreement is achieved by the model for prompt 7,
which shows near-perfect agreement with a QWK score of 0.9411.
Although not as high as for prompt 7, the QWK scores for prompts
1, 4, 5, and 6 also indicate very high agreement.


5. CONCLUSION 


The purpose of this research is to highlight the limitations of the
current performance measurement standard for automated essay
scoring. A model that performs well quantitatively, with a high
human-automated score agreement rate, is not necessarily ready for
deployment in real-world usage. We demonstrated that such models
still raise performance concerns under varying input scenarios. We
showed empirical evidence, in the form of very low accuracies, that
these models have difficulty detecting off-topic essays and
gibberish. We also proposed and evaluated several strategies that
successfully improve the performance of the system. In another
scenario, consistency testing, the models already performed quite
well in predicting paraphrased answers, judging from the high
agreement with the predictions on the original answers. While we are
aware that more validity questions remain to be studied, this
research can contribute additional techniques towards a better
holistic evaluation framework for AES.


ABSTRACT


Studying for entrance examinations can be a distressing pe- 
riod for numerous students. Consequently, many students 
decide to attend cram schools to assist them in preparing 
for these exams. For such schools and for all educational 
institutes, it is necessary to obtain the best tools to provide 
the highest quality of learning and guidance. Performance 
prediction is one tool that can serve as a resource for insights 
that are valuable to all educational stakeholders. With ac- 
curate predictions of their grades, students can be further 
guided and fostered in order to achieve their optimal learning 
goals. In this regard, we target middle school students to be 
able to guide them on their educational journey as early as 
possible. We propose a method to predict the students’ per- 
formance in entrance examinations using the comments that 
cram school teachers made throughout the lessons. Teachers
in cram schools observe their students’ behavior closely and
report on the effort each student puts into the subject material.
We show that the teachers’ comments are suitable for constructing
a tool that is capable of predicting students’ grades
efficiently. This is a new method because previous studies
focus on predicting grades mainly using student data such 
as their reflection comments or earlier scores. Experimen- 
tal results show that using readily available feedback from 
teachers can remarkably contribute to the accuracy of stu- 
dent performance prediction. 


Keywords 
text mining, student grade prediction, teacher observation 
reports, machine learning 


1. INTRODUCTION 


“If you could reinvent higher education for the twenty-first
century, what would it look like?” A question like this one
invites many observations about the advantages and issues
of the current state of higher education in the world.
As a matter of fact, this question has been addressed
specifically by the founders of the Minerva Schools at KGI
in the United States. At such innovative universities and
schools, active learning and student engagement with the
material are highly encouraged [6]. Additionally,
the student/teacher ratio is expected to be lower than in
traditional schools for higher teacher effectiveness [7].
Students are assessed and observed closely by their teachers
and can receive written feedback from their teachers
daily. These reports clarify any confusion, reinforce strong
points and give more specific advice and guidance.
Besides, since teachers frequently engage with students,
research has shown that these teachers, especially those with
professional development, can accurately judge and forecast
their students’ computational skills [10].


In this paper, we propose a novel method for predicting stu- 
dents’ performance or final grades. We show that we can 
use reports carefully written by teachers that closely observe 
the students, to construct a grade prediction model. If these 
predictions can be made accurately, it would be an invalu- 
able resource to help the teachers better regulate their stu- 
dents’ learning. Future performance prediction is considered 
a powerful means that can provide all educational stakeholders
with insights that are beneficial to them. Many grade
prediction models have been proposed by researchers in the
last decade [13], but no model has used teacher reports
as far as we know. The teacher reports we use are
provided by a cram school in Japan. Cram schools are 
specialized in providing extra and more attentive education 
for students who want to achieve certain goals, particularly 
studying for high school or university entrance exams [4]. 
To capture the meanings of the teacher reports, we obtain 
vector representations by applying the term-frequency in- 
verse document-frequency (TF-IDF) method and extracting 
BERT embeddings. Our model uses these vectorized reports 
as the explanatory variables for a Gradient Boosting regres- 
sor. The regressor then predicts the students’ scores. Our 
experiment results show that when adding teachers’ reports 
to the regular student exam scores, we can predict their let- 
ter grade with an accuracy up to 62%. To sum up, our 
contributions can be outlined as follows: 


• We propose a new performance prediction method using
teacher observation reports represented using TF-IDF
and BERT.

• We built two main prediction models and compared
their experimental results to show that using teacher
reports has the potential to increase the accuracy of
grade prediction models.


All in all, to the best of our knowledge, this is the first
study to use NLP to mine teacher observation comments
to predict student grades. Our research and experimental
results demonstrate the potential that these unstructured
teacher observation comments have in predicting students’
total scores and final letter grades.


2. RELATED WORK 


The utilization of data mining and machine learning or deep
learning tools to construct predictive models is increasingly
being adopted in many different fields [16]. Needless to say,
the educational field has not been an exception.
Topics in educational data mining vary widely, from course
recommendation systems to automatic assessment [18].
More specifically, an extensive number of studies have been
dedicated to prediction modeling, whether predicting
student grades or performance, such as next-term grade
prediction, or student dropout. These prediction models
are essential since they underlie important
educational AI-based decision-making systems [20]. With
accurate predictions, the performance of students can be
monitored using these systems, and students that have
difficulties in their studies can easily be detected and given
further guidance early on.


Over the past years, several methods have been developed
to predict students’ performance using Natural Language
Processing (NLP) techniques. It has been shown that mining
unstructured text using NLP can contribute more to
accurately predicting students’ success than the
information obtained from the usual fixed-response items [21].
Luo et al. proposed a method to predict student grades
based on their free-style reflection comments collected after
each lesson. The comments were collected according to the
PCN method, which categorizes the students’ comments.
To represent the students’ reflection comments, Word2Vec
embeddings were adopted, followed by an artificial neural
network. Their experiments show a correct rate of 80%.
Teacher or advisor notes have been used by Jayaraman, not 
to predict student grades, but to detect students that are 
at risk of dropping out of college [23]. In their study, they 
use sentiment analysis to extract the positive and negative 
sentiment from the advisors’ notes and use those as features 
to train a model. The model achieves 73% accuracy at iden- 
tifying at-risk students. 


3. DATA DESCRIPTION 


The dataset obtained and used for our model was provided 
by a cram school in Fukuoka, Japan. To ensure confiden- 
tiality, no student names or other identifying data were pre- 
sented. Reports were obtained monthly and sent as CSV 
files. Since our model is focused on predicting the performance
of students in their entrance examinations, we focused
on students in their final year of middle school.
The final dataset after preprocessing comprised 11,960 reports
over the period from May to October for 159 students.


3.1 Monthly Reports 

In addition to the student ID and the class date, each report 
also consisted of the subject code, the teacher’s comments, 
understanding, attitude and homework scores. More data 
in the reports were also provided but were unstructured and 
considered redundant for the prediction model. The fea- 
tures that were extracted from the reports and used in the 


Table 1: Number of Reports in Each Subject

                    Japanese   Math      Science   Social   English
Number of Reports   1157       3547      2428      1669     3159
                    (9.7%)     (29.6%)   (20.3%)   (14%)    (26.4%)


study are discussed in more detail in Section 4.1. However,
our main explanatory variable used in the study is the teachers’
observation comments written in Japanese. The average
length of these comments is 96 characters. In addition, by
analyzing the comments, we observed that teachers tend to
encourage and energize their students by using words such
as “better” and “work on”. Moreover, the words used in the
comments depend on the context or class subject to some
degree. For example, the expression “calculation problem”
is likely to be used in math lessons.


In the cram school, students take different lessons for each
subject. These lessons fall under the 5 main subjects: Japanese,
Mathematics, Science, Social Studies and English. Since the
main objective of our model is to predict a student’s total
score, reports in all 5 subjects are required. Therefore, testing
the model was only possible for those students who attended
classes for all subjects. The number of reports that
fall under each subject is shown in Table 1. The values in
the table show that the most taken lessons, and therefore the
most reports provided, were in Mathematics,
followed directly by English. The total number of reports
for each student varied depending on the classes attended.
The average number of reports recorded per student
was 82, with a maximum of 206 and a minimum of 24.


3.2 Test Scores 


Students attending the cram school were naturally regis- 
tered in many different schools. The results of their regu- 
larly taken examinations at school were recorded and pro- 
vided. These scores were what we considered student data 
and would be traditionally used as the main feature to pre- 
dict their performance in the entrance exam. To teach the 
model to perform these predictions, we adopted the super- 
vised learning method. In supervised learning, training data 
needs to be labeled with the required outputs for each in- 
put. This enables the model to train its learning function by 
altering it based on the correct result so that the function 
can then be applied to new inputs. In our study, we used 
the students’ results in their cram school simulation exams 
as the labels for the model since their actual performance in 
the entrance exam was unattainable. 


The simulation scores for the 159 students were recorded for
all subjects and also provided as the total score. To visualize
the distribution of the students’ scores, histograms were
plotted as shown in Figure 1. The distributions of the
subject scores and of the total score are approximately
bell-shaped and seem symmetric about the mean, so we
assume that the scores follow a normal distribution. The
standard deviation, σ, of each score is displayed in
Table 2 to show how dispersed the values are.


4. METHODOLOGY
4.1 Feature Selection

Figure 1: Distribution of Simulation Test Scores


Table 2: Standard deviation of subject scores

    Japanese   Math    Science   Social   English | Total
σ   11.85      16.62   20.33     16.93    18.47   | 70.01


For our experimental settings, we adopt 3 main feature sets
for the sake of comparison. The first feature set, FS1,
consists of the teachers’ report contents as the main
explanatory variables. A teacher’s report evaluating the
student in one lesson consists of (1) comments, (2) an
understanding score, (3) an attitude score and (4) a homework
score. We use all of these attributes except for the homework
score. This is mainly because more than 36% of the reports did
not include homework scores, since not all lessons necessarily
require homework. After each lesson, the teacher writes some
comments based on their observations and assesses the student,
giving them an understanding score of 0, 30, 60, 80 or 100
and an attitude score of 1, 2, 3 or 4. The second feature
set, FS2, consists of student-related data only, specifically
their gender and the score of their regularly scheduled exam
at school. Since we predict each subject score separately,
the regular score corresponds to the subject score. As for
the students’ gender, the Pearson correlation coefficient
between it and the score is 0.12, while the correlation coefficient
between the regular score and the simulation score is 0.80,
which suggests that the important factor in FS2 is essentially
the student’s regular score and not the gender. Finally,
we investigate using both teachers’ reports and the regular
student scores to verify whether adding teachers’ reports
contributes to the accuracy of the prediction model.
The third feature set, FS3, is essentially a concatenation of
FS1 and FS2. A sample of FS1 is shown in Table 3.
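
For reference, the Pearson correlation coefficient used above to compare gender and regular score against the simulation score can be computed as in the following sketch. The function name and any input values are illustrative assumptions; the study's own data are not reproduced here.

```python
import math


def pearson_r(x, y):
    """Pearson correlation coefficient between two equally long sequences.

    Assumes neither sequence is constant (otherwise the denominator is zero).
    """
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    covariance = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    spread_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    spread_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    return covariance / (spread_x * spread_y)
```

A value near 1 or -1 indicates a strong linear relationship, while a value near 0, such as the 0.12 reported for gender, indicates a weak one.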


4.2 Natural Language Processing 

There are numerous ways to represent text data for a ma- 
chine learning model to convey the original meanings of 
the text and prevent information loss. In our experiments, 
we chose to represent the teachers’ comments using two 
techniques. We used the traditional TF-IDF vectorization
method and compared it with BERT embeddings.


4.2.1 TF-IDF 


The first essential step in transforming text into a numer- 
ical representation is preprocessing the text. This step be- 
gins with tokenization or splitting the sentences into words. 
Tokenization in languages such as English can be done by 
splitting the sentence strings at each space. However, for 
Japanese, this step is merged with the next, which is
morphological analysis, since there are no spaces in Japanese
sentences. We use the fugashi parser for this step, which
is essentially a wrapper for MeCab, a Japanese tokenizer
and morphological analysis tool. Our parser extracts from
each report the following parts of speech: nouns, verbs, aux- 
iliary verbs, adjectives and adverbs. We use the correspond- 
ing terms to these extracted parts of speech to build a bag- 
of-words vector with weights given by the TF-IDF method 
implemented by sklearn [25]. Since the teachers’ comments 
are given in Japanese, we provide the mentioned parser to 
the tokenizer parameter. We also give a list of predefined 
Japanese stop words to the vectorizer. 
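
To make the weighting step concrete, the sketch below computes TF-IDF vectors in plain Python for documents that are already tokenized. It assumes the smoothed IDF and L2 normalization that sklearn documents as the defaults of its TF-IDF vectorizer; the function name `tfidf_vectors` and the English tokens in the usage note are hypothetical, since the real inputs are Japanese terms produced by the parser described above.

```python
import math
from collections import Counter


def tfidf_vectors(tokenized_docs, stop_words=frozenset()):
    """Return (vocabulary, L2-normalized TF-IDF vectors) for tokenized documents.

    Uses raw term counts and the smoothed IDF ln((1 + n) / (1 + df)) + 1.
    """
    docs = [[t for t in doc if t not in stop_words] for doc in tokenized_docs]
    vocab = sorted({t for doc in docs for t in doc})
    n_docs = len(docs)
    # Document frequency: number of documents that contain each term.
    doc_freq = Counter(t for doc in docs for t in set(doc))
    idf = {t: math.log((1 + n_docs) / (1 + doc_freq[t])) + 1.0 for t in vocab}

    vectors = []
    for doc in docs:
        counts = Counter(doc)
        vec = [counts[t] * idf[t] for t in vocab]
        norm = math.sqrt(sum(v * v for v in vec))
        vectors.append([v / norm if norm else 0.0 for v in vec])
    return vocab, vectors
```

For example, with the two toy documents `["calculation", "problem", "careless"]` and `["calculation", "mistake"]`, the term "calculation" appears in both and therefore receives a lower weight in the first document than "problem", which appears only once in the collection.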


4.2.2 BERT 

BERT or Bidirectional Encoder Representations from Trans- 
formers is a new method of pre-training language represen- 
tations presented by Google [26]. BERT obtains state-of- 
the-art results on many NLP tasks. It is a Transformer 
Encoder stack that pre-trains language representations. A 
pre-trained BERT model is basically a general purpose lan- 
guage understanding model trained on a large corpus which 
can then be used for downstream tasks. The BERT model 
we used for the comments was pretrained by Inui Laboratory,
Tohoku University. The corpus they used for pretraining
was Japanese Wikipedia and the model was trained with
the same configuration as the original BERT. In the experi- 
ments shown in this paper, we used the BERT [CLS] token 
embeddings as our BERT embeddings. 


MeCab: https://taku910.github.io/mecab/#parse
Pretrained Japanese BERT: https://github.com/cl-tohoku/bert-japanese


Table 3: A sample of FS1: teachers’ reports (comments originally in Japanese)

Understanding | Attitude | Comments
80            | 4        | We are trying applied problems of resolution into factors.
              |          | You look like making many mistakes carelessly, but know formulas very well.
80            | 4        | We are trying applied problems of resolution into factors.
              |          | You look like making many mistakes carelessly, but know formulas very well.
              |          | He took notes while watching the commentary and focused on the problem.
100           | 4        | If you keep going at this rate, you will be able to meet the target, the 5th time.
              |          | So, let’s do our best!


4.3 Evaluation Metrics

To evaluate our experiments, we use the Mean Absolute Error
(MAE) metric. The MAE is calculated using the following
formula:

$$\mathrm{MAE} = \frac{1}{n} \sum_{i=1}^{n} \left| \mathrm{score}_{pred,i} - \mathrm{score}_{true,i} \right| \tag{1}$$

where $\mathrm{score}_{true,i}$ is the actual score that student $i$ obtained
and $n$ is the number of students.
The predicted score ($\mathrm{score}_{pred,i}$) is calculated differently for
subject scores and total score. For a specific subject $s \in S$,
where $S = \{\text{Japanese, Math, Science, Social Studies, English}\}$,
a student $i$ can attend a variable number $t$ of lessons.
Therefore, to predict the subject score ($\mathrm{SubjectScore}_{pred,i,s}$)
of student $i$ we use each of their reports as independent inputs
to the model and obtain an ordered list $X_{i,s,t}$ of predicted
scores for student $i$. The estimated score for the subject
is then decided using:

$$\mathrm{SubjectScore}_{pred,i,s} = \mathrm{Med}(X_{i,s,t}) =
\begin{cases}
X_{i,s,t}\left[\frac{t+1}{2}\right], & \text{if } t \text{ is odd} \\
\frac{1}{2}\left(X_{i,s,t}\left[\frac{t}{2}\right] + X_{i,s,t}\left[\frac{t}{2}+1\right]\right), & \text{if } t \text{ is even}
\end{cases} \tag{2}$$

where $X_{i,s,t}[k]$ denotes the $k$-th element of the ordered list.


To measure the central tendency, we used the median rather
than the mean as it is robust to skewness and outliers.
Nevertheless, if the estimations follow a normal distribution,
the median would be close to the mean. The total predicted
score ($\mathrm{TotalScore}_{pred,i}$) can then be estimated by:

$$\mathrm{TotalScore}_{pred,i} = \sum_{s \in S} \mathrm{SubjectScore}_{pred,i,s} \tag{3}$$
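
The aggregation in Equations 1 to 3 can be sketched as follows. The function names and the dictionary layout (subject name mapped to a list of per-report predictions) are assumptions made for this illustration only.

```python
from statistics import median


def subject_score(report_predictions):
    """Equation 2: the median of the per-report predictions for one subject."""
    return median(report_predictions)


def total_score(predictions_by_subject):
    """Equation 3: the sum of the per-subject median predictions."""
    return sum(subject_score(p) for p in predictions_by_subject.values())


def mean_absolute_error(predicted, actual):
    """Equation 1: mean absolute difference between predicted and true scores."""
    return sum(abs(p - a) for p, a in zip(predicted, actual)) / len(predicted)
```

`statistics.median` returns the middle element for an odd number of predictions and the mean of the two middle elements for an even number, and it does not require the input list to be sorted beforehand.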


Finally, since students receive letter grades for their total
score, we map the estimated total score to its closest
corresponding letter grade according to the percentages shown
in Table 4 [27]. We then compute the percentage of grades
that are x ticks away from their actual grades. A tick, as
specified by [28], is defined as the difference between two
successive letter grades. We name this metric percentage by
tick accuracy, or PTA. PTA0 stands for the Percentage by 0
Tick Accuracy, which means the model successfully predicted
the letter grade with no error, while PTA1 is the percentage
of incorrectly predicted grades that are 1 tick away from the
true letter grade (e.g. A vs B). A similar metric was used in
previous studies regarding grade prediction models [28].


Table 4: Letter grades and their corresponding percentages 
Grade S A B C D F 
% 90-100 80-89 70-79 60-69 50-59 0-49 
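
A minimal sketch of the grade mapping in Table 4 and of the PTA metric follows. The `max_score` parameter, the handling of fractional percentages at band boundaries, and the function names are assumptions introduced for this example; the source does not state the maximum total score.

```python
# Letter grades ordered from lowest to highest; adjacent entries are 1 tick apart.
GRADES = ["F", "D", "C", "B", "A", "S"]
# Lower percentage bound of each grade band in Table 4, highest band first.
LOWER_BOUNDS = [(90, "S"), (80, "A"), (70, "B"), (60, "C"), (50, "D"), (0, "F")]


def letter_grade(score, max_score):
    """Map a score to a letter grade; max_score is a hypothetical parameter."""
    percentage = 100.0 * score / max_score
    for lower_bound, grade in LOWER_BOUNDS:
        if percentage >= lower_bound:
            return grade
    return "F"


def pta(predicted_grades, true_grades, ticks):
    """Fraction of predictions that are exactly `ticks` grades from the truth."""
    hits = sum(
        1
        for p, t in zip(predicted_grades, true_grades)
        if abs(GRADES.index(p) - GRADES.index(t)) == ticks
    )
    return hits / len(true_grades)
```

With this definition, `pta(pred, true, 0)` corresponds to PTA0 and `pta(pred, true, 1)` to PTA1, so their sum is the share of students predicted within one tick of their true grade.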


5. EXPERIMENTS 
5.1 Model Overview 


In our experiments, we adopt gradient boosting, a composite
machine learning algorithm. We employed its sklearn
implementation, GradientBoostingRegressor, to predict
the continuous value of the students’ scores in each subject.
Since there is no prior research on the effect of using teacher
observation reports in predicting students’ grades, we use
the following method as the baseline in our experiment. At
first, subject codes were unavailable for each teacher
observation record. Therefore, we constructed a model that used
all of each student’s reports, regardless of the subject, to
directly predict and estimate the total score according to
Equation 2. We call this model the Direct model. Subject
codes then became accessible and we were able to map each
report to its corresponding subject. Leveraging that, we
created a separate regression model for each subject’s reports
and estimated the total score as shown in Equation 3. This
model is called the Subjects model.


5.2 Experimental Results 

All experiments in the study were evaluated using group
10-fold cross-validation. The advantage of the group k-fold
cross-validation method is that all data are used for both training
and testing, and each instance is used for testing once, while
all reports of a given student stay on the same side of the
split. This is especially useful in situations where data is
limited. Since the dataset comprises reports for 159 students,
we used 143 students’ reports for each fold as the training set
and 16 as the testing set. The number of reports or instances
for each subject model, therefore, varied depending on how many
lessons each student had attended. The average MAE over all
ten folds, calculated as in Equation 1, was computed
and used as the main evaluation metric. We ran the baseline
Direct model with the 3 feature sets described in Section 4.1.
Teachers’ comments were represented using BERT embeddings.
The performance results are shown in Table 5. Using all 3 feature
sets, the Subjects model consistently outperforms the Direct
baseline model. Specifically, predicting the total score using
the Subjects model with FS3, which uses both teachers’ reports
and student data, resulted in a decrease in MAE of
5.62 relative to the Direct model. Using teachers’ reports alone
(FS1) resulted in a comparatively higher MAE in both models.
However, adding teachers’ reports to student data (FS3) showed
a smaller MAE than using student data only (FS2), which
suggests that teachers’ reports as features can contribute to
the accuracy of the grade prediction model.
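
The grouping idea behind this evaluation can be sketched as follows. This is a simplified round-robin assignment of students to folds, written under the assumption that each report carries a student identifier; it is not sklearn's GroupKFold, which additionally balances fold sizes by the number of samples per group.

```python
def group_k_fold(groups, k):
    """Yield (train_indices, test_indices) so that no group is split across sets.

    `groups` holds one group label (here: a student ID) per report.
    Assumes k does not exceed the number of distinct groups.
    """
    unique_groups = sorted(set(groups))
    # Round-robin assignment of whole groups to the k folds.
    folds = [unique_groups[i::k] for i in range(k)]
    for held_out in folds:
        held_out_set = set(held_out)
        train = [i for i, g in enumerate(groups) if g not in held_out_set]
        test = [i for i, g in enumerate(groups) if g in held_out_set]
        yield train, test
```

With 159 students and k = 10, this assignment produces nine folds that hold out 16 students and one fold that holds out 15, which is in line with the split sizes described above.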


Table 6 shows the MAE, PTA0 and PTA1 of each subject’s
score prediction model. We ran the Subjects model with all 3
feature sets. For FS1 and FS3, we compared the performance
of the two text representations, TF-IDF vectors and BERT
embeddings. Values in bold indicate the leading scores for
each metric in all subjects. In terms of MAE, using FS3
consistently outperforms the other feature sets. It can also
be seen that BERT embeddings tend to have better overall

Table 5: Average MAE of total score prediction with Direct
model vs Subjects model using the 3 feature sets: FS2: student
data, FS1: teacher reports, FS3: FS1 + FS2

           | FS2   | FS1   | FS3
Direct     | 42.73 | 53.81 | 38.91
Subjects   | 36.83 | 52.02 | 33.29


Table 6: Evaluation metric scores in all subjects using the 3 feature sets and comparing between using TF-IDF for text
representation vs using BERT embeddings. Values in bold indicate the best metric value in a specific subject.

           | Japanese          | Math              | Science           | Social Studies    | English           | Total
           | MAE   PTA0  PTA1  | MAE   PTA0  PTA1  | MAE   PTA0  PTA1  | MAE   PTA0  PTA1  | MAE   PTA0  PTA1  | MAE   PTA0  PTA1
       FS2 | 10.32 0.37  0.20  | 10.96 0.53  0.12  | 15.02 0.49  0.12  | 13.43 0.51  0.087 | 12.48 0.58  0.12  | 36.83 0.58  0.15
TF-IDF FS1 | 9.79  0.36  0.22  | 12.53 0.47  0.10  | 17.25 0.37  0.09  | 13.57 0.60  0.01  | 14.93 0.52  0.00  | 54.81 0.47  0.07
TF-IDF FS3 | 9.16  0.38  0.20  | 10.37 0.50  0.16  | 14.07 0.44  0.16  | 12.08 0.56  0.086 | 12.10 0.58  0.13  | 35.19 0.621 0.14
BERT   FS1 | 9.47  0.27  0.23  | 12.36 0.45  0.07  | 16.66 0.40  0.11  | 13.92 0.55  0.02  | 14.51 0.52  0.02  | 52.02 0.49  0.07
BERT   FS3 | 9.32  0.37  0.22  | 10.12 0.52  0.18  | 13.31 0.43  0.18  | 12.00 0.53  0.095 | 10.99 0.62  0.11  | 33.29 0.622 0.17


(Bar chart "Average MAE for Each Subject Score Prediction"; x-axis: Japanese, Math, Science, Social Studies, English)
Figure 2: Average MAE of subject scores across all FS


performance than the TF-IDF vectors. Moreover, running
the Subjects model with FS1 using BERT resulted in a lower
MAE than when using TF-IDF. Finally, when predicting the
total score, using FS3 with BERT held the top scores across
all evaluation metrics.


Figure 2 depicts the performance of each subject separately
in terms of MAE across the three feature sets. It can be
observed that FS3 consistently achieves a lower MAE than FS2
and FS1. In addition, as shown in Figure 3, FS3 also
consistently achieves a higher overall PTA. When predicting the
total score, FS3 shows an increase of 6.2 percentage points in
PTA0 + PTA1 over FS2. These results suggest that teachers’
reports can in fact add value and contribute to grade
prediction models.


6. DISCUSSION 


The results presented in the previous section can be sum- 
marized into the following main points. 

• The highest performance of the grade prediction model
can be achieved by using a concatenation of the two
feature sets, FS1 and FS2.

• When predicting the total score with teachers’ reports,
using BERT embeddings outperforms TF-IDF.

The success of BERT can be attributed to the fact that
the BERT model has been pretrained on huge corpora of
Japanese text data. TF-IDF vectors, on the other hand, only
use the data on hand to produce the representations.
However, an important advantage of TF-IDF is that the
numerical vector representations are computed much faster than
extracting BERT embeddings. To further increase the
accuracy of the prediction model with FS3 and FS1, we
aim to pre-train BERT on the reports of each of the 5 subjects.
It has been shown that pretraining BERT on specific domains
can lead to a significant increase in performance [29].


(Bar chart "Average PTA for Score Prediction"; x-axis: Japanese, Math, Science, Social Studies, English, Total)
Figure 3: A comparison of PTA metric evaluated when using
FS2 and FS3 across all subject scores and total score


7. CONCLUSION 


At educational institutes where students are closely observed 
by their teachers, large amounts of unstructured data exist 
in the form of reports and comments. In this paper, we at- 
tempted to employ and take advantage of these comments 
to help identify students that may need extra guidance or 
attention. Our model used teacher observation comments 
to predict students’ total scores. We applied both TF-IDF 
and BERT embeddings to the observation comments and 
used the vectors as inputs to a gradient boosting regres- 
sor. Three main feature sets were employed in our model, 
teacher-related features, student-related features, and a con- 
catenation of both. The performance of our model on each 
set was then demonstrated. Our experimental results showed 
that the readily available teachers’ reports have the potential 
to create a grade prediction model. Using teachers’ reports 
can increase the accuracy of a grade prediction model that
uses only students’ previous exam scores by 6.2 percentage
points (PTA0 + PTA1). However, there remains room for
improvement in our experiments. We believe that with more
teachers’ comments, the accuracy of our model could increase.
We also plan to enhance the text representations by pretraining
BERT on the teachers’ comments in advance. Additionally, we
intend to experiment with another model architecture that would
focus on classifying the students’ performance first. We hope
that with such well-defined grade prediction models, we can help
guide young students and provide a more focused and personalized
education to them.


ABSTRACT

Massive Open Online Courses (MOOCs) which enable large- 
scale open online learning for massive users have been play- 
ing an important role in modern education for both students 
as well as professionals. To keep users’ interest in MOOCs, 
recommender systems have been studied and deployed to 
recommend courses or videos that a user might be inter- 
ested in. However, recommending courses and videos which 
usually cover a wide range of knowledge concepts does not 
consider user interests or learning needs regarding some spe- 
cific concepts. This paper focuses on the task of recom- 
mending knowledge concepts of interest to users, which is 
challenging due to the sparsity of user-concept interactions 
given a large number of concepts. In this paper, we propose 
an approach by modeling information on MOOC platforms 
(e.g., teacher, video, course, and school) as a Heterogeneous 
Information Network (HIN) to learn user and concept rep- 
resentations using Graph Convolutional Networks based on 
user-user and concept-concept relationships via meta-paths 
in the HIN. We incorporate those learned user and concept 
representations into an extended matrix factorization frame- 
work to predict the preference of concepts for each user. Our 
experiments on a real-world MOOC dataset show that the 
proposed approach outperforms several baselines and state- 
of-the-art methods for predicting and recommending con- 
cepts of interest to users. 


Keywords 
User Modeling, MOOC, Learning Analytics, Knowledge Con- 
cept, Recommender Systems 


1. INTRODUCTION 


MOOCs (Massive Open Online Courses), which are free online 
courses that anyone around the world can enroll in, have 
gained a lot of popularity in the past decade. By the end of 
2018, popular MOOC platforms, including Coursera, had 
provided 11,400 courses with 101 million users/learners on 
those platforms. Previous studies have shown that MOOCs do 
have a real impact [24, 8]. For example, Chen et al. showed 
that 72% of survey respondents reported career benefits and 
61% reported educational benefits. Despite this popularity, 
one main challenge of MOOCs is that the overall completion 
rate of those courses is normally lower than 10% [19, 30]. 
Therefore, understanding and predicting user behaviors and 
learning needs is important to keep users learning on MOOC 
platforms. 


To this end, previous studies have focused on understanding 
dropout or procrastination behavior and rec- 
ommending content such as courses and learning paths that 
a user might be interested in [4]. A MOOC can be 
seen as a sequence of videos where each video is associated 
with some knowledge concepts. For example, a video in a 
computer science MOOC can cover several concepts such as 
“software” and “hardware”. More recently, Gong et al. 
argued that course or video recommendations overlook user 
interests regarding specific knowledge concepts. For exam- 
ple, data mining courses taught by different teachers can be 
quite different in a microscopic view, and a user who is in- 
terested in some specific concepts such as “association rules” 
might be interested in various video clips or learning materi- 
als from different teachers covering those concepts from dif- 
ferent perspectives. Therefore, understanding a user’s learn- 
ing needs from a microscopic view and predicting knowledge 
concepts that the user might be interested in are important. 


In this work, we focus on predicting and recommending 
knowledge concepts that might be interesting to users on 
MOOC platforms. Based on the interaction history be- 
tween users and concepts (i.e., a user has interacted with 
a concept if the user has learned that concept), traditional 
recommendation approaches such as collaborative filtering 
(CF), which recommends similar items (concepts) based 
on a user’s interaction history or interesting items from 
similar users, can be applied. However, the sparsity of 
user-item (user-concept) relationships can limit the performance 
of CF-based methods. In addition to users and concepts, 
MOOC platform data normally contain other entities such 
as courses, videos, and teachers as well as the relationships 
among those entities. 


Figure 1: Different types of entities and relationships 
between two entities in MOOCs. A user u is interested in a 
concept if u has learned or is going to learn it in the future. 


To cope with the sparsity problem, we model those entities 
and their relationships as a heterogeneous information 
network (HIN) consisting of the entities and relationships, 
inspired by [13], which can be used for learning user 
(concept) representations/embeddings by exploring indirect 
user-user (concept-concept) relationships with Graph 
Convolutional Networks (GCNs). Figure 1 illustrates such a HIN, 
which we discuss in detail in Section 3. For example, one 
can derive a homogeneous user graph based on an indirect 
path in the HIN, e.g., a graph with users as nodes and an 
edge between two users if they have taken the same course. 
Given such a homogeneous graph, traditional GCNs can be 
applied to the graph to learn the representations/embeddings 
of users and concepts with respect to the chosen path. 


Based on different indirect paths chosen, we can derive var- 
ious user (concept) representations, and those representa- 
tions of users (concepts) regarding different paths can be 
aggregated, e.g., using the mean of those representations. 
Instead of the straightforward mean aggregation, we propose 
and investigate different attention mechanisms to derive ag- 
gregated user (concept) representations based on different 
paths. The intuition behind using an attention mechanism 
is that different paths might have different importance for 
each user. Afterwards, those learned user and concept rep- 
resentations can be used for predicting the preference scores 
of concepts for recommendations. Our contributions in this 
work are as follows: (1) We propose an end-to-end framework 
for predicting and recommending knowledge concepts 
of a user’s interest in Section 4; (2) We investigate two 
attention mechanisms for aggregating information from different 
meta-paths (the definition can be found in Section 3) 
to derive user and concept representations. We then incorporate 
those representations into our extended matrix factorization 
framework for predicting the preference score of 
a concept with respect to a user; (3) Finally, we evaluate 
our approach with several baselines and state-of-the-art 
approaches in terms of well-established evaluation metrics, and 
show the effectiveness of our proposed approach in Section 6. 


2. RELATED WORK 


Recommender Systems and User Modeling on MOOC 
Platforms. There has been growing interest in recommender 
systems on MOOC platforms since 2013 with respect to different 
aspects such as courses, videos, and learning paths 
[9, 23]. For instance, YouEDU was proposed as a pipeline 
for classifying MOOC forum 
posts and recommending instructional video clips that might 
be helpful for resolving confusion detected in those posts. 
In [21], the authors showed that peer recommendations can 
improve users’ engagement significantly in the context of a 
Project Management MOOC. Dai et al. [9] proposed analyz- 
ing course content for recommending personalized learning 
paths on MOOC platforms. Khalid et al. provide a 
comprehensive survey on recent advances regarding differ- 
ent recommender systems in the context of MOOCs. More 
recently, researchers have started modeling user interests in 
the context of MOOCs while user modeling has been widely 
studied in other domains such as social media [42]. For ex- 
ample, Li et al. investigated the impact of acquiring 
user interests via surveys or questionnaires on course rec- 
ommendations. In 2], the authors proposed LeCoRe which 
exploits user interest modeling for recommending courses as 
well as similar users for promoting peer learning in enterprise 
environment. Gong et al. argued that course recommen- 
dations overlook user interests regarding specific knowledge 
concepts, and studying users’ online learning interests from 
a microscopic view and recommending knowledge concepts 
can capture user interests better and provide the flexibility 
of choosing learning resources of their interest. In this work, 
we also focus on the microscopic view for knowledge concept 
recommendations. 


Recommendation Approaches with HIN. The basic idea 
of early recommendation approaches with HIN is to leverage 
path-based semantic relatedness between users and items 
over HINs, e.g., leveraging meta-path-based similarities for 
recommendation [47]. For example, Shi et al. 
proposed predicting item ratings based on those from similar 
users measured via different meta-paths. With the advances 
of graph representation learning, the authors in pro- 
posed using pre-trained user and item embeddings based on 
meta-path information with random walk, and incorporated 
those pre-trained embeddings as features into an extended 
matrix factorization framework. The most similar work to 
ours is Gong et al. [13], which is one of the first works 
for recommending knowledge concepts on MOOC platforms 
in a heterogeneous view. The authors showed that their 
proposed approach outperforms other CF-based baselines as 
well as metapath2vec [17], which uses learned node representations 
of a given HIN for knowledge concept recommendations 
by measuring the similarities between two nodes. Our 
work differs from [13] in several aspects. First, we formulate 
interacted concepts for each user as implicit feedback, 
while [13] treated the number of clicks as ratings and 
formulated the problem as rating prediction for recommending 
top-k unknown concepts with higher ratings. Secondly, we 
investigate different attention mechanisms, including one 
incorporating the latent features of users (items) from 
matrix factorization. Thirdly, the prediction layer (Eq. 6) for 
estimating the preference score of a concept is different from 
that of [13], which uses the user (item) representations as 
features for the final prediction. 


3. PRELIMINARIES 


In this work, we consider the task of predicting and 
recommending concepts that a user might be interested in based 
on their learning history, which includes a set of learned 
concepts and their contextual information such as courses, 
videos, etc. With $n$ users $U = \{u_1, \dots, u_n\}$ and $m$ 
concepts $C = \{c_1, \dots, c_m\}$, we define an implicit feedback 
matrix $R \in \mathbb{R}^{n \times m}$ with each entry $r_{u,c} = 1$ if $u$ has 
learned $c$ and $r_{u,c} = 0$ otherwise. The task can be framed in 
the context of a HIN, which is denoted as $G = \{\mathcal{V}, \mathcal{E}\}$, consisting 
of an object set $\mathcal{V}$ and a link set $\mathcal{E}$. A HIN is also associated 
with an object type mapping function $\phi : \mathcal{V} \to \mathcal{O}$ and a link 
type mapping function $\psi : \mathcal{E} \to \mathcal{R}$. $\mathcal{O}$ and $\mathcal{R}$ denote the sets 
of predefined object and link types, where $|\mathcal{O}| + |\mathcal{R}| > 2$ [33]. 
The MOOC data in our study can be represented as a HIN. 
The HIN consists of six types of entities: user, concept, 
video, course, school, and teacher. In addition, there 
is a set of links describing the relationships among those 
entities. On top of the definition of HIN, the concept of 
network schema is used to describe the meta structure of a 
network. 


The network schema is denoted as $S = (\mathcal{O}, \mathcal{R})$. It is 
a meta template for an information network $G = \{\mathcal{V}, \mathcal{E}\}$ 
with the object type mapping $\phi : \mathcal{V} \to \mathcal{O}$ and the link type 
mapping $\psi : \mathcal{E} \to \mathcal{R}$, which is a directed graph defined over 
object types $\mathcal{O}$, with edges as links from $\mathcal{R}$. Fig. 1 shows the 
network schema of our MOOC dataset with the six different 
entity types and the semantic links between them. Given 
the network schema, we can extract semantic meta-paths 
between a pair of entities. A meta-path can be formally 
defined as follows: 


A meta-path $MP$ is defined on a network schema $S = (\mathcal{O}, \mathcal{R})$ 
and is denoted as a path in the form of 
$O_1 \xrightarrow{R_1} O_2 \xrightarrow{R_2} \cdots \xrightarrow{R_l} O_{l+1}$, 
which describes a composite relation 
$R = R_1 \circ R_2 \circ \cdots \circ R_l$ between objects $O_1$ and $O_{l+1}$, where $\circ$ 
denotes the composition operator on relations. 


4. PROPOSED APPROACH 

In this section, we introduce our proposed approach MOOCIR 
(MOOC Interest Recommender) based on meta-paths in the 
MOOC HIN. At a high level, our approach extends the matrix 
factorization (MF) $\hat{y}_{u,c} = x_u^T z_c$, where $\hat{y}_{u,c}$ denotes the 
predicted preference score of concept $c$ with respect to user $u$, 
and $x_u$ and $z_c$ refer to latent features of $u$ and $c$, 
respectively. We extend the MF with user (concept) 
representations/embeddings that are learned by applying GCNs to 
meta-path-based graphs. Fig. 2 shows an overview of our 
approach, which consists of four main components. In the 
following, we describe each component in detail. 


Table 1: Meta-paths selected for extracting user-user and 
concept-concept relationships. 

Type      Meta-path 
User      user $\to$ concept $\xrightarrow{-1}$ user 
          user $\to$ course $\xrightarrow{-1}$ user 
          user $\to$ video $\xrightarrow{-1}$ user 
          user $\to$ course $\to$ teacher $\xrightarrow{-1}$ course $\xrightarrow{-1}$ user 
Concept   concept $\to$ user $\xrightarrow{-1}$ concept 
          concept $\to$ course $\xrightarrow{-1}$ concept 

Figure 2: Overview of our proposed approach MOOCIR. 


Meta-path selection. As discussed in Section 3, meta-paths 
provide the capability to derive entity-entity relationships 
through those paths. Similar to previous studies [31], 
we consider user-user and concept-concept relationships via 
different meta-paths. To compare fairly with the approach 
of Gong et al. [13] in our experiments, we use the same set 
of meta-paths as that work. Table 1 summarizes the six 
meta-paths used in our work: four for users and two for concepts. 
For each meta-path, a homogeneous graph with respect to 
users (concepts) can be extracted, which is depicted as its 
corresponding adjacency matrix in Fig. 2. As one might 
expect, each entry in the adjacency matrix $A$ regarding a 
meta-path is equal to one if two users (concepts) can be 
connected via that meta-path, and zero otherwise. Afterwards, 
we can learn user (concept) representations for each 
meta-path using GCNs. 
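
The construction of such an adjacency matrix can be sketched as follows for a two-hop meta-path such as user -> course -> user. This is only an illustration under stated assumptions: the function name `metapath_adjacency` and the toy user-course matrix are hypothetical, and zeroing the diagonal (so that self-loops are added once, later, in the GCN) is our assumption rather than a detail stated above.

```python
import numpy as np

def metapath_adjacency(incidence: np.ndarray) -> np.ndarray:
    """Binary adjacency for a two-hop meta-path X -> Y -> X.

    incidence[i, j] = 1 if entity i of type X is linked to entity j of type Y.
    Two X-entities are connected if they share at least one Y-entity.
    """
    shared = incidence @ incidence.T        # number of shared Y-entities per pair
    adjacency = (shared > 0).astype(int)    # one if connected via the meta-path
    np.fill_diagonal(adjacency, 0)          # assumption: no self-loops at this stage
    return adjacency

# Hypothetical toy data: 3 users x 2 courses.
user_course = np.array([[1, 0],
                        [1, 1],
                        [0, 0]])
A_user = metapath_adjacency(user_course)
print(A_user)
```

Users 0 and 1 share a course and are therefore connected, while user 2 has no course and stays isolated.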


Graph Convolution Networks (GCNs). GCNs learn node 
representations of a graph by inspecting neighboring nodes. 
In this work, we adopt the following layer-wise propagation 
rule to learn user (concept) representations/embeddings 
with respect to a meta-path: 


$h^{l+1} = g(P h^{l} W^{l})$ (1) 


where $g(\cdot)$ is an activation function, for which we use ReLU 
here. $P = \tilde{D}^{-1} \tilde{A}$, where $\tilde{D}$ is the diagonal node degree 
matrix of $\tilde{A}$ used to normalize $\tilde{A}$, and $\tilde{A} = A + I$ is 
an adjacency matrix with self-loops in a graph based on a 
specific meta-path. $W^{l}$ refers to a trainable weight matrix at 
layer $l$ for all nodes. $h^{0}$ can be fed with features of each node 
if there is a set of features for each node, or can be initialized 
and learned afterwards as well. The output representation of 
the last layer can be used as user (concept) representations. 
For example, when $l = 2$, the representation of a user $u$ 
for a meta-path $MP_i$ will be $e_u^{MP_i} = h^{l}_{u,MP_i}$, where $h^{l}_{u,MP_i}$ 
is the output of the last layer of GCNs for the meta-path 
$MP_i$ with respect to $u$. In our study, we use a single-layer 
GCN where $h^{0}$ is initialized randomly and learned during 
the training process, but one can easily extend it with more 
layers or use existing features for $h^{0}$. 
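
A minimal NumPy sketch of one propagation step of Eq. 1 is given below. It only illustrates the forward computation; the function name `gcn_layer`, the toy adjacency matrix, and the embedding sizes are hypothetical, and no training is performed.

```python
import numpy as np

def gcn_layer(A: np.ndarray, H: np.ndarray, W: np.ndarray) -> np.ndarray:
    """One step h^{l+1} = ReLU(P h^l W^l) with P = D~^{-1} A~ and A~ = A + I."""
    A_tilde = A + np.eye(A.shape[0])              # add self-loops
    degree = A_tilde.sum(axis=1, keepdims=True)   # row degrees, >= 1 due to self-loops
    P = A_tilde / degree                          # row normalization D~^{-1} A~
    return np.maximum(P @ H @ W, 0.0)             # ReLU activation

rng = np.random.default_rng(0)
A = np.array([[0.0, 1.0, 0.0],
              [1.0, 0.0, 0.0],
              [0.0, 0.0, 0.0]])      # toy meta-path-based user graph
H0 = rng.normal(size=(3, 4))         # randomly initialized h^0
W0 = rng.normal(size=(4, 2))         # weight matrix of the single layer
E = gcn_layer(A, H0, W0)             # one embedding row per user
```

In this toy graph, users 0 and 1 have identical neighborhoods once self-loops are added, so they receive identical embeddings from a single layer.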


Attention. The attention mechanism is motivated by how 
we pay visual attention to different regions of an image or 
relevant words in one sentence, and has been used widely to 
advance various fields such as natural language processing 
and recommender systems [37, 13]. In our context, different 
meta-paths can have different importance with respect to 
each user, and incorporating the importance of each meta-path 
differently for each user can be beneficial when 
aggregating user representations from different meta-paths, i.e., 
$\{e_u^{MP_1}, \dots, e_u^{MP_{|MP|}}\} \to e_u$. In our work, we apply the 
attention mechanism from [6] for our context as follows: 


$\alpha_u^{MP_i} = \frac{\exp(v_u^T \sigma(W_u e_u^{MP_i}))}{\sum_{j \in |MP|} \exp(v_u^T \sigma(W_u e_u^{MP_j}))}$ (2) 


where the output $\alpha_u^{MP_i}$ indicates the weight (or importance) 
for $e_u^{MP_i}$, and $v_u^T$ and $W_u$ are trainable matrices for users. 
The attention mechanism can be formulated in the same 
manner for concepts. Next, the user representations coming 
from different meta-paths can be aggregated as follows: 


$e_u = \sum_{j \in |MP|} \alpha_u^{MP_j} e_u^{MP_j}$ (3) 
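
The attention weights of Eq. 2 and the aggregation of Eq. 3 can be sketched for a single user as follows. This is an illustrative forward pass only: the function name `attention_aggregate` and all shapes are hypothetical, and we assume here that $\sigma$ is the sigmoid function, which the description above does not specify.

```python
import numpy as np

def attention_aggregate(E_mp: np.ndarray, W_u: np.ndarray, v_u: np.ndarray):
    """E_mp holds one embedding per meta-path (rows) for a single user.

    Returns the attention weights (Eq. 2) and the aggregated embedding (Eq. 3).
    """
    hidden = 1.0 / (1.0 + np.exp(-(E_mp @ W_u.T)))   # sigma(W_u e), sigmoid assumed
    scores = hidden @ v_u                            # v_u^T sigma(...), one per meta-path
    scores = scores - scores.max()                   # shift for numerical stability
    alpha = np.exp(scores) / np.exp(scores).sum()    # softmax over meta-paths
    return alpha, alpha @ E_mp                       # weighted sum of embeddings

rng = np.random.default_rng(1)
E_mp = rng.normal(size=(4, 3))   # 4 user meta-paths, embedding size 3 (toy values)
W_u = rng.normal(size=(5, 3))    # hidden size 5 is an arbitrary choice
v_u = rng.normal(size=5)
alpha, e_u = attention_aggregate(E_mp, W_u, v_u)

# With all-zero W_u every meta-path gets the same score, so the result is the mean.
alpha_uniform, e_mean = attention_aggregate(E_mp, np.zeros((5, 3)), np.ones(5))
```

The uniform case recovers the plain mean aggregation, which corresponds to treating all meta-paths equally.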


The above-mentioned attention mechanism takes into account 
different meta-paths but does not consider any context 
in the extended MF, such as the latent features 
of users and concepts for MF. Therefore, we also investigate 
the following attention mechanism, which considers the 
latent features $x_u$ of a user and has not been explored in 
previous studies. In this case, a meta-path-based embedding 
$e_u^{MP_i}$ and $x_u$ are concatenated when calculating the 
attention scores as follows: 


$\alpha_u^{MP_i} = \frac{\exp(v_u^T \sigma(W_u [e_u^{MP_i}; f(x_u)]))}{\sum_{j \in |MP|} \exp(v_u^T \sigma(W_u [e_u^{MP_j}; f(x_u)]))}$ (4) 


where $f(x_u)$ applies non-linearity with a single-layer 
feed-forward neural network to $x_u$ instead of using it directly, 
which is inspired by prior work in which the authors showed 
that non-linear fusion is required when combining latent 
features from matrix factorization and entity embeddings from 
GCNs. Afterwards, the final user representation can be 
obtained in the same manner as Eq. 3: 


$e_u = \sum_{j \in |MP|} \alpha_u^{MP_j} e_u^{MP_j}$ (5) 


The attention mechanism can be formulated in the same 
manner for concepts. 


Prediction. Given those learned user and concept 
representations/embeddings $e_u$ and $e_c$, the preference score of 
a concept $c$ for a user $u$ can be calculated as follows by 
extending the matrix factorization framework: 


$\hat{y}_{u,c} = x_u^T z_c + \gamma \cdot e_u^T M e_c + b_c$ (6) 


where $\hat{y}_{u,c}$ is the preference score, $x_u$ and $z_c$ are the latent 
features for the matrix factorization, and $b_c$ is a bias term. 
In addition, $M$ is a trainable matrix to bring $e_u$ into the same 
space as $e_c$, and $\gamma$ is a trainable parameter for the trade-off 
between the prediction scores from matrix factorization 
and the user and concept embeddings. 
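
As a small worked example of Eq. 6, the following sketch computes the preference score for hand-picked toy values. The function name `preference_score` and all numbers are hypothetical and chosen only so that the arithmetic is easy to follow.

```python
import numpy as np

def preference_score(x_u, z_c, e_u, e_c, M, gamma, b_c):
    """y_hat = x_u^T z_c + gamma * e_u^T M e_c + b_c (Eq. 6)."""
    mf_term = x_u @ z_c             # matrix factorization part
    embedding_term = e_u @ M @ e_c  # meta-path-based representation part
    return mf_term + gamma * embedding_term + b_c

x_u = np.array([1.0, 2.0])   # latent features of the user
z_c = np.array([3.0, 4.0])   # latent features of the concept
e_u = np.array([1.0, 0.0])   # aggregated user embedding
e_c = np.array([2.0, 5.0])   # aggregated concept embedding
M = np.eye(2)                # identity used here only for readability
score = preference_score(x_u, z_c, e_u, e_c, M, gamma=0.5, b_c=0.1)
```

Here the MF term is 11, the embedding term is 2 and is scaled by 0.5, and the bias adds 0.1, giving a score of 12.1.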


4.1 Training Details 


Loss function. We use the Bayesian Personalized Ranking 
(BPR) loss, which has been widely used for recommender 
systems with implicit feedback [27]. The intuition behind 
BPR is that a learned concept for a user should be ranked 
higher (with a higher score) compared to a random one in 
the list of concepts with which the user has not interacted, 
which can be formulated as follows: 


$L = \sum_{(u,i,j) \in D_s} -\ln(\sigma(\hat{y}_{uij})) + \lambda \|\Theta\|^2$ (7) 


where $(u, i, j)$ refers to a triplet including a user $u$, an 
interacted concept $i$, and an unknown concept $j$ for the user. 
$\hat{y}_{uij} = \hat{y}_{ui} - \hat{y}_{uj}$ measures the preference difference between 
the interacted concept and the unknown one, $\sigma$ denotes the 
sigmoid function $\sigma(x) = \frac{1}{1 + e^{-x}}$, $\lambda$ is the regularization 
parameter for the L2 norm, and $\Theta$ denotes the set of 
parameters to be learned. The training set $D_s$ can be constructed 
by pairing an unknown concept randomly with an interacted 
concept in the training set of a user. 
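
The triplet construction and the loss of Eq. 7 can be sketched as below. This is not the released training code: the helper names `sample_triplets` and `bpr_loss` are hypothetical, and the sampler assumes every user has at least one unknown concept.

```python
import random
import numpy as np

def sample_triplets(user_concepts, num_concepts, seed=0):
    """Pair each interacted concept i of a user u with a random unknown concept j."""
    rng = random.Random(seed)
    triplets = []
    for u, interacted in user_concepts.items():
        for i in sorted(interacted):
            j = rng.randrange(num_concepts)
            while j in interacted:          # resample until j is unknown to u
                j = rng.randrange(num_concepts)
            triplets.append((u, i, j))
    return triplets

def bpr_loss(pos_scores, neg_scores, params, lam=1e-8):
    """Sum of -ln(sigmoid(y_ui - y_uj)) plus lam times the squared L2 norm."""
    diff = pos_scores - neg_scores
    # -ln(sigmoid(x)) equals ln(1 + exp(-x)); logaddexp keeps this stable
    ranking = np.logaddexp(0.0, -diff).sum()
    regularization = lam * sum(float(np.sum(p ** 2)) for p in params)
    return ranking + regularization

history = {0: {1, 2}, 1: {0}}                     # toy user -> interacted concepts
triplets = sample_triplets(history, num_concepts=6)
tied = bpr_loss(np.array([2.0, 1.0]), np.array([2.0, 1.0]), params=[], lam=0.0)
better = bpr_loss(np.array([3.0, 2.0]), np.array([2.0, 1.0]), params=[], lam=0.0)
```

When the interacted and unknown concepts receive equal scores, each triplet contributes ln 2; the loss decreases as interacted concepts are scored higher.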


To learn the parameters of our proposed approach for 
minimizing the loss in Eq. 7, we use mini-batch gradient 
descent with 1,024 as the batch size, and use the Adam update 
rule to train the model using the training set. In addition, 
the learning rate is set as 0.01, the regularization 
parameter $\lambda$ is set as 1e-8, and the dimension of latent 
features for MF and that of user (concept) embeddings are 
set as 30 and 100, respectively, as in [13]. 


To overcome the overfitting problem, we further construct a 
validation set by using the last interacted concept for each 
user, and randomly pair each known concept with 99 unknown 
concepts. We run 500 epochs, by which point convergence 
is observed, and monitor the performance of evaluation 
metrics (see Section 5) on the validation set. In the end, we 
choose the best-performing model on the validation set in 
terms of MRR (Mean Reciprocal Rank), which is one of 
the evaluation metrics measuring how well a ground truth 
concept is ranked in the corresponding set of 100 concepts. 
Any other evaluation metric can be used for choosing the 
best-performing model as well, based on the preference for a 
specific metric. 
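
The pairing of one known concept with 99 unknown concepts can be sketched as follows. The function name `build_candidate_set`, the integer concept ids, and the fixed seed are hypothetical choices for illustration.

```python
import random

def build_candidate_set(positive, interacted, num_concepts, num_negatives=99, seed=0):
    """Return [positive] followed by randomly sampled concepts unknown to the user."""
    rng = random.Random(seed)
    unknown = [c for c in range(num_concepts)
               if c != positive and c not in interacted]
    return [positive] + rng.sample(unknown, num_negatives)

interacted = {3, 7, 42}     # toy history of one user
candidates = build_candidate_set(positive=42, interacted=interacted, num_concepts=500)
```

A model is then scored on how highly it ranks the first entry of each candidate set among the 100 concepts.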


5. EXPERIMENTAL SETUP 


MOOC Dataset. We use the MOOCCube dataset from 
the XuetangX platform for our experiments. The MOOCCube 
dataset is one of the largest and most comprehensive MOOC 
datasets, and provides rich information about MOOCs and 
user activities on the platform from 2017 to 2019 [39]. Each 
course or video has a set of covered knowledge concepts in 
the dataset. In this work, we use user activities from 
2017-01-01 to 2019-10-31 for training and those from 2019-11-01 
to 2019-12-31 for testing. We limit the users to those who 
have learned concepts in both training and testing periods 
and have at least one new concept (which did not appear in 
the training period) in the testing period. Overall, the 
dataset consists of 2,005 users, 21,037 concepts, 600 courses, 
22,403 videos, 137 schools, 138 teachers, and the 
relationships among those entities. In total, there are 930,553 
interactions between users and concepts, with 858,072 
interactions in the training set and the rest (72,481) in the 
test set. The overall statistics of the dataset are presented 
in Table 2. 


Table 2: Statistics of the MOOCCube dataset for experiments. 

Entities    Statistics    Relations         Statistics 
users       2,005         user-concept      930,553 
concepts    21,037        user-course       13,696 
courses     600           course-video      42,117 
videos      22,403        teacher-course    1,875 
schools     137           video-concept     295,475 
teachers    138           course-concept    150,811 
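
The time-based split and the user filter described above can be sketched as follows. The tuple layout `(user, concept, date)` and the function names are hypothetical; the actual MOOCCube files use their own schema, which is not reproduced here.

```python
from datetime import date

TRAIN_END = date(2019, 10, 31)   # last day of the training period

def split_by_date(interactions):
    """Split (user, concept, date) tuples into training and testing periods."""
    train = [x for x in interactions if x[2] <= TRAIN_END]
    test = [x for x in interactions if x[2] > TRAIN_END]
    return train, test

def eligible_users(train, test):
    """Users active in both periods with at least one new concept in the test period."""
    train_concepts, test_concepts = {}, {}
    for u, c, _ in train:
        train_concepts.setdefault(u, set()).add(c)
    for u, c, _ in test:
        test_concepts.setdefault(u, set()).add(c)
    return {u for u in train_concepts.keys() & test_concepts.keys()
            if test_concepts[u] - train_concepts[u]}

interactions = [
    ("u1", "c1", date(2018, 5, 1)), ("u1", "c2", date(2019, 11, 15)),
    ("u2", "c1", date(2019, 1, 1)), ("u2", "c1", date(2019, 12, 1)),
    ("u3", "c3", date(2019, 12, 5)),
]
train, test = split_by_date(interactions)
users = eligible_users(train, test)
```

In this toy example only u1 is kept: u2 learns nothing new in the testing period, and u3 has no training history.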


Evaluation Metrics. We evaluate the top-k predictions of 
concepts for users with the following widely used evaluation 
metrics, where k is set to 5, 10, and 20. We calculate all 
metrics for each set of 100 concepts (with one interacted 
and 99 unknown) in the test set. For each interacted 
concept with respect to a user $u$, we generate the corresponding 
recommendation list $R_u = \{r_1, r_2, \dots, r_k\}$, where $r_i$ 
indicates the concept ranked at the $i$-th position in $R_u$ based on 
the predicted scores of those concepts. 


Hit Ratio of top-k concepts (HR@k) measures the fraction 
of relevant concepts in the test set that are in the top-k 
concepts of the recommendations: 
$HR@k = \frac{1}{N} \sum_{u} I(|R_u \cap T_u|)$, 
where $N$ is the total number of sets for testing, $T_u$ denotes 
the relevant concepts of the corresponding set, and $I(x)$ is an 
indicator function which equals one if $x > 0$ and equals 
zero otherwise. Normalized Discounted Cumulative Gain 
(nDCG@k) takes into account rank positions of the relevant 
concepts, and can be computed as follows: 
$nDCG@k = \frac{1}{Z} DCG@k = \frac{1}{Z} \sum_{j=1}^{k} \frac{I(|\{r_j\} \cap T_u|)}{\log_2(j+1)}$, 
where $Z$ denotes the score obtained by an ideal top-k ranking, 
which serves as a normalization factor. Mean Reciprocal Rank 
(MRR) is the average of the reciprocal ranks of positive concepts: 
$MRR = \frac{1}{N} \sum_{i=1}^{N} \frac{1}{rank_i}$, where $rank_i$ refers to the rank 
position of the one interacted concept in the corresponding set 
of 100 concepts, with the rest being unknown ones. 


We use the paired t-test for testing the significance, where the 
significance level $\alpha$ is set to 0.05 unless otherwise noted. 
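
The three evaluation metrics can be sketched for the setting used here, where each test set contains exactly one interacted concept. The function names are hypothetical, and breaking score ties against the interacted concept is our assumption. With a single relevant concept the ideal DCG is 1, so no further normalization is needed.

```python
import math

def rank_of_positive(scores, positive_index=0):
    """1-based rank of the interacted concept; ties are counted against it."""
    positive = scores[positive_index]
    higher = sum(1 for i, s in enumerate(scores)
                 if i != positive_index and s >= positive)
    return 1 + higher

def hr_at_k(ranks, k):
    """Fraction of test sets whose interacted concept is ranked within the top k."""
    return sum(1 for r in ranks if r <= k) / len(ranks)

def ndcg_at_k(ranks, k):
    """Average of 1 / log2(rank + 1) for hits within the top k (ideal DCG is 1)."""
    return sum(1.0 / math.log2(r + 1) for r in ranks if r <= k) / len(ranks)

def mrr(ranks):
    """Mean of the reciprocal ranks of the interacted concepts."""
    return sum(1.0 / r for r in ranks) / len(ranks)

ranks = [1, 3, 12]   # toy ranks of the interacted concept in three test sets
print(hr_at_k(ranks, 5), ndcg_at_k(ranks, 5), mrr(ranks))
```

For these toy ranks, two of the three sets are hits at k = 5, and the rank-3 hit contributes 1 / log2(4) = 0.5 to the nDCG sum.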


5.1 Compared Methods 


To better understand and investigate the contribution of 
each component and the performance with the two attention 
mechanisms introduced in Section 4, we first compare 
several variants of our approach. MOOCIR_a1 denotes our 
approach with the attention mechanism only considering 
different meta-paths using Eq. 3. MOOCIR_a2 refers to our approach 
with the attention mechanism incorporating the latent 
features of users (concepts) using Eq. 4. MOOCIR_a- is a variant 
of our approach without any attention, i.e., different 
meta-paths are treated equally and the representations learned 
from those paths are averaged. MOOCIR_mf- refers to a variant 
without the matrix factorization part for prediction in Eq. 6, 
which only uses meta-path-based user and concept 
representations for predicting the preference score of a concept. 


Next, we compare MOOCIR with the following baselines and 
state-of-the-art methods to evaluate the performance of 
recommending knowledge concepts for users. TopPop is a 
straightforward baseline method which ranks concepts based on 
their popularity. Here, the popularity of a concept is 
measured by the number of users that have learned 
the concept. MFBPR is a matrix factorization approach 
which optimizes a pairwise ranking loss for the 
recommendation task as our approach does, but without 
meta-path-based representation learning. That is, the second 
component in Eq. 6 based on user (concept) representations is 
removed. FISM [17] is an item-to-item collaborative filtering 
approach which provides recommendations based on the average 
embeddings of all interacted concepts and the embeddings of 
the target concept. NAIS is also an item-to-item 
collaborative filtering approach, but with an attention mechanism, 
which is capable of distinguishing which historical items in a 
user profile are more important for a prediction. We use the 
authors’ implementation for both NAIS and FISM. 
metapath2vec [17] is a meta-path-based representation learning 
model which leverages meta-path-based random walks to 
construct the heterogeneous neighborhood of 
a node and then leverages a heterogeneous skip-gram model 
to learn node embeddings. We use the StellarGraph 
implementation of metapath2vec for our experiment, in which 
the parameters of metapath2vec are set the same as in prior 
work, except that the number of random walks is set to 500 
instead of 100. ACKRec also models the MOOC dataset as a 
HIN and extracts user (concept) representations from the 
same set of meta-paths in Table 1. However, ACKRec treats 
the problem as a rating prediction task where the rating of 
a concept for a user is the number of interactions between 
the user and the concept. Also, it exploits user and concept 
representations as features while extending the matrix 
factorization framework. We use the authors’ implementation 
for our experiments. MFBPR and the MOOCIR variants are 
implemented using TensorFlow [2]. All experiments are run 
on an Intel(R) Core(TM) i5-8365U processor laptop with 
16GB RAM, and MOOCIR variants take less than two days 
for training. 


6. RESULTS 
Table 3 summarizes the results using the variants of MOOCIR. 
As we can see from the table, MOOCIR_mf-, which uses user 
and concept representations learned based on meta-paths 
with the HIN but without the matrix factorization 
component, provides worse performance compared to the other 
variants. The results indicate that extending the matrix 
factorization is necessary for MOOCIR. 


Next, we compare MOOCIR_a- and the variants with attention 
mechanisms (i.e., MOOCIR_a1 and MOOCIR_a2). We observe 
that both MOOCIR_a1 and MOOCIR_a2 outperform MOOCIR_a- in 
terms of all evaluation metrics, which shows that using 
attention can indeed improve the performance and that different 
meta-paths have different importance for deriving user 
(concept) representations. This can be verified further by 
investigating learned attention weights for different meta-paths 
in MOOCIR as well. For example, Fig. 3 shows a heatmap 
regarding the learned attention weights for 100 randomly 
selected users using MOOCIR. In the figure, the x-axis refers to 
the 100 users and the y-axis indicates the attention weights for 
the four different meta-paths for users described in Table 1 
in Section 4. From the figure, we can notice that the first 
meta-path (i.e., user $\to$ concept $\xrightarrow{-1}$ user) overall has a 
higher weight compared to others. In addition, we observe 
that the attention weights vary across users, which indicates 
the importance of each meta-path varies for different users. 


Table 3: Performance of several variants of our proposed 
approach in terms of different evaluation metrics, with the 
best-performing scores marked with *. 

                  HR                        nDCG 
Variant       k=5     10      20        k=5     10      20        MRR 
MOOCIR_mf-    0.676   0.812   0.906     0.499   0.543   0.567     0.468 
MOOCIR_a-     0.701   0.832   0.920     0.513   0.556   0.578     0.477 
MOOCIR_a1     0.704*  0.836   0.922*    0.520*  0.562*  0.584*    0.484* 
MOOCIR_a2     0.703   0.838*  0.922*    0.517   0.561   0.583     0.482 


Figure 3: Attention weights of different meta-paths for 100 
randomly chosen users learned using MOOCIR_a1, where we 
observe different weights of meta-paths for each user. In this 
heatmap (x-axis: randomly selected 100 users; y-axis: 
meta-paths for users), a darker cell indicates a higher attention 
weight. 


Footnotes to Section 5.1: The implementation of NAIS and FISM 
is available at 
https://github.com/AaronHeee/Neural-Attentive-Item-Similarity-Model. 
For metapath2vec, using 1,000 random walks took more than 
10 days for training and did not improve the performance 
compared to using 500. The implementation of ACKRec is 
available at https://github.com/JockWang/ACKRec. 


Finally, by comparing the two different attention 
mechanisms, we observe that the one incorporating the latent 
features of users and concepts (Eq. 4) does not improve the 
performance compared to the simpler one (Eq. 3), which 
is contrary to our assumption. Instead, we observe that 
MOOCIR_a2 performs significantly worse than MOOCIR_a1 in terms 
of HR@10 and HR@20 for the users who have interacted 
with a limited number of concepts. Table 4 shows the 
performance for three groups of users with fewer than 150, 350, 
and 550 concepts, respectively. As we can see from the 
table, MOOCIR_a1 outperforms MOOCIR_a2 significantly for the 
first group of 353 users. The results suggest that fusing 
information from the latent features of users (concepts) into 
the attention mechanism is a non-trivial task, and other 
approaches should be investigated in the future. 


Table 4: Results of HR@10 and HR@20 for MOOCIR_a1 and 
MOOCIR_a2 for three groups of users (G150, G350, G550) with 
fewer than 150, 350, and 550 concepts in the training set. 

                  HR@10                     HR@20 
Variant       G150    G350    G550      G150    G350    G550 
MOOCIR_a1     0.806   0.830   0.851     0.894   0.911   0.927 
MOOCIR_a2     0.801   0.829   0.852     0.886   0.908   0.925 


Table 5: Performance of MOOCIR_a1 and compared methods in 
terms of different evaluation metrics, with the best-performing 
scores marked with *. 

                  HR                        nDCG 
Method        k=5     10      20        k=5     10      20        MRR 
TopPop        0.486   0.629   0.767     0.343   0.390   0.425     0.332 
MFBPR         0.668   0.811   0.907     0.481   0.527   0.552     0.448 
FISM          0.584   0.701   0.800     0.438   0.476   0.501     0.418 
NAIS          0.568   0.691   0.811     0.420   0.461   0.491     0.403 
metapath2vec  0.642   0.773   0.873     0.468   0.511   0.537     0.440 
ACKRec        0.659   0.764   0.842     0.503   0.538   0.557     0.475 
MOOCIR_a1     0.704*  0.836*  0.922*    0.520*  0.562*  0.584*    0.484* 


Overall, MOOCIR_a1 provides the best performance among all 
variants. In the following, we discuss the performance of 
MOOCIR_a1 compared with other baselines and state-of-the-art 
methods. 


Table 5 shows the performance of MOOCIR_a1 and compared 
methods. We first observe that all the other methods 
outperform TopPop, which is a baseline method recommending 
popular concepts. For example, MOOCIR_a1 and ACKRec 
improve MRR over TopPop by 45.8% and 43.1%, respectively. 
Among all the compared methods in Table 5, MOOCIR_a1 
provides the best performance, followed by ACKRec, MFBPR, and 
metapath2vec. Among the compared methods, ACKRec performs 
best in terms of nDCG and MRR, and MFBPR performs best in 
terms of HR. In detail, a significant improvement of 
MOOCIR_a1 over ACKRec in MRR (+1.9%), nDCG@5 (+3.1%), 
nDCG@10 (+4.5%), and nDCG@20 (+4.8%) can be noticed 
($\alpha$ < 0.01). Compared to MFBPR, MOOCIR_a1 improves the 
HR scores by 6.7%, 9.2%, and 9.4% when k = 5, 10, and 20, 
respectively ($\alpha$ < 0.01). The two item-item CF methods (FISM 
and NAIS) do not perform well compared to MFBPR and 
ACKRec. One possible explanation is the sparsity 
of the dataset, which makes deriving item-item similarities 
based on interacted users for each item challenging 
and limits the performance. 
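
The relative MRR improvements quoted above can be reproduced from the rounded MRR values in Table 5 with the short calculation below. The helper name `relative_improvement` is hypothetical, and only the MRR column is used.

```python
def relative_improvement(new: float, base: float) -> float:
    """Relative improvement of `new` over `base`, in percent."""
    return 100.0 * (new - base) / base

# MRR values as reported in Table 5.
mrr_scores = {"TopPop": 0.332, "ACKRec": 0.475, "MOOCIR_a1": 0.484}

moocir_vs_toppop = round(relative_improvement(mrr_scores["MOOCIR_a1"], mrr_scores["TopPop"]), 1)
ackrec_vs_toppop = round(relative_improvement(mrr_scores["ACKRec"], mrr_scores["TopPop"]), 1)
moocir_vs_ackrec = round(relative_improvement(mrr_scores["MOOCIR_a1"], mrr_scores["ACKRec"]), 1)
print(moocir_vs_toppop, ackrec_vs_toppop, moocir_vs_ackrec)
```

These three values correspond to the 45.8%, 43.1%, and +1.9% MRR improvements stated in the text.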


Those results indicate that the proposed approach MOOCIR_a1 
can achieve competitive performance in terms of those 
evaluation metrics for top-k concept recommendations compared 
to the baselines and state-of-the-art methods. 


7. CONCLUSIONS AND FUTURE WORK 


In this paper, we presented MOOCIR for predicting and 
recommending concepts that might be of interest to users on MOOC 
platforms. The comparison of MOOCIR variants in Section 6 
shows that extending the matrix factorization with user and 
concept representations learned from different meta-paths 
and using attention for deriving those representations play 
crucial roles in achieving better performance. In addition, 
the results compared to other baselines and state-of-the-art 
methods indicate that MOOCIR_a1 can improve the 
performance of predicting and recommending concepts 
significantly. The comparison between the two introduced 
attention mechanisms (Eqs. 3 and 4) suggests that a more 
comprehensive approach is required when fusing the latent features 
of users and concepts into the attention mechanism, which 
will be investigated in the near future. 


ABSTRACT


The assessment of program functionality can generally be
accomplished with straightforward unit tests. However,
assessing the design quality of a program is a much more
difficult and nuanced problem. Design quality is an important
consideration since it affects the readability and
maintainability of programs. Assessing design quality and
giving personalized feedback is a very time-consuming task
for instructors and teaching assistants. This limits
personalized feedback to small class settings. Further,
design quality is nuanced and is difficult to express
concisely as a set of rules. For these reasons, we propose a
neural network model to both automatically assess the design
of a program and provide personalized feedback to guide
students on how to make corrections. The model’s
effectiveness is evaluated on a corpus of student programs
written in Python. The model has an accuracy rate from 83.67%
to 94.27%, depending on the dataset, when predicting design
scores as compared to historical instructor assessment.
Finally, we present a study where students tried to improve
the design of their programs based on the personalized
feedback produced by the model. Students who participated in
the study improved their program design scores by 19.58%.


Keywords 


Assessment, neural networks, intelligent tutoring 


1. INTRODUCTION 


Recently there has been a lot of work in the development
of tools for education in programming and computer science.
Specifically, there are many systems for intelligent tutoring
which are designed to help students learn how to solve a
programming challenge. The tutoring involved is primarily
focused on suggesting functional improvements, that is, how
to finish the program so that it works correctly.


The intelligent tutor in [4] uses reinforcement learning to
predict a useful hint in the form of an edit to a student’s
program that will get them one step closer to the goal of
a functioning program. To train the model, it uses histories
of edits made by students, starting with a blank slate and
ultimately terminating with a functional program. The
system is based on Continuous Hint Factory [9], which uses
a regression function to predict a vector that represents the
best hint and then translates that vector into a
human-readable edit. Similarly, [11] used a neural network to
embed programs and predict the program output. Using that
model of the program output, an algorithm was developed to
provide feedback to the student on how to correct their
program. Also, [16] uses a recurrent neural network to
predict student success at a task given a history of student
submissions of their program for evaluation.


All these systems model student programs from Hour of
Code [2]. Hour of Code is a massive open online course
platform that teaches people how to code with a visual
programming language. The language is simple and does not
contain control constructs such as loops.


Moreover, the combination of language and problem setting
is simple enough that there is a single or very few
functional solutions for each problem [4]. This level of
simplicity precludes the consideration of program design.
However, for general purpose programming languages such as
Python, there are many ways of creating functionally
equivalent programs. It is important for the sake of
maintainability, modularity, clarity, and re-usability that
students learn how to design programs well.


When it comes to the quality of design, there are varying
standards. Further, some standards are more objective or
easier to precisely identify than others. For example, the
use of global variables is both widely recognized as poor
design and easy to identify. For some programming languages,
“linters” exist to apply rules to check for common design
flaws. For Python, Pylint [12] is a code analysis tool to
detect common violations of good software design. It detects
design problems such as the use of global variables, functions
that are too long or take too many arguments, and functions
that use too many variables. Pylint is designed to enforce the
official standards of the Python programming community
codified in PEP 8 [15].


There are aspects of good design that are difficult to
identify. For example, simple logic is a good design idea, but
is quite nebulous. The complexity of a program’s logic is
contextual; it depends entirely on the problem the program is
solving. Also, modularity is universally judged as a quality
of good design; however, it is not always clear to what extent
a program should be made modular. How many functions
or classes are too many? Again, it depends on the context
of the problem for which the program is designed.


In a professional setting, code reviews are often practiced
to promote quality design that goes beyond the
straightforward rules of “linters.” Code reviews are a manual
process which requires a lot of human effort. A recently
developed system called DeepCodeReviewer [5] automates the
code review process with a deep learning model. By using
proprietary data on historical code reviews taken from a
Microsoft software version control system, DeepCodeReviewer
was trained to successfully annotate segments of C# code
with useful comments on the code’s quality.


However, to our knowledge, there is no system to perform
in-depth code analysis for the purposes of evaluating and
assessing design for general purpose languages in an
educational context. The process of assessing the design of a
program is time consuming for instructors and teaching
assistants, and it is an important component of a complete
intelligent tutoring system. Such a system needs to be
adjusted or calibrated for the context of particular problems
or assignments, since important aspects of software design
are context dependent. Moreover, the system needs to match
the particular standards of an instructor. Hence we propose
a system that models design quality with a neural network
trained on previously assessed programs.


1.1 Our Approach and Contribution

We propose a design quality assessment system based on a
feed-forward neural network that utilizes an abstract syntax
tree (AST) to represent programs. The neural network is
a regression model that is trained on assessed student
programs to predict a score between zero and one. Each feature
the model uses is designed to be meaningful to human
interpretation and is based on statistics collected from the
program’s AST. We intentionally do not use deep learning
as it would make the representation of the program difficult
to understand. Personalized feedback is generated based
on each feature of an individual program. By swapping a
feature’s value for an individual program with the average
feature value of good programs, it is possible to determine
which changes need to be made to the program to improve
its design. The primary contributions of this work are the
following:


• The first to explicitly predict the design quality of
  programs in an educational setting, to the best of our
  knowledge.

• High efficacy, with an accuracy from 83.67% to 94.27%
  and only small amounts of training data required.

• The first intelligent tutoring system for design quality
  for Python.

• Personalized feedback without explicit training or
  annotation.


Figure 1: The model of program design quality, a
feed-forward neural network. The “Input Layer” is the feature
vector created from the AST. The “Hidden Layer” corresponds
to the calculation of $x'$ specified in Equation 1. Finally,
the “Output Layer” produces a single value, the design score
as found in Equation 2.


2. METHOD 


The task is to predict a design quality score for a student
program written in Python. The score $y$ is a real number
between zero and one. The program is represented by a
feature vector $\vec{x}$ produced by a series of feature
functions computed from the program’s AST.


For an AST $T$, the outputs of a series of feature functions
$f_i(T)$ are concatenated into a feature vector $\vec{x}$ that
represents key aspects of the program’s design. The model
$g(\vec{x}; \Theta)$ is a feed-forward neural network with a
single hidden layer. It is a regression model that predicts
the score $y$ based on the feature vector $\vec{x}$ and
parameters $\Theta$.


2.1 Features 

Despite recent advances in deep learning, we chose to
represent the student program with feature functions computed
on its AST. Deep learning is highly effective at learning
useful feature representations of everything from images to
time series to natural language texts. However, deep learning
also requires large amounts of data, and in this setting the
quantity of manually annotated student programs is limited.


Additionally, ASTs are a natural and effective means of
representing and understanding programs and can be created
with free, available tools. An AST is an exact representation
of the source code of a program based on the programming
language’s grammatical structure. Producing an AST
representation of a program is an essential first step in
compilers and interpreters. The AST of a program contains all
the content of its source but is also augmented with the
syntactic relationships between every element. A parser and
tokenizer to produce an AST for the Python programming
language is provided by its own standard library. This makes
the AST the natural representation to use, since it is free,
convenient, exact, interpretable, and does not require any
additional data. In contrast, deep learning would require a
large amount of data to effectively reproduce the same
representation.


[Figure 2 diagram omitted. Recoverable node labels: Arguments
(values), Assignment, Constant (0.0), For-loop, Load (values),
AugAssign, BinaryOp, Division (/), Return.]

    def avg(values):
        total = 0.0
        for value in values:
            total += value
        return total / len(values)

Figure 2: An abstract syntax tree for a segment of Python code that computes the average of the values in a list. The AST has
been condensed for the sake of brevity and the restrictions of space. The related code segment is shown in blue.


Prior to representation as an AST, a program must first be
broken into a series of tokens via the process of
lexicalization. Lexicalization is the process of reading a
program character-by-character and dividing it into word-like
tokens. These tokens are also assigned a type such as a
function call or variable reference. The grammatical rules of
the language are applied to the lexicalized program to create
the AST. In an AST, the leaf nodes of the tree are the
program’s tokens while the interior nodes correspond to
syntactic elements and constructions. For example, an interior
node could represent the body of a function or the assignment
of a value to a variable. An example of an AST is found in
Figure 2.
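
The tokenization and parsing steps described above can be reproduced with the `tokenize` and `ast` modules of Python's standard library. The following is a minimal sketch that uses the `avg` function from Figure 2 as input; the variable names (`SOURCE`, `tokens`, `node_types`) are illustrative assumptions and are not taken from the authors' system. Python's `ast` module yields an abstract tree, so punctuation such as parentheses and colons does not appear as nodes.

```python
import ast
import io
import tokenize

# The code segment from Figure 2.
SOURCE = '''
def avg(values):
    total = 0.0
    for value in values:
        total += value
    return total / len(values)
'''

# Step 1: lexical analysis. Split the source into typed tokens,
# keeping only names, operators, and numbers for readability.
tokens = [
    (tokenize.tok_name[tok.type], tok.string)
    for tok in tokenize.generate_tokens(io.StringIO(SOURCE).readline)
    if tok.type in (tokenize.NAME, tokenize.OP, tokenize.NUMBER)
]

# Step 2: parsing. Apply the grammar to build the syntax tree.
tree = ast.parse(SOURCE)

# Collect the type name of every node, starting from the root.
node_types = [type(node).__name__ for node in ast.walk(tree)]

print(tokens[:4])
print(sorted(set(node_types)))
```

The interior nodes printed here (for example `FunctionDef`, `For`, `AugAssign`, and `Return`) correspond to the constructs labelled in Figure 2, although the exact node names depend on the Python version.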


Given that an AST is a complete representation of a
program, it is a natural basis for assessing the quality of a
program’s design. Deep learning may be able to automatically
learn the same key syntactic relationships with enough
data; however, this information is readily available via the
AST. Further, features computed from the AST will be human
interpretable, unlike a representation produced by deep
learning.


The features we created are all based on statistics collected
from a program’s AST. Some consist of simply counting the
number of nodes of a given type, for example, the number
of user defined functions. Other feature functions are based
on subsections of the AST, such as the number of nodes per
line or per function. Finally, some features are ratios or
percentages, such as the average percentage of lines in a
function that are empty. All of the features are relatively
simple and fast to compute, yet generally capture the design
and quality of a program. Each of the feature functions
$f_i(T)$ we defined can be found in Appendix A.
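
To make the idea of AST-based feature functions concrete, the sketch below computes a handful of the statistics named in Appendix A. The paper does not give exact definitions, so the function names (`count_nodes`, `extract_features`), the feature keys, and details such as how nested functions are counted are assumptions made for illustration only.

```python
import ast


def count_nodes(tree, node_type):
    """Count the nodes of a given type anywhere in the tree."""
    return sum(isinstance(node, node_type) for node in ast.walk(tree))


def extract_features(source):
    """Return a dict of simple design features for one program."""
    tree = ast.parse(source)
    functions = [n for n in ast.walk(tree) if isinstance(n, ast.FunctionDef)]
    lines = source.splitlines()
    empty_lines = sum(1 for line in lines if not line.strip())

    # AST nodes per function: average subtree size over all function
    # definitions (0.0 when the program defines no functions).
    # Assumption: nested functions are counted in their parent too.
    nodes_per_function = 0.0
    if functions:
        sizes = [sum(1 for _ in ast.walk(f)) for f in functions]
        nodes_per_function = sum(sizes) / len(sizes)

    return {
        "num_functions": len(functions),
        "num_if": count_nodes(tree, ast.If),
        "num_return": count_nodes(tree, ast.Return),
        "num_calls": count_nodes(tree, ast.Call),
        "num_try": count_nodes(tree, ast.Try),
        "total_lines": len(lines),
        "empty_lines": empty_lines,
        "nodes_per_function": nodes_per_function,
    }


EXAMPLE = (
    "def avg(values):\n"
    "    total = 0.0\n"
    "\n"
    "    for value in values:\n"
    "        total += value\n"
    "    return total / len(values)\n"
)
features = extract_features(EXAMPLE)
print(features)
```

Returning a dictionary keyed by feature name keeps each value tied to a human-readable label, which matters later when feedback is phrased in terms of individual features.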


2.2 Model 


The model is a feed-forward neural network [13] with a single
hidden layer and a single neuron in the output layer. The
model’s structure is illustrated in Figure 1. The values of
the input layer are the feature vector $\vec{x}$. Each neuron
in the hidden layer $x'_j$ is defined with the following
equation:

$$x'_j = \mathrm{ReLU}\left( \sum_{i=1}^{d} w_{i,j} x_i \right) \qquad (1)$$

where $d$ is the dimension of $\vec{x}$ and
$w_{i,j} \in \Theta$ are the parameters, or “weights”, of the
neuron. We use the ReLU [8] as the activation function for the
hidden layer neurons. The final prediction of the design score
is made by the output layer’s single neuron:

$$y = \mathrm{sigmoid}\left( \sum_{j=1}^{d'} w_j x'_j \right) \qquad (2)$$

where $d'$ is the number of hidden layer neurons and
$w_j \in \Theta$ are the weights of this neuron. The sigmoid
function is used because its domain spans $(-\infty, \infty)$
but its range is $(0, 1)$, which guarantees the model always
outputs a valid score. The model is trained with mean squared
error as the loss function:

$$L(\Theta) = \frac{1}{n} \sum_{k=1}^{n} \left( y_k - \hat{y}_k \right)^2 \qquad (3)$$

where $\hat{y}_k$ is the ground-truth design score for the
$k$-th instance, i.e. program, $y_k$ is the model’s prediction
for it, and $n$ is the number of instances in the training
data. The model is trained with the ADAM algorithm [7],
and each parameter in the model was regularized according
to its L2 norm [6]. For all our experiments, a hidden
layer of size 32 was used. The model was trained for 250
epochs, and the model from the best round according to a
development set was selected for our experiments.
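
As an illustration of the architecture described in this section (a hidden layer of ReLU neurons followed by a single sigmoid output neuron, scored with mean squared error), the following is a minimal pure-Python sketch of the forward pass only. It assumes no bias terms, since none are mentioned in the text. The random initialization in `init_weights` and the input dimension of 33 used in the demonstration are hypothetical choices, and training with ADAM and L2 regularization is not shown.

```python
import math
import random


def relu(z):
    return max(0.0, z)


def sigmoid(z):
    # Numerically stable form that avoids overflow for large |z|.
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def predict(x, hidden_weights, output_weights):
    """Design score for feature vector x.

    hidden_weights: one row of d weights per hidden neuron.
    output_weights: one weight per hidden neuron.
    """
    hidden = [relu(sum(w * xi for w, xi in zip(row, x)))
              for row in hidden_weights]
    return sigmoid(sum(w * h for w, h in zip(output_weights, hidden)))


def mse(predictions, targets):
    """Mean squared error between predicted and ground-truth scores."""
    return sum((p - t) ** 2 for p, t in zip(predictions, targets)) / len(targets)


def init_weights(d, d_hidden, seed):
    """Hypothetical random initialization; the scheme is an assumption."""
    rng = random.Random(seed)
    hidden = [[rng.uniform(-0.1, 0.1) for _ in range(d)]
              for _ in range(d_hidden)]
    output = [rng.uniform(-0.1, 0.1) for _ in range(d_hidden)]
    return hidden, output


# An untrained network with 32 hidden neurons, as in the experiments.
hidden_w, output_w = init_weights(d=33, d_hidden=32, seed=0)
print(predict([1.0] * 33, hidden_w, output_w))
```

Because the output passes through the sigmoid, every prediction lies strictly between zero and one regardless of the weights.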


2.3 Ensemble 

Because fitting neural networks to data is a local
optimization problem, the effect of the initial values of the
parameters $\Theta$ of the model remains after training. As a
result, repeated training runs on the same data produce
different models. This variation in the results of a trained
model is particularly pronounced when the training data set is
relatively small. To address this variation and mitigate its
impact, an ensemble of models can be trained, each with
different initial parameter values. Each model is
independently trained, and a single prediction is made by a
simple average of the individual predictions, i.e.

$$y = \frac{1}{m} \sum_{l=1}^{m} y_l \qquad (4)$$

where $m$ is the number of models and $y_l$ is the prediction
of the $l$-th model. For our experiments, an ensemble of 10
models was used.
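
The averaging step can be sketched in a few lines. Here each model is represented by an arbitrary callable that maps a feature vector to a score; the three constant "models" below are placeholders standing in for independently trained networks, not real results.

```python
def ensemble_predict(models, x):
    """Average the predictions of independently trained models."""
    predictions = [model(x) for model in models]
    return sum(predictions) / len(predictions)


# Placeholder models: each returns a fixed, made-up score.
models = [lambda x: 0.70, lambda x: 0.80, lambda x: 0.90]
score = ensemble_predict(models, [3.0, 1.0])
print(score)
```

Since every individual prediction lies between zero and one, so does their average, and the ensemble output remains a valid design score.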


2.4 Personalized Feedback 


The goal of intelligent tutors is to provide personalized
feedback and suggestions on how to improve a program. The
most straightforward means of providing feedback would
be to simply predict which possible improvements apply to
a given program. However, training a model to directly
predict relevant feedback would require a dataset of programs
with corresponding feedback, and such a dataset can be hard
to find or expensive to construct.


In order to avoid the need for a dataset with explicit
feedback annotation, we use our model, trained on predicting
design score, to evaluate how changes in program features
would lead to a higher assessed score. Using the training
data, we compute an average feature vector $\bar{x}$ of all
the “good” programs, i.e. those with a design score greater
than 0.75. To generate feedback for a program, its feature
vector $\vec{x}$ is compared to the average $\bar{x}$. For
each feature $i$, a new vector $\vec{x}'$ is created by
replacing the feature value $x_i$ with the average’s value
$\bar{x}_i$. This process sets up a hypothesis: what if the
program were closer to the average “good” program with regard
to a particular feature? To answer this, the trained model
$g(\vec{x}; \Theta)$ is used to predict a design score for the
new vector, i.e. $y'_i = g(\vec{x}'; \Theta)$. By comparing
the original score of the program $y$ with the new score
$y'_i$, the hypothesis can be tested. If the new score $y'_i$
is greater than the original predicted score $y$, then the
alteration of $x_i$ to be closer to $\bar{x}_i$ is an
improvement. Feedback based on this alteration is recommended
to the student as personalized feedback. Since each feature in
$\vec{x}$ is understandable to a human, feedback is given in
the form of a suggestion to increase or decrease particular
features. The suggestion for alteration is based on the
comparison of $x_i$ versus $\bar{x}_i$: if
$x_i > \bar{x}_i$, the feedback to decrease $x_i$ is given. In
the other case, where $x_i < \bar{x}_i$, the feedback is to
increase $x_i$. Based on the feature and the feedback of
increase or decrease, a user-friendly sentence is selected
from a table of predefined responses. For example, if $x_i$ is
the number of user defined functions and $x_i = 3$,
$\bar{x}_i = 5$, and $y'_i > y$, then the feedback
“increase the number of user defined functions” is created.
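
The feature-swapping procedure described above can be sketched as follows. This is an illustrative reading of the text, not the authors' implementation: the function names, the toy scoring function `toy_model` (which stands in for the trained model), and all of the numbers are hypothetical.

```python
def mean_good_features(feature_vectors, scores, threshold=0.75):
    """Average feature vector of programs scoring above the threshold."""
    good = [x for x, s in zip(feature_vectors, scores) if s > threshold]
    if not good:
        raise ValueError("no program scores above the threshold")
    return [sum(column) / len(good) for column in zip(*good)]


def generate_feedback(x, good_mean, predict, feature_names):
    """Suggest feature changes that raise the predicted design score."""
    base_score = predict(x)
    feedback = []
    for i, name in enumerate(feature_names):
        if x[i] == good_mean[i]:
            continue  # already at the "good" average; nothing to suggest
        swapped = list(x)
        swapped[i] = good_mean[i]  # hypothesis: move feature i to the average
        if predict(swapped) > base_score:
            direction = "decrease" if x[i] > good_mean[i] else "increase"
            feedback.append(f"{direction} {name}")
    return feedback


def toy_model(v):
    # Hypothetical stand-in for the trained model: the score peaks at
    # 5 functions and 0 global variables.
    return 1.0 / (1.0 + abs(v[0] - 5.0) + abs(v[1] - 0.0))


names = ["the number of user defined functions",
         "the number of global variables"]
# Made-up training data: three "good" programs and one poor one.
good_mean = mean_good_features(
    [[5.0, 0.0], [4.0, 0.0], [6.0, 0.0], [1.0, 7.0]],
    [0.9, 0.8, 0.85, 0.3],
)
feedback = generate_feedback([3.0, 2.0], good_mean, toy_model, names)
print(feedback)
```

Each feature is tested in isolation, so a suggestion reflects the effect of changing that one feature while all others keep their original values.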


3. EXPERIMENTS 


The system was evaluated in two different experimental
settings. The first evaluation is a direct test of the model’s
accuracy on known design scores. For this, several different
datasets and settings were compared against several
baselines. The second evaluation is a small study of how
students responded to the system’s feedback. Students were
given feedback on the quality of their programs based on
the model. They were given the chance to correct their
programs after receiving feedback and have them manually
re-assessed.


3.1 Dataset 

The dataset was collected over three years from an
introduction to computer science course which teaches
Python 3. It consists of four separate programming assignments
which involve a wide range of programming skills. The simplest
is “Travel,” an assignment that involves computing the
distance a vehicle travelled after going a constant speed for
a specified duration. There are 118 student programs for
“Travel.” The next assignment, “Budget,” is a budgeting
program that lets a user specify a budget and expenses and
determines if they are over or under their budget. 168 student
programs were collected for “Budget.” The third assignment in
the dataset consists of creating a program to play
“Rock-Paper-Scissors” against the computer. For this
assignment, there are 111 student programs. The last
assignment is programming the classic casino game “Craps,”
which involves rolling multiple dice and placing different
types of bets and wagers. This assignment has 120 collected
student programs.


All the assignments require the student to write the
program from scratch in Python 3. The programs are to have
a command-line, text-based interface and user validation.
Students are required to use if-statements, loops, and user
defined functions. The “Craps” program also requires the
student to do exception handling and file I/O. A requirement
of the program was to maintain a record of the player’s
winnings across sessions of playing the game; hence the
results were required to be stored to a file. Also, the
standards of design quality go up as the course progresses,
and since “Craps” is the last assignment, it has the highest
standards. Each student program has an associated design score
that was normalized to a value between zero and one.


3.2 Baseline Methods 


The model is compared against a variety of baseline
regression methods. The simplest is linear regression, which
simply learns one weight per feature. Next is a regression
decision tree which is trained with the CART algorithm [1]. 
It has the advantage over linear regression in that it can 
learn non-linear relationships. Non-linearity means a model 
can learn “sweet-spots” rather than simply having a “more is 
better” understanding of some features. For example, hav- 
ing some modularity in the form of user defined functions 
is good, however, too many is cumbersome. The “correct” 
number of user defined functions likely should fall into a rel- 
atively small range. Model selection on the maximum depth 
of the tree with a development set was used to determine 
that 10 was the best setting. 


However, both of these models have the issue that they are
not constrained to produce a score between zero and one;
their prediction can be any real number. Hence another
baseline method was used, created to be an intermediary
step between linear regression and the neural network model.
It is a linear model with a sigmoid transformation, which
guarantees the output is between zero and one. This model
is effectively the final layer of the neural network model,
i.e. the neural network without the hidden layer. The model is
specified by the equation:

$$y = \sigma\left( \sum_{i=1}^{d} w_i x_i \right) \qquad (5)$$

where $\sigma$ is the sigmoid function. This model is also
trained with ADAM [7]. All the baseline models and the neural
network are trained with MSE as the loss function.
Method                   Travel            Budget            RPS               Craps             Combined
                         MSE    Accuracy   MSE    Accuracy   MSE    Accuracy   MSE    Accuracy   MSE    Accuracy
Linear Regression        0.009  93.09%     0.032  87.43%     0.038  83.2%      0.043  84.06%     0.027  87.03%
Decision Tree            0.018  90.10%     0.031  87.26%     0.078  77.33%     0.072  79.41%     0.076  81.42%
Sig. Linear Regression   0.022  89.60%     0.046  85.13%     0.086  77.91%     0.063  81.43%     0.070  80.64%
Neural Network           0.007  93.48%     0.024  88.48%     0.041  83.9%      0.08   79.57%     0.033  85.61%
Ensemble                 0.005  94.27%     0.022  90.14%     0.022  87.66%     0.053  83.67%     0.031  86.99%


Table 1: Design Score Prediction Results


3.3 Results 


The model was compared versus each baseline in five
different settings: the “Travel”, “Budget”,
“Rock-Paper-Scissors”, and “Craps” programs, and a combined
dataset which includes all the programs. The results of the
experiments can be found in Table 1. Each model is evaluated
according to two different metrics: MSE and average accuracy.
Average accuracy is defined as

$$\frac{1}{n} \sum_{k=1}^{n} \left( 1 - |y_k - \hat{y}_k| \right)$$

where $y_k$ is the predicted and $\hat{y}_k$ the ground-truth
design score of the $k$-th program.
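
Both evaluation metrics are simple to compute. The sketch below uses made-up scores purely for illustration; the function names and the data are assumptions, not values from the experiments.

```python
def mean_squared_error(targets, predictions):
    """Average squared difference between true and predicted scores."""
    return sum((t - p) ** 2 for t, p in zip(targets, predictions)) / len(targets)


def average_accuracy(targets, predictions):
    """Average of 1 - |error|, i.e. one minus the mean absolute error."""
    return sum(1.0 - abs(t - p) for t, p in zip(targets, predictions)) / len(targets)


# Hypothetical ground-truth and predicted design scores.
targets = [1.0, 0.5, 0.75, 0.25]
predictions = [0.75, 0.5, 1.0, 0.25]

print(mean_squared_error(targets, predictions))  # 0.03125
print(average_accuracy(targets, predictions))    # 0.875
```

Read this way, an average accuracy of 87.5% means the predicted scores are off by 0.125 on average on the zero-to-one scale.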


Overall, the decision tree and the sigmoid-transformed linear
model were clearly the two worst models. This was surprising
since decision trees are generally thought to be strictly more
powerful than linear models. However, decision trees look for
highly discriminative features to partition the data into more
consistent groups. The under-performance of the decision tree
possibly indicates that none of the features were especially
indicative of a good or bad design on their own. Instead,
the quality of a program is better described by a collection
of subtle features, which gives credence to the belief that
design quality is nuanced.


The sigmoid-transformed linear model likely under-performed
linear regression because it was trained with ADAM. ADAM does
not guarantee convergence to a global optimum, unlike the
analytical solution to linear regression. Apparently the
restriction on predictions to be within the specified range of
zero to one was not important.


Linear regression did surprisingly well, beating both the
neural network and network ensemble on the “Craps” and
combined datasets, though barely. In those cases, the
difference between the ensemble and linear regression was less
than a percent. This is likely due to the stability of linear
regression’s predictions. Though linear regression does
not have the power and flexibility of neural networks, this
can also be a benefit by limiting how wrong its predictions
are. Neural networks and even ensembles can make
overconfident predictions on outliers or other unusual cases.
The “Craps” dataset contained the most complex programs,
and it is likely that a handful of predictions significantly
brought down the average.


The network ensemble outperformed the single neural network
in every case, which is to be expected. The margin of
improvement of the ensemble versus the single neural network
in accuracy on the four individual program datasets
ranged from 1% to 4%. The network ensemble did the
best overall by being the best in most cases or coming in a
close second in all the other cases. The importance of using
an ensemble is evident on the “Craps” dataset, where the
individual neural network under-performed significantly. On
the other datasets, the neural network outperformed linear
regression by a small margin, but on “Craps” the neural
network model under-performed the linear regression model by
5%. Again, this is most likely due to the instability and
variability of neural network predictions, i.e. small
differences in features can lead to a large difference in the
prediction. In the “Craps” dataset, the improvement of the
ensemble over the single neural network model illustrates the
relative stability of the ensemble’s predictions. In every
case, the ensemble is superior to the single neural network
and had the best overall performance by producing the most
accurate results on three of the datasets and effectively
tying for the best on the other two.


One noticeable pattern was that all the models performed
better on the “Travel” and “Budget” datasets than on the
“RPS”, “Craps”, and combined datasets. Universally, the
most difficult dataset was “Craps,” which likely lowers the
accuracy on the combined dataset. Due to the shifting
standards and expectations of student assignments, a model per
assignment appears to be worthwhile. This is a bit
counter-intuitive since there are many common standards and
expectations across assignments.


Overall, the network ensemble produced reliable, accurate
results when trained per dataset. The accuracy of the
ensemble is arguably close to being useful in practical
application. Further, when the scores of one instructor are
compared against those of another instructor, or even against
the same instructor’s own scores, the rate of agreement must
be less than 100%. With the accuracy of the network ensemble
ranging from 83.67% to 94.27%, the model’s accuracy is
possibly close to a realistic ceiling.


3.4 Feedback Study 


In order to evaluate the effectiveness of the personalized
feedback, we conducted a small study on the effect of the
feedback on the design score of student programs. For the
“Rock-Paper-Scissors” program, the network ensemble was
used to generate personalized feedback for the student
programs instead of the usual instructor feedback. The network
ensemble was the same as used in the design score
experiments; it was trained with prior years’ worth of student
programs. Having received the personalized feedback, students
could opt into correcting their program for extra credit on
their assignment. The feedback was in the form of a series of
comments, where each comment was “increase” or “decrease” the
name of a feature as described in Section 2.4.


The class is an introduction to computer science course with
multiple sections and two different instructors. Students
from both instructors participated in the study. Out of 73
students enrolled across the sections of the course, 15
students chose to opt in.


The revised programs were assessed again manually for de- 
sign quality and the scores were compared against the origi- 
nals. The design score of the programs started at an average 
of 68.33% and after the feedback and correction the average 
rose to 87.92%, a 19.58% absolute improvement. Using a 
paired t-test, the improvement was judged to be significant 
with a p-value of 0.001. 


The results of the study suggest the feedback was generally
useful in guiding students to improve the design quality of
their programs. The improvement was noticeable to
the instructors anecdotally as well. For example, the usage
of global variables and “magic numbers” decreased
significantly. However, the study does have some caveats,
including its small sample size and opt-in participation. It
could be that those students willing to opt in are those most
willing or able to improve with a second chance.


4. CONCLUSIONS & FUTURE WORK 


Overall, we proposed a neural network model ensemble for
predicting the design quality score of a student program and
experimentally demonstrated its effectiveness. Further, our
system provided personalized feedback based on the
difference between a program’s feature values and the average
feature values of “good” programs. A small study provides
evidence that the feedback was of practical use to students.
Students were able to improve their programs significantly
based on the feedback they received.


There is also evidence that training models per assignment 
is most effective. However, the model needs to be evaluated 
on more programming assignments. Further, there is a pos- 
sibility of utilizing transfer learning [10] to help the model 
learn what is in common across the assignments. 


The feedback given was shown to be effective, but more
nuanced feedback could be useful. Specifically, feedback
targeted to individual lines or segments of code would
possibly help students improve their programs more
effectively. However, this may require additional supervision,
i.e. annotation for explicit training. Active learning [14] or
multi-instance learning [3] may be alternatives to gathering
additional annotation.


APPENDIX
A. FEATURE FUNCTIONS


• The number of functions
• The number of assignments
• AST nodes per function
• Lines of code per function
• Total lines of code
• Number of literals
• The proportion of white-space characters to the total
  number of characters
• Number of empty lines
• Deepest level of indentation
• Number of “if” statements
• Number of comments
• Number of AST nodes per line of code
• Number of try-except statements
• AST nodes per try-except statement
• AST nodes per “if” statement
• Number of lists
• Number of tuples
• Average line number of literals
• Average line number of function definition
• Average line number of “if” statement
• Ratio of AST nodes inside functions versus total number
  of AST nodes
• Number of function calls
• Number of “pass” statements
• Number of “break” statements
• Number of “continue” statements
• Number of global variables
• Number of zero and one integer literals
• Average line number of “import” statement
• Number of numeric literals
• Number of comparisons
• Number of “return” statements
• Maximum number of “return” statements per function
• Maximum number of literals per “if” statement


ABSTRACT


The aim of this work is to provide data-driven insights re- 
garding the factors behind dropouts in Higher Education 
and their impact over time. To this end, we analyzed stu- 
dents’ data collected by a Higher Education Institute over 
the last 11 years and we explored how socio-economic and 
academic changes may have impacted student dropouts and 
how these changes may have been reflected or captured by 
students’ data. To analyze the data, we engineered fea- 
tures that may predict student dropouts on three dimen- 
sions: academic background, students’ performance and stu- 
dents’ effort. Then we carried out a correlation analysis to 
investigate the potential relationship between these features 
and dropouts, we performed a multivariate analysis of vari- 
ance (MANOVA) to investigate whether the engineered fea- 
tures change significantly among student cohorts with dif- 
ferent admission year and, finally, we carried out a regres- 
sion analysis to confirm that the engineered features’ impact 
on predicting dropouts changes over the years. The results 
suggest that the importance of features regarding the aca- 
demic background of students (such as the students’ prior 
experience with the academic institution), and the effort 
students make (for example, the number of days students 
spend on academic leave) may change over time. On the
contrary, performance-based features (such as credit points
and grades) do not interact with time, suggesting that
performance measures are stable predictors of dropouts over time.
On the basis of the findings, we argue that the performance 
of prediction models for assessing students at risk of drop- 
ping out of their studies can be affected by the age of data 
and we outline the possibility of including a forgetting fac- 
tor for non-recent data in order to leverage their impact on 
prediction performance. 


Keywords 
dropouts, feature engineering, predictive modeling, higher 
education 
Student retention is pivotal for the success of an educational
institute. In the past, understanding the reasons behind
dropouts required analyzing individual student cases one by
one. The advent of information technology, the use of digital
technologies by educational institutes, and the collection
of rich data regarding students’ background, performance
and effort offer the possibility of using advanced analytical
approaches, such as machine learning, to identify
trends and patterns that may indicate students at risk of
dropping out of their studies [13].


To ensure quality education, Higher Education Institutes
(HEIs) typically offer analytical solutions, such as learning
dashboards, to inform stakeholders (for example, program
directors, academic specialists and instructors) about
student dropouts [2]. To do so, machine-learning models
are typically employed to analyze data collected by Study 
Information Systems (SISs) and Learning Management Sys- 
tems (LMSs) and to predict whether a student faces a risk to 
drop out from their studies [1, 4]. This is a well-established 
practice but little research has been carried out with re- 
spect to the temporal aspects of data, such as the age of 
data used to train predictive models for assessing dropouts. 
One may argue that — in terms of predictive performance — 
the more training data, the better. However, our hypothesis 
is that the factors affecting dropouts in Higher Education 
(HE) change significantly over time due to socio-economic 
conditions [16] and to such an extent that data age may 
affect the computational model’s predictive accuracy. 


The goal of this research is to analyze the data collected over
11 years (2010 to 2020) from the SIS of a national European
HEI. The objective is to engineer and identify the important
log-based features behind dropouts, to examine how these
features may change over the years, and to explore their impact
on predicting dropouts. The contribution of this work is twofold:


• to provide insights regarding log-based features that
  may relate to student dropouts in HE;


• to explore the relationship between the aforementioned
  features and time regarding their impact on dropout
  prediction.


In the following section we provide a short overview of
related research, then we present our methodological approach
and we follow up with the results of our analysis. We
conclude with a contextualized discussion on our findings, the
practical implications of this work and limitations as well as
potential future directions.


2. RELATED WORK 


As dropouts in Higher Education, we identify the cases of
students who do not successfully complete their studies for
reasons that indicate lack of motivation and willingness to
pursue an academic degree. Dropout in HE is a prominent
issue with negative impacts for students and institutions
that also affects national and international policies¹.


The reasons behind student dropouts can vary between
personal (for example, students feeling isolated or homesick [8]),
academic (such as students’ lack of background knowledge
or study skills [9]) and socio-economic (for example,
financial difficulties and cultural adaptation [16, 12]). At the
same time, factors that relate to the academic institution 
rather than the students themselves (such as the quality of 
studies and resources that the institution offers [3]) can also 
affect student dropouts. Tinto’s theoretical model of stu- 
dents’ dropouts from college [15] identified two dimensions 
as crucial in terms of academic success: student’s charac- 
teristics (such as family background and goals) and stu- 
dent’s experience with the academic system (such as stu- 
dent’s performance and relationship with mentors and col- 
leagues). Crosling et al. [7] attributed student dropouts to 
the services the academic institutes offer to students, such 
as information regarding the admission process, quality of 
the teaching, and assessment, and [11] investigated the
relationship between the socio-economic status of a country
and student dropouts. Other work [10] argued that student
dropout is often related to a combination of reasons that in- 
clude individual and curriculum-level factors, for example, 
inefficient study skills and inefficient academic or social en- 
vironment. 


In this work, we examine the case of an Estonian HEI.
Estonia, being a relatively new member of the European Union,
is going through social, structural and economic changes in
many sectors, including higher education. We argue that
these socio-economic changes, which arguably affect student
dropouts, may also affect the performance of predictive
algorithms that model student dropouts if temporal aspects
of students’ data (such as the age of data as depicted, for
example, by students’ admission year) are not taken into
account.


3. METHODOLOGY 

This research was carried out in an Estonian Higher Educa- 
tion Institute (HEI). Recently, the HEI launched an initia- 
tive aiming to support students in successfully completing 
their studies. To do so, the HEI designed a learning analyt- 
ics (LA) dashboard that provided information to academic 
stakeholders (in this case, program directors and academic 
specialists) regarding potential reasons that may contribute 
to dropouts in their programs and suggestions concerning 
appropriate feedback and support that they could offer to 
students-at-risk. To provide this information, the LA
dashboard used students’ data collected by the SIS of the HEI
(with the students’ informed consent) throughout the
students’ academic career [5].


¹ http://publications.europa.eu/resource/cellar/d9de3b17-0dcf-11e6-ba9a-01aa75ed71a1.0001.01/DOC_1


To identify students at risk of dropping out of their
studies, the LA dashboard used a predictive model (described
in [5]) that assessed dropout risk on three dimensions:
academic background of the student, student’s performance,
and effort. The separation of the dimensions would help the
institute to link dropout factors directly to students’
cohorts. Each dimension was defined based on pre-selected
engineered features from the SIS database. In this work, we
used data collected for students on the bachelor level (from
2010 to 2020) to explore whether the predictive features used
by the model change over time, to what extent, and what is
the impact of this change on dropout prediction.


3.1 Method of Study 


Our hypothesis was that the performance of dropout
predictive models trained with student data collected over
various admission years will not be consistent over time,
because the predictive features change significantly over
time. For the purpose of our research, we followed a
three-step approach:


• we performed a correlation analysis to explore indications
  of potential relationships between log-based, engineered
  student features and dropouts per admission year;


• we carried out a MANOVA to establish that the log-based
  features retrieved from the correlation analysis vary
  significantly over student cohorts of different admission
  years;


• we performed a regression analysis with interaction terms
  to investigate the effect of the log-based features
  (retrieved from correlation analysis and MANOVA) on
  dropout prediction for student cohorts admitted in
  different years.


As a proof of concept, we trained a regression model as a 
binary classifier to predict student dropouts using the engi- 
neered features that we acquired from the aforementioned 
process. Then, we tested the performance of the classifier 
on unseen data. An overview of the method of study is 
presented in Figure 1. 


3.2 Description of data 

In this work, we used data of bachelor-level students that the 
HEI collected using the Study Information System (SIS) over 
a period of 11 years (from 2010 to 2020). The data was orig- 
inally organized in 4 tables containing information regarding 
students’ academic background, demographics, study place, 
and study info data. 


[Figure 1 flowchart. Boxes: Correlation Analysis; MANOVA;
Regression Analysis with Interaction Term; proof of concept
(train predictive model on training set, test predictive
model on unseen data).]

Figure 1: A graphical representation of the method of study,
including the three-step analytical approach and the
proof-of-concept example.


In the SIS database, each student and each study place (or
else, curriculum enrollment) have different unique identifiers
(“person ID” and “study place ID”, respectively). This
consequently means that the relationship between students and
curriculum enrollments is 1 to N; that is, one student may
be enrolled in multiple curricula at the same time. In order
to create one working dataset, we merged the four database
tables using a combination of the unique keys “person ID”
and “study place ID”. Using this dataset, we engineered a set
of features that can potentially describe a student’s academic
profile on three dimensions (academic background, academic
performance, and academic effort) and may provide insights
regarding students who may be at risk of dropping out of
their studies. Following the recommendation of the ethics
committee of the HEI, we excluded information that could be
linked to students’ identity and demographic background to
avoid potential discrimination and gender or racial bias.
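
The merge on the two keys can be sketched as follows. This is only an illustration under stated assumptions: the table names (`study_places`, `study_info`, `background`), the field names, and the sample values are hypothetical, and only three tables are shown because the source does not describe the actual schema.

```python
def merge_tables(study_places, study_info, background):
    """Build one row per curriculum enrollment.

    study_places: list of dicts, each with "person_id" and "study_place_id"
    study_info:   dict mapping study_place_id -> enrollment-level fields
    background:   dict mapping person_id -> student-level fields
    """
    merged = []
    for place in study_places:
        row = dict(place)  # copy so the input is not mutated
        row.update(study_info.get(place["study_place_id"], {}))
        row.update(background.get(place["person_id"], {}))
        merged.append(row)
    return merged


# Hypothetical sample: one student enrolled in two curricula (1 to N).
study_places = [
    {"person_id": 1, "study_place_id": "A"},
    {"person_id": 1, "study_place_id": "B"},
]
study_info = {"A": {"credits_earned": 120}, "B": {"credits_earned": 30}}
background = {1: {"normalized_score": 0.8}}
rows = merge_tables(study_places, study_info, background)
```

Each enrollment keeps its own row, so student-level fields are repeated for students with several study places.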


As dropouts, we identified students who terminated their
studies due to reasons (as recorded by the SIS of the HEI)
that may indicate lack of motivation or unwillingness to
pursue an academic degree. Students can drop out at any
point during the academic year, but the HEI records
students’ “exmatriculation” at the beginning of every semester.
In total, the dataset consisted of 9623 students who were
enrolled in the bachelor programs offered by the HEI. Out of
these students, 3428 dropped out at some point during their
studies before they acquired an academic degree.
Figure 2 shows the distribution of the dropout ratio (that
is, the number of students who dropped out over the whole
bachelor-level student population) per admission year, over
11 years. For year 2020 we only obtained data for the first
academic semester (February to June).
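
The dropout ratio per admission year is a simple grouped proportion. The sketch below assumes a hypothetical record layout of `(admission_year, dropped_out)` pairs; only the two totals at the end come from the text.

```python
from collections import defaultdict


def dropout_ratio_by_year(records):
    """records: iterable of (admission_year, dropped_out) pairs."""
    totals = defaultdict(int)
    dropouts = defaultdict(int)
    for year, dropped_out in records:
        totals[year] += 1
        dropouts[year] += int(dropped_out)
    return {year: dropouts[year] / totals[year] for year in sorted(totals)}


# Overall ratio implied by the counts reported in the text
# (3428 dropouts out of 9623 students).
overall_ratio = 3428 / 9623
```

With the reported counts, the overall ratio is roughly 0.356, that is, a little over a third of the students.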


3.3 Features Engineering


For each dimension of a student’s academic career, we
engineered a set of features from data recorded from the SIS of
the HEI. In brief:


• Academic Background: The SIS records information
  regarding students’ earlier academic background when
  students enroll in a study program offered by the HEI.
  We engineered features related to students’ earlier
  academic degrees, the admission score, admission special
  conditions (for example, good results in Olympiads,
  high scores in the academic aptitude test) and the
  number of previous enrollments in study programs
  offered by the same HEI.


• Performance: Here, we engineered features related to
  students’ performance as depicted by grades and awarded
  credits throughout the study program. Performance-related
  features include credits earned, grades, and cumulative
  positive and negative study results (that is,
  numbers of passed and failed courses).


• Effort: Here, we considered features that can represent
  a student’s overall effort during their studies. Some of
  these features are the number of days a student spends
  on academic leave, the number of credits the student
  cancelled throughout the semester, the number of
  registered courses during a semester, and information
  about the student’s allowances and achievement stipends.


The complete set of features per dimension along with a
short description for each is presented in the appendix (Table
4).


Figure 2: The dropout ratio, that is, the ratio of the students
who dropped out over the whole bachelor-level student
population per admission year, from 2010 to 2020.
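
Two of the engineered features have definitions simple enough to sketch directly: the admission score normalized by the minimum and maximum score, and the ratio of credits earned to credits registered (see Table 4). The function names and the handling of edge cases (identical scores, zero registered credits) are assumptions made for this illustration; the source does not say how such cases were treated.

```python
def min_max_normalize(scores):
    """Rescale scores to the range [0, 1] using the min and max score."""
    lo, hi = min(scores), max(scores)
    if hi == lo:
        # Assumption: all-equal scores map to 0.0 to avoid dividing by zero.
        return [0.0 for _ in scores]
    return [(s - lo) / (hi - lo) for s in scores]


def credits_fulfilled(earned, registered):
    """Ratio of credits earned vs. credits registered."""
    # Assumption: a student with no registered credits gets a ratio of 0.0.
    return earned / registered if registered else 0.0
```

For example, admission scores of 50, 75 and 100 normalize to 0.0, 0.5 and 1.0, and a student who earned 45 of 60 registered credits has a fulfilment ratio of 0.75.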


4. RESULTS 


The results are presented per each step of the analytical 
process: the correlation analysis, the MANOVA and the re- 
gression analysis. For simplicity, we only report statistically 
significant findings at the p < 0.05 level. Then, we re- 
port our exploratory findings from the prediction example 
as proof-of-concept. 


4.1 Correlation Analysis 

We carried out a correlation analysis (Spearman’s rank-order 
correlation) to explore the potential relationship between the 
engineered features and student dropouts. We only report
statistically significant correlations at the p < 0.05
significance level with medium and strong correlation coefficients
(|ρ| > 0.3) (Table 1). The correlation analysis suggests that
features representing student performance and effort, such
as the number of credits a student earns or the number of
courses they register, may relate negatively with the
probability of dropping out of their studies (that is, the more
courses they register, the less likely they are to drop out). One
interesting finding was that the student’s economic support was
negatively correlated with student dropouts from 2010 to 
2010 2011 2012 2013 2014 2015 2016 2017 2018 2019 2020 


Performance Features 


nr.of.courses.with.any.grade -0.58 -0.65 -0.71 -0.59 
credits.earned -0.72 -0.74 -0.74 -0.76 
extracurricular.credits.earned -0.55 -0.56 -0.56 -0.53 
all.results -0.49 -0.59 -0.62 -0.53 

negative.results 0.32 0.38 0.37 

grade.A -0.34 -0.42 -0.46 -0.46 

grade.B -0.47 -0.48 -0.50 -0.54 

grade.C -0.33 -0.35 -0.30 -0.40 


-0.66 -0.72 -0.78 -0.71 -0.70 -0.35 
-0.79 -0.80 -0.80 -0.74 -0.73 -0.47 
-0.58 -0.54 -0.59 -0.43 -0.48 

-0.60 -0.66 -0.75 -0.68 -0.63 -0.36 


0.30 0.31 


-0.43 -0.45 -0.38 -0.34 -0.35 
-0.56 -0.55 -0.53 -0.38 -0.41 
-0.43 -0.46 -0.48 -0.35 -0.36 


grade.F 0.31 
passed -0.66 -0.69 -0.65 -0.60 -0.58 -0.64 -0.65 -0.52 -0.35 
not.present 0.31 
Effort Features 
days.on.academic.leave 0.31 
days.studying.abroad  -0.30 
credits.cancelled -0.34 -0.44 


nr.of.courses.registered -0.58 -0.65 -0.71 -0.59 
credits.registered -0.67 -0.71 -0.73  -0.57 
total economic_support -0.48 -0.58 -0.52  -0.31 
study_period_in_years -0.40 -0.63 -0.40 -0.40 


-0.66 -0.72 -0.78 -0.71 -0.74 -0.43 
-0.67 -0.71 -0.77 -0.73 -0.74 -0.47 


-0.47 


-0.50 -0.59 -0.83 -0.72 -0.96 -0.87 


Table 1: Spearman’s rank correlation for the engineered features and student dropouts per admission year. Here we present
correlations where |ρ| > 0.3 and p < 0.05. The features for the dimension of Academic Background did not appear to correlate
strongly with dropouts over the admission years.


2013 but no correlation appears for the past few years (with 
the exception of 2018). This may suggest that presently 
students are in a better financial situation and can therefore 
afford studying until they complete their degrees. Alterna- 
tively, it may indicate a change in the state’s or the univer- 
sity’s policy regarding tuition fees. 

The correlation analysis did not reveal any strong and sig- 
nificant relationship between dropouts and features of the 
student’s academic background. However, correlations only 
suggest the potential existence of relationships. Therefore 
additional analysis is necessary to establish whether the im- 
portance of the engineered features on dropouts may change 
over time. 
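
Spearman's rank-order correlation is the Pearson correlation computed on ranks, with tied values sharing their average rank. The sketch below is a plain-Python illustration of that definition; the function names are chosen here, the significance test (p value) is not included, and the authors' actual tooling is not stated in the source.

```python
from math import sqrt


def average_ranks(values):
    """1-based ranks; tied values receive the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with values[order[i]].
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks


def spearman_rho(x, y):
    """Pearson correlation of the rank vectors of x and y."""
    rx, ry = average_ranks(x), average_ranks(y)
    n = len(x)
    mx, my = sum(rx) / n, sum(ry) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    sx = sqrt(sum((a - mx) ** 2 for a in rx))
    sy = sqrt(sum((b - my) ** 2 for b in ry))
    return cov / (sx * sy)
```

Applied per admission year to one feature and the binary dropout label, a coefficient with |ρ| > 0.3 would pass the reporting threshold used in Table 1, provided it is also significant.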


4.2 Multivariate Analysis of Variance 


Next, we performed a one-way MANOVA to investigate whether
the engineered features vary significantly for student cohorts
admitted over different academic years. The engineered fea- 
tures were the dependent variables and the admission year 
was the independent variable for each of the dimensions. 
The results of the MANOVA are presented in Table 2. For 
the academic background dimension, all the features appear 
to be significantly different among the independent groups 
(p < 0.05) which may indicate that the academic back- 
ground features are year-dependent. For both the perfor- 
mance and effort dimensions, the majority of features vary 
significantly among student cohorts of different admission 
years with p < 0.05. 

Based on the MANOVA results we assume that the engi- 
neered features appear to be significantly different for stu- 
dent cohorts based on the admission year. This may conse- 
quently signify that the impact of the log-based features on 
dropout can be time-dependent. 


4.3 Regression Analysis 


To further explore whether the performance of a predic- 
tive model depends on temporal aspects of training data, 
we carried out a (logistic) regression analysis with the vari- 
able ”dropout” as the dependent variable, the predictive fea- 
tures as the independent variables and admission year as the 
interaction term. Table 2 presents the features that inter- 
acted with admission year. Regarding students’ academic 
background, we found that the students’ previous experi- 
ence with the HEI is dependent on admission year while the 
normalized admission score is significant in terms of regres- 
sion analysis but marginal (p = 0.07) in terms of interaction 
with admission year. Time-dependency of previous experi- 
ence with the HEI may reflect structural or policy changes 
of the academic institution that affect students’ experience. 
Regarding the admission special conditions, we did not find 
any interaction with admission year or dropout (also evident 
from the correlation analysis). Furthermore, the results sug- 
gested that the students’ previous study level — in case of 
master’s level studies - may be important for dropout pre- 
diction and interact with admission year. A potential expla- 
nation could be that there is a confounding effect between 
the feature indicating previous studies in the same institu- 
tion and the feature indicating previous study level. 


Concerning students’ performance, the results suggest that 
features such as the credits a student earns or the grades 
they are awarded can be used to indicate dropout risk. How- 
ever, performance-based features do not appear to interact 
with admission year. In other words, their impact on pre- 
dicting student dropout does not depend on the year a stu- 
dent was admitted in the academic institution. Regarding 
the importance of features that denote effort, such as the 
time a student spends on an academic leave, and the num- 
ber of registered credits, they seem to have a different effect 
on student dropouts depending on the year of admission. 
This means that the features’ weight on dropout prediction
changes when coupled with the interaction term. This
may indicate that the impact of these features on student 
dropouts is not consistent over time. We did not find any 
indications that effort-related features such as the credits a 
student cancels over the semester or the duration of studies 
(study period in years) depend on admission year. 
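
For a single feature, the regression with an interaction term can be written out as below. This is an assumed standard form for a logistic model with an admission-year interaction; the source does not state the exact specification (for example, whether admission year enters as a numeric or a categorical term), and the symbols are introduced here for illustration only.

```latex
\operatorname{logit} P(\text{dropout}_i = 1)
  = \beta_0 + \beta_1 x_i + \beta_2 t_i + \beta_3 \, (x_i \cdot t_i)
```

Here $x_i$ is the value of the feature for student $i$, $t_i$ is the admission year, and $\beta_3$ is the interaction coefficient. A significant $\beta_3$ means that the effect of the feature on dropout depends on the admission year, which is how the last three columns of Table 2 are read.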


4.4 Proof of Concept 

To further explore the impact of time on modeling student
dropouts using log-based student features, we split the data
into two sets: a training set and a test set. For the training
set, we included all records of students admitted from 2010 to
2020, except those admitted in 2011 and in 2018 (two points
close to the chronological beginning and end of our data
collection), which formed the test set. We used the training
set to train a regression model, and we used the trained model
as a binary classifier to predict student dropouts on the test
set of unseen data. For both test cohorts, there are notable
differences when examining the performance metrics derived
from the confusion matrices of the binary classifiers
(Table 3). The binary classifiers for performance and effort
performed differently for the student cohorts admitted in 2011
and in 2018 in terms of accuracy, precision and recall, while
the results were similar for the academic background
dimension. In terms of accuracy and precision, the models
performed better on the 2011 dataset, while in terms of recall
they performed better on the 2018 dataset. Recall is important
here as the objective is to identify the students who are
likely to drop out (positive class) and to reduce the false
negative outcomes (that is, students who were predicted as not
at risk of dropping out but actually dropped out). Higher
precision on the 2011 test set reflects the model’s limited
ability to detect dropouts in older data (such as admission
year 2011), for which the model seems to retain less
relevance: it produces fewer false positives, which increases
the overall precision value. In contrast, higher recall on the
2018 dataset indicates the model’s better fit with recent
data, which contributes to fewer false negatives.
However, we acknowledge that 2018 is fairly recent and some
students enrolled in that year might not have dropped out
yet, leading to inaccurate results. As for accuracy, 36.05%
(3273 out of 9078) of the instances in the training dataset
are dropouts (positive label), resulting in an imbalanced
label distribution. The proof-of-concept analysis supports the
hypothesis of the paper that the age of the data affects the
models’ performance; therefore, training models on newer data
is important to increase performance. We argue that this
finding may suggest that the age of the data is pivotal to
training predictive models. For the academic background
dimension, the overall performance was poor for both student
cohorts.
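
The metrics discussed above follow directly from the four cells of a confusion matrix. The sketch below uses the standard definitions; the function names and the convention of returning 0.0 for empty denominators are choices made here, not taken from the source.

```python
def f1_score(precision, recall):
    """Harmonic mean of precision and recall."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


def classification_metrics(tp, fp, fn, tn):
    """Accuracy, precision, recall and F1, with dropout as positive class."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return {
        "accuracy": (tp + tn) / (tp + fp + fn + tn),
        "precision": precision,
        "recall": recall,
        "f1": f1_score(precision, recall),
    }
```

As a consistency check against Table 3, a precision of 0.573 with a recall of 1.000 (Performance, 2018) gives an F1 of about 0.73, and a precision of 0.982 with a recall of 0.675 (Effort, 2011) gives about 0.80.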


5. CONCLUSION 


In this paper, we explored the impact of log-based, engi- 
neered features that can be extracted from recorded student 
data on predicting student dropouts in Higher Education. In 
particular, we focused on investigating potential interactions 
between the engineered features and time — as represented 
by students’ admission year — that may affect the perfor- 
mance of student dropout predictive models. We argued 
that the age of the data we use to train machine-learning 
models for predicting dropouts will impact the models’ per- 
formance since socio-economic and cultural conditions, that 
arguably affect student retention, can change over time. For 
the purpose of this research, we engineered three sets of
student features from data collected in the SIS of the HEI: one
set describing the academic background of students, one set
describing the performance of students and one describing
the effort students put into their studies. To explore the
relationship between dropouts and features, and the
relationship between features and admission year, we combined
correlation analysis, MANOVA and regression analysis with
admission year as the interaction term.


The results suggested that the admission year can play a 
critical role on the importance of the selected features for 
predicting dropouts. The importance of the features may 
change based on the socio-economic status of the state [11] 
which is subject to changes for multiple reasons, such as po- 
litical functions, joining an economic trade or alliance, or 
even cultural changes and emergency situations, such as the 
COVID pandemic. For example, in our case, this is demon- 
strated by the importance of financial support provided by 
the state on dropout rates over the years that seems to be 
decreasing. One can argue that student dropouts in Higher 
Education is a complex topic that extends beyond the aca- 
demic institution and the students themselves but it reflects 
socio-economic, cultural and political aspects of the society 
or the state. Thus, we would expect that the predictive
power of engineered student features relating to societal or
financial aspects (such as the academic decisions students
make in terms of investing effort, and financial support)
is susceptible to change over time. On the other hand,
performance-related features (such as grades and positive 
or negative exam results) do not appear to interact with 
admission year but instead their effect remains steady over 
student cohorts, confirming prior work [14]. Features that 
aim to represent the students’ academic background may re- 
late to some extent to student dropouts — as the regression 
analysis suggested — but their predictive power is limited 
and their dependency on admission year requires further
investigation. To demonstrate the impact of time on predictive
performance, we presented an example where we trained a
binary classifier using time-sensitive features and we tested 
its performance on unseen data from two student cohorts 
that were admitted in the same HEI with a 7-year differ- 
ence. 


As a practical implication of this work, we envision establish- 
ing time-sensitive, predictive models for addressing student 
dropouts. Towards that direction, one approach would be to 
limit the datasets used for model training with respect to the 
chronology of the data, resulting in fewer older data as new 
data are received. However, this could lead to an insufficient
amount of data for training purposes. Another approach
would be to incorporate “forgetting” factors in order to
minimize the impact of old, non-relevant data. In this case,
forgetting could be implemented by applying weights to the
training set in such a way that temporally distant or
temporally irrelevant data receive lower weights (and thus have
less impact on the training) than recent entries. Similarly,
for random forest or decision tree models, one could
regulate the threshold limits for early stopping in tree growth
as a means to include the forgetting factor.
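
One way to realize such weights is an exponential decay over the age of each record. This is a sketch of one possible choice, not a method from the source: the function name, the exponential form, and the `decay` parameter are all assumptions made here for illustration.

```python
from math import exp


def forgetting_weights(admission_years, reference_year, decay=0.2):
    """Weight each record by exp(-decay * age), where age is in years.

    Records from the reference year get weight 1.0; older records
    get progressively smaller weights.
    """
    return [exp(-decay * (reference_year - year)) for year in admission_years]


# Hypothetical cohorts: the most recent admission year is the reference.
weights = forgetting_weights([2010, 2015, 2020], reference_year=2020, decay=0.1)
```

Weights of this kind could then be passed to any learner that accepts per-sample weights, for example through the `sample_weight` argument of scikit-learn's `LogisticRegression.fit`. The decay rate would have to be tuned, since too fast a decay reduces the effective amount of training data.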


In this research, we carried out our analysis on data collected
during the past decade from the same institution. This does
not allow us to generalize our findings across various
temporal and spatial contexts. Additionally, in this work we
only used basic information about students’ background and
study progress, excluding demographics. Further analysis
on an extended dataset may reveal significant patterns on
dropouts regarding cultural background or gender. However,
it is important to ensure the safe and ethical use of
sensitive and personal information of students and to
establish that future use of the outcomes aims to support students
and academic stakeholders in a fair and accountable context.
In future work, we aim to design a predictive model for
addressing dropouts in HE that will implement the forgetting
factor based on the data’s recency. For triangulation, we will
compare the forgetting factor’s impact both for a regression
model and for a random forest model and we will explore
further the impact of the forgetting factor in terms of
predictive accuracy, effectiveness and efficiency. Moreover, we
will consider the possibility of analyzing gender-segregated
data to explore whether the findings show gender bias [6].


MANOVA | Regression Analysis | Regression with Interaction Term
Feature Name F value Pr(>F) coef std err Pr(>|z|) coef std err Pr(>|z|)
Academic Background
normalized_score 11.88 5.9e-4 -0.55 0.22 0.01 -3.5e+2 1.94e+2 0.07
admission.special.conditions 8.02 4.7e-3 0.03 0.29 0.26 1.7e+2 2.7e+2 0.52
prev.study.level_Masters 4.58 0.03 -0.57 0.28 0.05 2.7e-1 1.5e-1 0.072
nr.of.prev..studies.in.UT 8.40 3.8e-3 0.10 0.06 0.079 2.3e+2 6.7e+1 7.4e-4
Performance
negative.results 56.787 6.5e-14 0.19 0.14 0.18 -7.1e-2 2.8e-1 0.80
grade.A 52.37 5.9e-13 -0.12 0.06 0.05 1.99e-3 4.3e-2 0.96
grade.B 225.32 <2.2e-16 -0.02 0.06 0.73 -5.5e-3 5.0e-2 0.91
grade.C 226.77 <2.2e-16 -0.06 0.06 0.30 -1.6e-2 4.9e-2 0.75
grade.D 125.77 <2.2e-16 -0.10 0.07 0.16 -2.2e-2 6.3e-2 0.73
grade.E 32.16 1.6e-08 -0.09 0.08 0.23 -7.3e-2 8.5e-2 0.39
grade.F 3.55 0.06 0.14 0.08 0.08 1.6e-3 5.96e-2 0.99
credits.earned 1685 <2.2e-16 -0.01 0.01 0.02 1.6e-3 5.1e-3 0.75
extracurricular.credits.earned 76.613 <2.2e-16 -2.6e-3 0.01 0.75 6.2e-3 7.5e-3 0.41
not.present 173.25 <2.2e-16 -1.7e-3 0.01 0.87 -7.3e-3 8.1e-3 0.36
sum_passed_grade 1381.2 <2.2e-16 0.18 0.14 0.19 -4.2e-2 2.9e-1 0.88
sum_failed_grade 108.05 <2.2e-16 -0.01 0.02 0.66
all.results 1455.1 <2.2e-16 -0.13 0.13 0.31 4.94e-2 2.8e-1 0.86
Effort
days.on.academic.leave 173.41 <2.2e-16 2.23 8.38e-2 <2e-16 7.5e-4 1.6e-4 2.6e-6
on.extended.study.period 253.53 <2.2e-16 2.16e-01 4.60e-02 2.6e-6 1.8e-2 4.1e-2 0.66
days.studying.abroad 240.55 <2.2e-16 -1.34e-02 1.37e-03 <2e-16 -1.2e-3 8.7e-4 0.17
days.as.visiting.student 6.8963 8.7e-3 -1.92e-3 1.6e-3 0.22
credits.cancelled.during.2w 476.68 <2.2e-16 1.01e-02 1.11e-03 <2e-16 5.5e-4 7.1e-4 0.44
workload 6.9 8.7e-3 -1.1 8.5e-1 0.21
nr.of.courses.registered 2501.5 <2.2e-16 -5.89e-01 2.81e-02 <2e-16 -1.8e-1 1.5e-2 <2e-16
credits.registered 2789.8 <2.2e-16 -3.36e-02 1.81e-03 <2e-16 1.1e-3 1.1e-3 0.32
nr.of.courses.with.any.grade 2774.9 <2.2e-16 5.44e-01 2.65e-02 <2e-16 1.7e-1 1.4e-2 <2e-16
nr.of.employment.contracts 20.657 5.6e-6 -6.1e-2 4.3e-2 0.16
total_economic_support 17.14 3.5e-5 -3.02e-04 4.07e-05 1.15e-13 1.4e-4 2.4e-5 1.8e-8
study_period_in_years 3941.1 <2.2e-16 1.29 7.71e-2 <2e-16 -1.6e-2 4.6e-2 0.73


Table 2: The results of MANOVA with the engineered features as the dependent variables and the admission year as the
independent variable, Regression Analysis without any interaction terms and with admission year as an interaction term. The
features that interact with admission year on the level p<0.05 are presented in bold letters.


Admission Year   Accuracy   Precision   Recall   F1
Academic Background
2011             0.718      0.500       0.025    0.048
2018             0.709      0.400       0.027    0.050
Performance
2011             0.912      0.937       0.738    0.825
2018             0.785      0.573       1.000    0.728
Effort
2011             0.905      0.982       0.675    0.800
2018             0.889      0.725       0.987    0.836


Table 3: The performance metrics of the three models that
predict student dropouts per dimension for two student
cohorts: the cohort admitted in year 2011 and the cohort
admitted in year 2018.
APPENDIX 


Feature                          Description

Academic Background
normalized_score                 The admission score normalized by the min and max score
admission.special.conditions     Student’s admission subject to special conditions
prev.study.level                 Student’s latest academic degree, such as high school graduate
nr.of.prev..studies.in.UT        Number of previous enrollments in the same HEI

Performance
nr.of.courses.with.any.grade     Registered courses with any outcome (positive or negative)
credits.earned                   Sum of credits the student earned
extracurricular.credits.earned   Credits for courses extra to student’s curricula
all.results                      Number of all results cumulatively up to today
negative.results                 Number of negative results up to today
pos.results                      Number of positive results up to today
grade{A, B, C, D, E, F}          Number of all grades {A, B, C, D, E, F} up to today
passed                           Number of passed, non-differentiated courses up to today
not.passed                       Number of not passed, non-differentiated courses up to today
not.present                      Number of non-taken exams due to absence up to today

Effort
days.on.academic.leave           Days the student was on academic leave
on.extended.study.period         1 when student was on extended study period, 0 otherwise
days.studying.abroad             Days student was studying abroad (e.g. on an Erasmus exchange)
days.as.visiting.student         Number of days as visiting student to other Estonian universities
credits.cancelled                Number of credit points that the student cancelled
nr.of.courses.registered         Number of courses the student registered
credits.registered               Number of credits the student registered
credits.fulfilled                Ratio of credits earned vs. credits registered
nr.of.employment.contracts       Number of contracts the student has with the HEI
total_financing                  Total amount of stipends and allowances
study.workload                   Full time or part time student
study_period_in_years            Number of years a student has been studying


Table 4: The engineered features for each dimension. By “up to today”, we mean the date of the data collection (19 Oct. 2020).
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ABSTRACT 


Knowledge tracing (KT), the task of tracking the knowledge
state of each student over time, has been studied actively
by artificial intelligence researchers. Recent reports have
described that Deep-IRT, which combines Item Response
Theory (IRT) with a deep learning model, provides superior
performance. Like IRT, it can express the ability of each
student and the difficulty of each item. However, its
interpretability and applicability remain limited compared
to those of IRT because the ability parameter depends on
each item. Namely, the ability estimate for the same student
at the same time might differ if the student attempts a
different item. To overcome those difficulties, this study
proposes a novel Deep-IRT model that models a student
response to an item by two independent networks: a student
network and an item network. Results of experiments
demonstrate that the proposed method improves on both the
prediction accuracy and the interpretability of earlier KT
methods.


Keywords 
Deep Learning, Item Response Theory, Knowledge Tracing 


1. INTRODUCTION 


Recently, along with the advancement of online education, 
Knowledge Tracing (KT) has attracted broad attention for 
helping students to learn effectively by presenting optimal 
problems and a teacher’s support [5, 14, 16, 22, 23, 24, 37, 
39, 43, 45, 46]. Important tasks of KT are tracing the stu- 
dent’s evolving knowledge state and discovering concepts 
that the student has not mastered based on the student’s 
prior learning history data. Furthermore, predicting a stu- 
dent’s performance (correct or incorrect responses to an un- 
known item) accurately is important for adaptive learning. 
Many researchers have developed various methods to solve 
KT tasks. Methods for KT are divisible into probabilistic 
approaches and deep-learning approaches. 


For example, Bayesian Knowledge Tracing (BKT), a tradi- 
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tional and well known probabilistic model for KT [1, 5, 8, 14, 
16, 22, 23, 26, 45], employs a Hidden Markov Model to trace 
a process of student ability growth. It predicts the proba- 
bility of a student responding to an item correctly. Item 
Response Theory (IRT) [3, 34, 35], which is used in the test 
theory area [10, 11, 12, 13, 28, 33, 36], has come to be used 
for KT [6, 40]. IRT predicts the probability that a student
answers an item correctly based on the student's latent
ability parameter and the item characteristic parameters.


In practice, a learning task is associated with multiple skills,
and students must master all of those skills to solve the task.
However, BKT and IRT are restricted to expressing only a
unidimensional ability.


To overcome these limitations, Deep Knowledge Tracing (DKT)
[24] was proposed as the first deep-learning-based method.
DKT employs long short-term memory (LSTM) [27] to
predict a student's performance. LSTM relaxes the restrictions
of skill separation and binary state assumptions. However,
the hidden states of LSTM only summarize the past sequence
of learning history data. Therefore, DKT does not explicitly
represent the student's ability for each skill.


To improve the DKT performance, various deep-learning-based
methods have been proposed [2, 4, 17, 19, 29, 30,
31, 38, 42, 44]. In particular, the dynamic key-value memory
network (DKVMN) was developed to exploit the relations
among underlying skills and to trace the respective knowledge
states [46]. To trace student ability, DKVMN uses a
Memory-Augmented Neural Network and attention mechanisms.
Furthermore, to improve the explanatory capabilities
of the parameters, Deep-IRT was proposed by combining
DKVMN with an IRT module [43]. Deep-IRT can
estimate a student's ability and an item's difficulty just as
standard IRT models can. However, the ability parameter of
Deep-IRT depends on each item characteristic because
it implicitly assumes that items with the same skills are
equivalent. That assumption does not hold when the item
difficulties for the same skills differ greatly. When items for
the same skills are not equivalent, a student's ability estimate
becomes difficult to interpret.


Most recently, Ghosh et al. (2020) proposed attentive knowledge
tracing (AKT) [7], which incorporates a forgetting function
for past data into its attention mechanisms. Additionally,
they indicated a problem by which earlier KT methods as- 
sumed that items with the same skills are equivalent. To re- 
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solve that difficulty, they employed both items and skills as 
inputs. AKT improved the predictive accuracy of student
performance. However, the interpretability of its parameters
is limited because it cannot express the transition of a
student's ability for each skill.


Earlier studies have attempted to develop deep-learning-based
methods whose parameters are as interpretable as those of IRT
models, but they have not achieved this for student
ability parameters, which are the most important for student
modeling. The problem is the difficulty of incorporating the
ability parameters and item parameters independently into
deep-learning-based methods without degrading prediction
accuracy. This study addresses that problem.


Recent studies of deep learning have shown that redundancy
of parameters relative to the training data reduces generalization
error, contrary to Occam's razor, and have clarified the
reasons [9, 20, 21]. Based on those state-of-the-art reports, this
study proposes a novel Deep-IRT that models a student's response
to an item by two independent redundant networks: a student
network and an item network. The proposed method
learns student parameters and item parameters independently
to avoid impairing the predictive accuracy. The student
network employs a memory network architecture, as DKVMN
does, to reflect dynamic changes in student abilities. Because
the student network is separated from the item network, the
ability parameters of the proposed method do not depend on
each item characteristic, and they have higher interpretability
than those of Deep-IRT. Moreover, the proposed
method employs both items and skills as inputs in a manner
different from that of Ghosh et al. (2020) [7]. Although Tsutsumi
et al. previously proposed a Deep-IRT for test theory, it cannot
be applied to KT because it assumes that a student's ability is
constant throughout the learning process [32].


2. RELATED WORK 


2.1 Item response theory 

There are many item response theory (IRT) models [3, 18, 
34, 35, 41]. This subsection briefly introduces two-parameter 
logistic model (2PLM): an extremely popular IRT model. In
2PLM, the probability of a correct answer given to item $j$ by
student $i$ with ability parameter $\theta_i \in (-\infty, \infty)$ is assumed
as

\[ P_j(\theta_i) = \frac{1}{1 + \exp(-1.7 a_j (\theta_i - b_j))}, \tag{1} \]

where $a_j \in (0, \infty)$ is the $j$-th item's discrimination parameter
expressing the discriminatory power for students' abilities,
and $b_j \in (-\infty, \infty)$ is the $j$-th item's difficulty parameter
representing the degree of difficulty.
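
As an illustration of equation (1), the following minimal sketch evaluates the 2PLM response probability. The function name `p_correct_2pl` and the parameter values are chosen here only for illustration; they do not come from the paper.

```python
import math


def p_correct_2pl(theta: float, a: float, b: float) -> float:
    """Probability of a correct answer under the 2PLM of equation (1).

    theta: student ability, a: item discrimination (> 0), b: item difficulty.
    """
    return 1.0 / (1.0 + math.exp(-1.7 * a * (theta - b)))


# Illustrative values only: an average student on an average item,
# then a stronger student on the same item.
print(p_correct_2pl(theta=0.0, a=1.0, b=0.0))
print(p_correct_2pl(theta=1.0, a=1.0, b=0.0))
```

When ability equals difficulty the probability is exactly one half, and it rises with ability at a rate governed by the discrimination parameter.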


2.2 Dynamic key-value memory network 

The salient feature of DKVMN is that it assumes $N$ underlying
skills and relations between those skills and the input (items).
Underlying skills are stored in key memory $M^k \in \mathbb{R}^{N \times d_k}$, whereas
value memory $M^v_t \in \mathbb{R}^{N \times d_v}$ holds the abilities for the underlying
skills at time $t$. Here, $d_k$ and $d_v$ are tuning parameters. To
express the $j$-th item, the input of DKVMN is a one-hot vector
$q_j \in \{0, 1\}^J$, where $J$ represents the number of items; the $j$-th
element of $q_j$ is 1 and the other elements are zeros. DKVMN
predicts the performance on item $j$ at time $t$ as explained below.


First, DKVMN calculates the attention, which indicates how
strongly item $j$ is related to each skill, as

\[ \beta_j^{(1)} = W^{(\beta 1)} q_j + \tau^{(\beta 1)}, \tag{2} \]
\[ w_{tl} = \mathrm{Softmax}\left( M^k_l \beta_j^{(1)} \right), \tag{3} \]

where $M^k_l$ represents the $l$-th row vector of $M^k$ and $w_{tl}$ signifies
the strength of the relation between skill $l$ and item $j$ addressed
by a student at time $t$. In addition, $W^{(\cdot)}$ denotes a weight
matrix or weight vector, and $\tau^{(\cdot)}$ denotes a bias vector or
scalar. Next, student vector $\theta_t^{(1)}$ is calculated as the
weighted sum of the value memory:

\[ \theta_t^{(1)} = \sum_{l=1}^{N} w_{tl} \left( M^v_{tl} \right)^{\top}. \tag{4} \]

Finally, DKVMN concatenates $\theta_t^{(1)}$ with $\beta_j^{(1)}$ and predicts the
correct-answer probability $P_{tj}$ for item $j$ as

\[ \theta_t^{(2)} = \tanh\left( W^{(\theta 2)} \left[ \theta_t^{(1)}, \beta_j^{(1)} \right] + \tau^{(\theta 2)} \right), \tag{5} \]
\[ P_{tj} = \sigma\left( W^{(P)} \theta_t^{(2)} + \tau^{(P)} \right), \tag{6} \]

where $M^v_{tl}$ represents the $l$-th row vector of $M^v_t$, $[\cdot]$ is a
concatenation of vectors, and $\sigma(\cdot)$ represents the sigmoid
function. DKVMN reportedly predicts performance accurately.
However, its parameters lack interpretability.
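
The attention and read steps of equations (3) and (4) can be sketched as follows. This is a toy illustration in plain Python, not the authors' implementation: the memory contents, their sizes, and the item embedding are invented values, and the helper names are hypothetical.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attention_weights(key_memory, item_embedding):
    """Equation (3): one weight per latent skill (row of the key memory)."""
    scores = [sum(k * e for k, e in zip(row, item_embedding)) for row in key_memory]
    return softmax(scores)


def read_value_memory(weights, value_memory):
    """Equation (4): attention-weighted sum of the value-memory rows."""
    width = len(value_memory[0])
    return [sum(w * row[d] for w, row in zip(weights, value_memory)) for d in range(width)]


# Toy example with N = 2 latent skills, d_k = 2, d_v = 3 (all values made up).
key_memory = [[1.0, 0.0], [0.0, 1.0]]
value_memory = [[0.2, 0.4, 0.6], [1.0, 1.0, 1.0]]
item_embedding = [2.0, 0.0]  # stands in for the embedded item of equation (2)

weights = attention_weights(key_memory, item_embedding)
student_vector = read_value_memory(weights, value_memory)
print(weights, student_vector)
```

Because the weights depend on the item embedding, the vector read from the value memory changes with the item even when the memory itself is unchanged, which is the item dependence discussed in the text.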


2.3. Deep-IRT 
Deep-IRT is implemented by combining DKVMN with an
IRT module [43] to improve the DKVMN interpretability.
Deep-IRT exploits both the strong prediction ability of DKVMN
and the interpretable parameters of IRT. Deep-IRT adds a
hidden layer to DKVMN to obtain the student ability and
item difficulty. Specifically, when a student attempts item $j$
at time $t$, an ability $\theta_{tj}^{(3)}$ and item difficulty $\beta_j^{(2)}$ are
calculated as shown below.

\[ \theta_{tj}^{(3)} = \tanh\left( W^{(\theta 3)} \theta_t^{(2)} + \tau^{(\theta 3)} \right), \tag{7} \]
\[ \beta_j^{(2)} = \tanh\left( W^{(\beta 2)} \beta_j^{(1)} + \tau^{(\beta 2)} \right). \tag{8} \]

The prediction is based on the difference between $\theta_{tj}^{(3)}$ and
$\beta_j^{(2)}$, as in IRT.

\[ P_{tj} = \sigma\left( 3.0 \times \theta_{tj}^{(3)} - \beta_j^{(2)} \right). \tag{9} \]

Here, ability $\theta_{tj}^{(3)}$ is calculated using $w_{tl}$ in equation (6),
which depends on the item to solve because it implicitly assumes
that items with the same skills are equivalent. In
other words, the ability estimate for the same student and
time might differ if the student attempts a different item.
Furthermore, in equation (7), Deep-IRT uses item vector
$\beta_j^{(1)}$ to calculate $\theta_{tj}^{(3)}$. An important difficulty is that a
student's ability, which depends on each item, hinders the
interpretability of the parameters. Although Tsutsumi et al.
[32] also proposed a Deep-IRT as a test theory, its purpose
differs from that of this study because, as mentioned before,
it cannot be applied to KT.
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Figure 1: Network architecture for Deep-IRT with indepen- 
dent student and item networks. The yellow components rep- 
resent the process of getting the attention weight. Also, the 
green components are associated with the student network 
and the process of updating the value memory. The blue 
components are associated with the item network. 


3. DEEP-IRT WITH INDEPENDENT STU- 
DENT AND ITEM NETWORKS 


To resolve the difficulty described above, this study proposes
a novel Deep-IRT method comprising two independent neural
networks: the student network and the item network,
as shown in Figure 1. The student network employs a memory
network architecture, as DKVMN does, to capture changes
in student ability comprehensively. The item network takes
inputs of two kinds: the item attempted by a student
and the skills necessary to solve the item. Using the outputs
of both networks, the probability of a student answering an
item correctly can be calculated.


The proposed method can estimate student parameters and
item parameters independently without a decline in prediction
accuracy because the two independent networks
are designed to be more redundant than those of earlier
methods, based on state-of-the-art reports [9, 20, 21]. The
proposed method predicts $P_{tj}$, the probability of a correct
answer to item $j$ at time $t$, using the item difficulties
and the student abilities, as follows.


3.1 Item network 

In the item network, two difficulty parameters of item $j$
are estimated: the item characteristic difficulty parameter
$\beta^{j}_{\mathrm{item}}$ and the skill difficulty $\beta^{j}_{\mathrm{skill}}$ to solve item $j$. The
item characteristic difficulty parameter indicates the unique
difficulties of the item, excepting the required skill difficulty.
The proposed method expresses item difficulty as the sum
of the two difficulty parameters $\beta^{j}_{\mathrm{item}}$ and $\beta^{j}_{\mathrm{skill}}$.


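As with DKVMN, to express the $j$-th item, an input of the
item network is a one-hot vector $q_j \in \mathbb{R}^J$ as shown below.

\[ q_{jm} = \begin{cases} 1 & (j = m) \\ 0 & (\text{otherwise}) \end{cases} \tag{10} \]

Here, $J$ stands for the number of items. The item network
comprises $n$ layers. The item characteristic difficulty parameter
of item $j$ is calculated using a feed forward neural
network as

\[ \beta^{(1)}_{\mathrm{item}} = \tanh\left( W^{(\beta_{\mathrm{item}} 1)} q_j + \tau^{(\beta_{\mathrm{item}} 1)} \right), \tag{11} \]
\[ \beta^{(k)}_{\mathrm{item}} = \tanh\left( W^{(\beta_{\mathrm{item}} k)} \beta^{(k-1)}_{\mathrm{item}} + \tau^{(\beta_{\mathrm{item}} k)} \right), \tag{12} \]
\[ \beta^{j}_{\mathrm{item}} = W^{(\beta_{\mathrm{item}})} \beta^{(n)}_{\mathrm{item}} + \tau^{(\beta_{\mathrm{item}})}, \tag{13} \]

where $k = \{2, \ldots, n\}$. The last layer $\beta^{j}_{\mathrm{item}}$ represents the $j$-th
item characteristic difficulty parameter.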


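Similarly, to compute the difficulty of skills, the proposed
method uses the input of necessary skills $s_j \in \mathbb{R}^S$ as
presented below.

\[ s_{jm} = \begin{cases} 1 & (\text{item } j \text{ requires skill } m) \\ 0 & (\text{otherwise}) \end{cases} \tag{14} \]

Here, $S$ represents the number of skills:

\[ \beta^{(1)}_{\mathrm{skill}} = \tanh\left( W^{(\beta_{\mathrm{skill}} 1)} s_j + \tau^{(\beta_{\mathrm{skill}} 1)} \right), \tag{15} \]
\[ \beta^{(k)}_{\mathrm{skill}} = \tanh\left( W^{(\beta_{\mathrm{skill}} k)} \beta^{(k-1)}_{\mathrm{skill}} + \tau^{(\beta_{\mathrm{skill}} k)} \right), \tag{16} \]
\[ \beta^{j}_{\mathrm{skill}} = W^{(\beta_{\mathrm{skill}})} \beta^{(n)}_{\mathrm{skill}} + \tau^{(\beta_{\mathrm{skill}})}, \tag{17} \]

where $k = \{2, \ldots, n\}$. The last layer $\beta^{j}_{\mathrm{skill}}$ denotes the
difficulty parameter of the required skills to solve the $j$-th item.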
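
The two branches of the item network can be sketched as a pair of small feed-forward networks over a one-hot item vector and a multi-hot skill vector. The sketch below is only an illustration under stated assumptions: the weights are random rather than trained, the layer sizes are arbitrary, and all helper names (`one_hot`, `multi_hot`, `random_layer`, `difficulty`) are hypothetical.

```python
import math
import random


def one_hot(index, size):
    """Item input: 1 at the item's position, 0 elsewhere."""
    return [1.0 if m == index else 0.0 for m in range(size)]


def multi_hot(indices, size):
    """Skill input: 1 for every skill the item requires, 0 elsewhere."""
    wanted = set(indices)
    return [1.0 if m in wanted else 0.0 for m in range(size)]


def dense(weights, bias, x, activation=None):
    """One fully connected layer; `weights` is a list of rows."""
    out = [sum(w * v for w, v in zip(row, x)) + b for row, b in zip(weights, bias)]
    return [activation(o) for o in out] if activation else out


def random_layer(rng, n_out, n_in):
    """Untrained placeholder weights, used here only to make the sketch run."""
    weights = [[rng.uniform(-0.5, 0.5) for _ in range(n_in)] for _ in range(n_out)]
    return weights, [0.0] * n_out


def difficulty(x, hidden_layers, output_layer):
    """tanh hidden layers followed by a linear output that yields one scalar."""
    h = x
    for weights, bias in hidden_layers:
        h = dense(weights, bias, h, math.tanh)
    w_out, b_out = output_layer
    return dense(w_out, b_out, h)[0]


rng = random.Random(0)
J, S, hidden = 5, 3, 4  # toy numbers of items, skills, and hidden units
item_hidden = [random_layer(rng, hidden, J), random_layer(rng, hidden, hidden)]
item_out = random_layer(rng, 1, hidden)
skill_hidden = [random_layer(rng, hidden, S), random_layer(rng, hidden, hidden)]
skill_out = random_layer(rng, 1, hidden)

beta_item = difficulty(one_hot(2, J), item_hidden, item_out)
beta_skill = difficulty(multi_hot([0, 2], S), skill_hidden, skill_out)
print(beta_item + beta_skill)  # total difficulty of the toy item
```

Keeping the two branches separate means that items sharing the same skills still receive different total difficulties through their item-specific term.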


3.2 Student network 


In the student network, the proposed method calculates $\theta_1^{(t)}$
based on the past response history as

\[ \theta_1^{(t)} = \sum_{l=1}^{N} M^v_{tl}, \tag{18} \]

where $M^v_t$ is a memory matrix holding a student's latent
knowledge state, which is estimated similarly to DKVMN.
Next, an interpretable student ability vector $\theta^{(t)}$ is estimated
as follows. Therein, $n$ represents the number of hidden
layers, which is decided depending on the prediction accuracy
for actual data.

\[ \theta_k^{(t)} = \tanh\left( W^{(\theta_k)} \theta_{k-1}^{(t)} + \tau^{(\theta_k)} \right), \tag{19} \]

\[ \theta^{(t)} = W^{(\theta)} \theta_n^{(t)}, \tag{20} \]

where $k = \{2, \ldots, n\}$. Unlike Deep-IRT, the proposed
method does not multiply the attention in equation (18).
In addition, $\theta^{(t)}$ is not calculated using features of items,
as is done in equations (5) and (7). Therefore, the ability
parameter vector $\theta^{(t)}$ does not depend on each item. Namely,
it is independent of the difficulty parameters. Each of its
values denotes the ability for the corresponding latent skill
because it is independent of any item. Therefore, $\theta^{(t)}$ can
be interpreted as a measurement model such as a
multidimensional IRT [25].
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3.3. Prediction of student response to an item 
The proposed method predicts a student's response probability
to an item using the difference between a student's
ability $\theta^{(t)}$ to solve item $j$ at time $t$ and the sum of two
difficulty parameters $\beta^{j}_{\mathrm{item}}$ and $\beta^{j}_{\mathrm{skill}}$.

\[ P_{tj} = \sigma\left( 3.0 \times \theta^{(t)} - \left( \beta^{j}_{\mathrm{item}} + \beta^{j}_{\mathrm{skill}} \right) \right). \tag{21} \]

After the procedure, the value memory is updated using
$c_j$ based on the input $q_j$ and actual performance, as in
DKVMN [46].
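
The prediction step of equation (21) can be illustrated with scalar inputs. This is a minimal sketch assuming that the student ability and the two difficulties have already been computed; the function name `predict_correct` and the example numbers are hypothetical.

```python
import math


def sigmoid(x: float) -> float:
    """Logistic function, written to avoid overflow for large |x|."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    ex = math.exp(x)
    return ex / (1.0 + ex)


def predict_correct(ability: float, beta_item: float, beta_skill: float) -> float:
    """Scaled ability minus the summed item and skill difficulties."""
    return sigmoid(3.0 * ability - (beta_item + beta_skill))


# Illustrative values only.
print(predict_correct(ability=0.4, beta_item=0.3, beta_skill=0.5))
```

Raising either difficulty term lowers the predicted probability by the same amount, so the item-specific and skill-specific contributions can be read off separately.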


The loss function of the proposed method employs cross-entropy,
which reflects classification errors. The cross-entropy
of the predicted responses $P_{tj}$ and the true responses $u_{tj}$ is
calculated as

\[ \ell(u_{tj}, P_{tj}) = -\sum_{t} \left( u_{tj} \log P_{tj} + (1 - u_{tj}) \log(1 - P_{tj}) \right), \tag{22} \]

where $u_{tj}$ is the true response to item $j$ at time $t$. The
student's response $u_{tj}$ is recorded as 1 when the student
answers the item correctly and 0 otherwise. All parameters
are learned simultaneously using a well known optimization
algorithm: adaptive moment estimation [15].
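
The cross-entropy of equation (22) can be computed for one response sequence as in the sketch below. The clipping constant `eps` is an addition made here for numerical safety and is not part of the paper's formula.

```python
import math


def cross_entropy(responses, probabilities, eps=1e-12):
    """Summed binary cross-entropy over one response sequence (equation (22)).

    responses: true responses (1 = correct, 0 = incorrect).
    probabilities: predicted probabilities of a correct response.
    eps: clipping bound that keeps log() finite (an assumption of this sketch).
    """
    total = 0.0
    for u, p in zip(responses, probabilities):
        p = min(max(p, eps), 1.0 - eps)
        total -= u * math.log(p) + (1 - u) * math.log(1.0 - p)
    return total


# Illustrative values only.
print(cross_entropy([1, 0, 1], [0.9, 0.2, 0.6]))
```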


4. PREDICTIVE ACCURACY 
4.1 Datasets 


We conducted experiments to compare the performance of our
approach against existing solutions. This section compares
the accuracy of student performance prediction by the proposed
method with that of earlier methods
(DKT, DKVMN, Deep-IRT, AKT) using four benchmark
datasets: ASSISTments2009¹, ASSISTments2015², Statics2011³,
and KDDcup⁴. ASSISTments2009 and KDDcup have
item and skill tags, although most methods explained in
the relevant literature adopt only the skill tag as an input.
However, methods with skill inputs rely on the assumption
that items with the same skill are equivalent [7]. That
assumption does not hold when the difficulties of items with
the same skill differ greatly. Therefore, as inputs to AKT and
the proposed method, we employ not only skills but also
items. ASSISTments2015 has only the skill tag. Therefore,
for that dataset, we employ only the skill tag as an input.


Table 1 presents the number of students (No. Students),
the number of skills (No. Skills), the number of items (No.
Items), the rate of correct responses (Rate Correct), the
average number of items that a student addressed (Learning
Length), and the proportion of items addressed by fewer
than 10 students (Sparsity). For all the datasets,
we excluded students who addressed fewer than five items.
Additionally, we set 200 items as the upper limit of the input
length according to an earlier study [43]. When a student's
response sequence is longer than 200 items, we use the first
200 responses for all methods.
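
The two filtering rules described above can be sketched as follows. The data layout (a dictionary mapping a student identifier to a list of `(item, response)` pairs) and the function name `preprocess` are assumptions made for illustration; the benchmark files themselves use other formats.

```python
def preprocess(sequences, min_length=5, max_length=200):
    """Drop students with fewer than `min_length` responses and keep only
    the first `max_length` responses of the remaining students."""
    return {
        student: responses[:max_length]
        for student, responses in sequences.items()
        if len(responses) >= min_length
    }


# Toy data: (item id, response) pairs per student.
raw = {
    "s1": [(1, 1), (2, 0), (3, 1)],           # too short, removed
    "s2": [(i, i % 2) for i in range(250)],   # truncated to 200
    "s3": [(i, 1) for i in range(5)],         # kept as is
}
clean = preprocess(raw)
print({student: len(seq) for student, seq in clean.items()})
```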


¹ https://sites.google.com/site/assistmentsdata/home/assistment-2009-2010-data
² https://sites.google.com/site/assistmentsdata/home/2015-assistments-skill-builder-data
³ https://pslcdatashop.web.cmu.edu/DatasetInfo?datasetId=507
⁴ https://pslcdatashop.web.cmu.edu/KDDCup/downloads.jsp
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Figure 2: AUC and the Number of Layers 


4.2 Hyperparameter selection and evaluation 
We used ten-fold cross-validation to evaluate the prediction
accuracies of the methods. The item parameters and the
hyperparameters are learned using 70% of each dataset. Given
the estimated hyperparameters, a student's ability is
estimated at each time point using the remaining 30% of each
dataset. For all methods, the hidden layer size and memory
dimension are chosen from {10, 20, 50, 100, 200} using
cross-validation. In addition, for the earlier methods, we used the
hyperparameters reported in earlier studies [7, 43].


To ascertain the number of layers $n$ for the proposed method,
we conducted preliminary experiments using
ASSISTments2009 while changing the number of layers. The
results are presented in Figure 2. As shown in the figure,
the AUC score reaches its highest level when $n = 2$ and $n = 4$.
Based on this result, we employ $n = 2$ for the following
experiments because the computation time of the proposed
method increases exponentially as the number of layers increases.


If the predicted correct answer probability for the next item
is 0.5 or more, then the student's response to the next item
is predicted as correct. Otherwise, the student's response
is predicted as incorrect. For this study, we use three
metrics for prediction accuracy: Accuracy (Acc) score, AUC
score, and F1 score. The first, Acc, represents the
concordance rate between the predicted responses
and the true responses. The second, AUC, represents
the predictive accuracy of the correct answer probabilities.
The third, F1, is the average of the F1 score of incorrect answer
prediction and the F1 score of correct answer prediction.
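
The three metrics can be computed from true responses and predicted probabilities as in the sketch below. It is a plain-Python illustration with hypothetical function names and toy numbers; the AUC is computed by comparing every correct/incorrect pair, which is simple but quadratic in the number of responses.

```python
def accuracy(y_true, y_prob, threshold=0.5):
    """Share of responses whose thresholded prediction matches the truth."""
    y_pred = [1 if p >= threshold else 0 for p in y_prob]
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def auc(y_true, y_prob):
    """Probability that a correct response is scored above an incorrect one."""
    pos = [p for t, p in zip(y_true, y_prob) if t == 1]
    neg = [p for t, p in zip(y_true, y_prob) if t == 0]
    wins = sum(1.0 if a > b else 0.5 if a == b else 0.0 for a in pos for b in neg)
    return wins / (len(pos) * len(neg))


def f1_for_class(y_true, y_pred, label):
    """F1 score treating `label` as the positive class."""
    tp = sum(t == label and p == label for t, p in zip(y_true, y_pred))
    fp = sum(t != label and p == label for t, p in zip(y_true, y_pred))
    fn = sum(t == label and p != label for t, p in zip(y_true, y_pred))
    return 2 * tp / (2 * tp + fp + fn) if tp else 0.0


def macro_f1(y_true, y_prob, threshold=0.5):
    """Average of the F1 scores for incorrect (0) and correct (1) answers."""
    y_pred = [1 if p >= threshold else 0 for p in y_prob]
    return (f1_for_class(y_true, y_pred, 0) + f1_for_class(y_true, y_pred, 1)) / 2


# Toy responses and predicted probabilities (made up for illustration).
y_true = [1, 0, 1, 1, 0]
y_prob = [0.9, 0.4, 0.6, 0.3, 0.2]
print(accuracy(y_true, y_prob), auc(y_true, y_prob), macro_f1(y_true, y_prob))
```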


4.3 Results 

The respective values of Acc, AUC, and F1 for those benchmark
datasets are shown in Table 2. Results show that
the proposed method with item and skill inputs provides
the best average Acc and F1.
Especially noteworthy is that the proposed method outperforms
AKT, which is the most advanced method, on these
averages. Furthermore, the proposed method with item and
skill inputs provides better performance than that with only
skill or only item inputs. These results indicate that parameter
estimation, not only with skill but also with item, improves
the predictive accuracy.
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Table 1: Summary of Benchmark Datasets 


Dataset     | No. Students | No. Skills | No. Items | Rate Correct | Learning Length | Sparsity
ASSIST 2009 | 4,151        | 111        | 26,684    | 68.0%        | 70.8            | 55.2%
ASSIST 2015 | 19,840       | 100        | N/A       | 73.2%        | 34.2            | 12.6%
Statics2011 | 333          | 156        | 1,223     | 77.7%        | 180.9           | 2.6%
KDDcup      | 820          | 43         | 476       | 78.3%        | 11.9            | 57.8%

Table 2: Predictive Accuracy of Student Performance with Benchmark Datasets

Dataset     | Metric | DKT   | DKVMN | Deep-IRT | AKT   | AKT (item&skill) | Proposed | Proposed (item&skill)
ASSIST2009  | Acc    | 0.759 | 0.763 | 0.768    | 0.692 | 0.755            | 0.768    | 0.765
ASSIST2009  | AUC    | 0.781 | 0.807 | 0.806    | 0.717 | 0.811            | 0.818    | 0.810
ASSIST2009  | F1     | 0.697 | 0.714 | 0.718    | 0.639 | 0.726            | 0.725    | 0.722
ASSIST2015  | Acc    | 0.754 | 0.749 | 0.747    | 0.757 | N/A              | 0.752    | N/A
ASSIST2015  | AUC    | 0.730 | 0.732 | 0.727    | 0.760 | N/A              | 0.751    | N/A
ASSIST2015  | F1     | 0.433 | 0.541 | 0.540    | 0.616 | N/A              | 0.543    | N/A
Statics2011 | Acc    | 0.769 | 0.805 | 0.817    | 0.809 | 0.818            | 0.819    | 0.822
Statics2011 | AUC    | 0.666 | 0.819 | 0.822    | 0.821 | 0.827            | 0.821    | 0.821
Statics2011 | F1     | 0.483 | 0.679 | 0.681    | 0.690 | 0.677            | 0.679    | 0.690
KDDcup      | Acc    | 0.784 | 0.773 | 0.792    | 0.774 | 0.780            | 0.786    | 0.802
KDDcup      | AUC    | 0.538 | 0.594 | 0.588    | 0.606 | 0.610            | 0.588    | 0.610
KDDcup      | F1     | 0.439 | 0.439 | 0.455    | 0.441 | 0.449            | 0.469    | 0.478
Average     | Acc    | 0.767 | 0.773 | 0.781    | 0.758 | 0.784            | 0.781    | 0.796
Average     | AUC    | 0.679 | 0.738 | 0.736    | 0.726 | 0.749            | 0.745    | 0.747
Average     | F1     | 0.513 | 0.593 | 0.599    | 0.597 | 0.617            | 0.604    | 0.630



However, AKT with item and skill inputs shows the best
average AUC. AKT with item and skill inputs also provides
higher performance than AKT with only skill or only item
inputs, as shown in [7]. Ghosh et al. (2020) reported that
AKT is more effective for large datasets. Accordingly,
AKT provides the best performance for all the metrics
of ASSISTments2015, which has an extremely large number
of students.


Furthermore, surprisingly, the averages of Acc, AUC, and
F1 obtained using the proposed method with skill input are
at least as good as those of Deep-IRT, although the proposed
method separates the student and item networks. This result
implies that redundant deep student and item networks function
effectively for performance prediction. These results are
explainable from reports of state-of-the-art methods [9, 20, 21].


The performance results obtained using DKVMN are almost 
identical to those obtained using Deep-IRT because they 
have almost identical network structures. Results show that 
DKT provides the worst performance among the methods 
studied here. 


5. PARAMETER INTERPRETABILITY 


5.1 Interpretability of difficulty parameters 

To evaluate the interpretability of the difficulty parameters
of the proposed method, we compare the true IRT parameters
with those estimated by the proposed method and by Deep-IRT
using simulated data. The dataset includes 2000 students'
responses to 50 items and is generated from 2PLM as shown in
equation (1). The parameters are drawn from the priors
$\theta \sim N(0, 1)$, $a \sim LN(0, 1)$, and $b \sim N(0, 1)$.
We estimated the parameters of the proposed method and
Deep-IRT using the dataset. Table 3 shows the Pearson correlation
between the true parameters and the parameters
estimated by the proposed method
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Table 3: Pearson correlation 


parameter | Deep-IRT Proposed 
difficulty 0.611 0.886 
accuracy 0.694 0.695 


and Deep-IRT. Additionally, we show the prediction accuracies
of the proposed method and Deep-IRT for the dataset.
The proposed method provides a higher correlation with the
true difficulty parameters than Deep-IRT does (0.886 versus
0.611), while its accuracy is comparable (0.695 versus 0.694).
The results demonstrate that the two independent networks of
the proposed method improve the interpretability of the
estimated parameters without degrading the prediction accuracy.
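
The simulation design can be sketched as follows: abilities and item parameters are drawn from the stated priors, responses are generated from the 2PLM of equation (1), and a Pearson correlation is computed against the true difficulties. Fitting the proposed method or Deep-IRT is outside the scope of this sketch, so the per-item rate of correct answers is used here only as a crude stand-in for an estimated parameter; the seed and function names are likewise assumptions.

```python
import math
import random


def p_correct(theta, a, b):
    """2PLM probability of equation (1), written to avoid overflow."""
    z = 1.7 * a * (theta - b)
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    ez = math.exp(z)
    return ez / (1.0 + ez)


def simulate_2pl(n_students=2000, n_items=50, seed=0):
    """Draw theta ~ N(0,1), a ~ LN(0,1), b ~ N(0,1) and simulate responses."""
    rng = random.Random(seed)
    theta = [rng.gauss(0.0, 1.0) for _ in range(n_students)]
    a = [rng.lognormvariate(0.0, 1.0) for _ in range(n_items)]
    b = [rng.gauss(0.0, 1.0) for _ in range(n_items)]
    responses = [
        [1 if rng.random() < p_correct(theta[i], a[j], b[j]) else 0
         for j in range(n_items)]
        for i in range(n_students)
    ]
    return theta, a, b, responses


def pearson(x, y):
    """Pearson correlation coefficient of two equally long sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
    sx = math.sqrt(sum((xi - mx) ** 2 for xi in x))
    sy = math.sqrt(sum((yi - my) ** 2 for yi in y))
    return cov / (sx * sy)


theta, a, b, responses = simulate_2pl()
# Crude difficulty proxy: the share of incorrect answers per item.
error_rate = [1.0 - sum(row[j] for row in responses) / len(responses)
              for j in range(len(b))]
print(pearson(b, error_rate))
```

In the study itself, the second argument of the correlation would be the difficulty estimated by each model rather than this proxy.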


5.2 Student ability transitions 

This section shows student ability transitions using the proposed
method. Visualizing the ability transition for each
skill is helpful for both students and teachers because they
can discover student strengths and weaknesses and can improve
the learning method to fill in the learning gaps. Yeung
[43] demonstrated a student ability transition for each
skill using Deep-IRT. However, the results included some
counter-intuitive ability estimates. For example, even when
the student answered incorrectly, the corresponding student
ability estimate increased. Moreover, Deep-IRT cannot identify
a relation among multidimensional skills. There are
cases in which a student's ability for low-level skills decreases
even when the student responds correctly to items for
high-level skills. Such unstable behavior might cause serious
difficulties when Deep-IRT is used as a student model
because it can confuse students and teachers.


Figure 3 depicts a student’s ability transitions of the pro- 
posal for the ASSIST 2009 dataset. The vertical axis shows 
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Figure 3: An example of a student ability transition from the ASSIST2009 dataset. The skill tags are classified respectively as 
equation solving two or fewer steps (blue), ordering fractions (orange), finding percents (green), and equation solving more than 
two steps (red). The student responses to items are shown at the bottom of the graph. 


the student ability on the left side, with the student's
response to an item on the right side. The horizontal axis
shows the item number. The student's response is 1 when
the student answers the item correctly; it is 0 otherwise. The
student attempted the skills of "equation solving more than two
steps" (shown in red), "equation solving two or fewer steps"
(shown in blue), "ordering fractions" (shown in orange), and
"finding percents" (shown in green). Figure 3 can be
interpreted as explained below.


1. Theta 1 decreases when the student responds to item 2
"ordering fractions" (orange) incorrectly and it increases
when the student responds to item 3 correctly. Therefore,
theta 1 indicates the ability of "ordering fractions".


2. Items 6-17 correspond to the skill of "equation solving
two or fewer steps" (blue). Theta 2 indicates the ability
of "equation solving two or fewer steps" because theta 2
greatly increases while the student answers correctly.


3. For the skill of "finding percents" (green), the student
answers all items incorrectly. Theta 3 indicates the
ability of "finding percents" (green) because it greatly
decreases in items 18-24.


4. Items 4, 5, and 25-30 correspond to the skill of "equation
solving more than two steps" (red). Theta 4 decreases
when the student answers items 4 and 5 incorrectly,
and increases when the student answers
items 26-29 correctly. Therefore, theta 4 represents
the ability of "equation solving more than two steps"
(red).


Figure 3 shows that the proposed method estimates the abil- 
ity of each skill to reflect the student responses. Addition- 
ally, it estimates relations among the skills. Therefore, when 
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a student responds to an item correctly/incorrectly, not only 
does the corresponding skill ability increase/decrease; those 
for other skills increase/decrease as well. Consequently, the 
results demonstrate that the proposed method improves both 
the interpretability and the prediction accuracies of Deep- 
IRT. 


6. CONCLUSIONS 


This study proposed a novel Deep-IRT that models a student's
response to an item by two independent redundant
networks: a student network and an item network. Because
two independent redundant neural networks are used, the
parameters of the proposed method are highly interpretable
while high prediction accuracy is maintained. Moreover,
the proposed method employs both items and skills as inputs.
Experiments demonstrated that, among the deep-learning-based
methods, the proposed method with item and skill inputs
provided the best average Acc and F1.
The results also showed that AKT with item and skill
inputs provided the best average AUC. In particular,
AKT provided the best performance for large datasets, as
Ghosh et al. (2020) reported [7]. In addition, the experimental
results show that the parameters of the proposed method
are more interpretable than those of Deep-IRT. This study
employed slightly redundant deep networks compared to earlier
methods. As future work, we intend to use the proposed
method to investigate the performance of more redundant
and deeper networks. In addition, we will try to optimize a
forgetting function for past data to maximize the prediction
accuracy for large datasets.
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ABSTRACT 


Promoting creativity is considered an important goal of
education, but creativity is notoriously hard to define and
measure. In this paper, we make the journey from defining a
formal creativity measure to applying the measure in a practical
domain. The measure relies on core theoretical concepts in
creativity theory, namely fluency, flexibility, and originality.
We adapt the creativity measure for Scratch projects.
We designed a machine learning model for predicting the
creativity of Scratch projects, trained and evaluated on ratings
collected from expert human raters. Our results show
that the automatic creativity ratings achieved by the model
aligned with the rankings of the projects by the expert raters
more than the experts agreed with each other. This is a first
step in providing computational models for describing creativity
that can be applied to educational technologies, and
in scaling up the benefit of creativity education in schools.


Keywords 
Creativity, Creativity Tests, Visual Programming Environ- 
ments 


1. INTRODUCTION 

Modern education generally tries to foster creativity in student
problem solving [7] [1] [17]. There is wide agreement
that creative solutions must not only solve the task but
should additionally be original, i.e. distant from usual
solutions to the task, flexible, i.e. employ very different
concepts, and fluent, i.e. employ many concepts [10] [23]. However,
creativity is notoriously hard to quantify in practice [7].
When confronted with two student solutions for a
given learning task, different teachers may well disagree which
one is more creative [14].
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In this paper, we make the journey from a definition of cre- 
ativity which relate to prior concepts in the literature [23], 
to applying the definition to the Scratch programming
environment, and using the measure to automatically quantify
the creativity score of projects in Scratch. We formalize
originality of a product as the distance to usual solutions, 
flexibility as the distance between concepts in the student’s 
solution, and fluency as the distance to an empty solution. 


We apply the formalization to automatically measure the 
creativity on a set of projects from the popular visual pro- 
gramming language Scratch [13]. Using machine learning, 
we build a model that predicts the creativity ratings of 
Scratch projects using fluency, flexibility, and originality 
measures of our approach. We compare these automatic cre- 
ativity ratings to those of five human experts, which were 
collected using a comprehensive user study. We find that 
the automatic ratings agree with the rankings of the experts 
more than the experts agreed with each other. We provide 
several examples that highlight the benefit of the model in 
light of the fact that human raters may disagree on the de- 
gree of creativity of Scratch projects. 


The contribution of this work is in providing an automatic
framework for defining and detecting creativity that can
scale up teachers’ abilities to support creative thinking in
students.


2. RELATED WORK 


Prior works on measuring creativity have mostly been con- 
cerned with psychological tests, such as Williams’ tests on 
creative thinking or the Torrance test of creative think- 
ing [10] [23]. However, such tests do not account for changes 
in creative ability, motivation, knowledge, and social con- 
text over time [2]. Accordingly, one should wish to measure 
creativity often and monitor the development across chang- 
ing circumstances. This could be supported by automatic 
creativity assessment, towards which we work in this paper. 


To measure creativity at one specific point in time, we follow 
Torrance’s work and grade creativity on three scales, namely 
fluency, flexibility, and originality [10] [23]. Historically, these
three scales grew out of Guilford’s model of the structure of
intellect [6], which includes divergent production, i.e. the
skill of generating a wide variety of ideas on the same topic. 
In particular, fluency refers to the sheer amount of ideas gen- 
erated, flexibility to the number of distinct classes of ideas, 
and originality to the infrequency of the ideas compared to a 
general sample of the population. In addition to these three 
scales, Torrance’s tests also include scales regarding the ab- 
stractness of generated ideas, the elaboration on ideas, and 
the resistance to premature closure during the generation 
process [10]. In this work, we stick to fluency, flexibility, and 
originality because they permit quite a direct formalization 
in mathematical and computable terms. We further cover
elaboration, to some extent, in our notion of fluency. 


Multiple works focused on using technologies to infer cre- 
ative thinking in numerous educational disciplines, such as 
math and programming [21] [12]. Previous research has shown
that creativity is related to positive learning gains, and us- 
ing technology to generate creativity is an active field of 
study [24]. Hershkovitz et al. [8] [9] examined the relation- 
ship between creativity and computational thinking within 
a block-based multi-level game environment for children’s 
programming. Their findings show that creativity can contribute
to the acquisition of computational thinking and can
also be transferred across domains, stressing the importance 
of fostering creativity while promoting computational think- 
ing. 


Finally, the application domain of our work is Scratch, a 
block-based visual programming language targeted primar- 
ily at children. The environment allows users to create in- 
teractive stories, games, and animations [13]. Scratch blocks 
are designed to fit together in ways that make syntactic
sense, which generates the program logic. Users can use a
wide variety of pre-defined basic code blocks, such as When
Key Pressed and Move. Furthermore, programmers can use
additional blocks such as Pen Down and Language from existing
extensions such as ‘Pen’ and ‘Translate’, as well as
define custom blocks. The environment allows the use of
external data through importing images, music recordings,
captured voices, and user-specific graphics [18]. By using
Scratch, which is designed to enable creative expression in
terms of code, graphical and audio aspects, we can expand
the identification of creativity beyond the programming
aspect [3] [5] [12].


3. COMPUTING CREATIVITY IN SCRATCH 


In this section, we describe how we measure creativity of 
code and visual aspects of Scratch programs. 


A Scratch project consists of the project’s background, called
the stage, and the objects that appear on it, called sprites.
Figure 1 presents a sample project, a game called ‘Scratch in
Scratch’. As its name implies, it simulates the Scratch en- 
vironment. The player has to select a character and add 
block instances to a stack of blocks that control the char- 
acter’s behavior on the stage. The figure shows the white 
stage and different sprites (buttons, arrows and a cartoon 
character), as well as the graphics output area. Blocks are 
code elements that control the behavior of the stage and 
sprites [13]. When a sprite is selected, its blocks
are shown in the Code panel. Figure 1 (center) shows the
blocks that are connected to the cartoon cat, whose sprite 
is selected (e.g., blocks Hide and Show). 


Inspired by the creativity test of Torrance [23], we measure 
creativity on three scales: fluency, flexibility, and originality. 
Generally speaking, fluency refers to the amount of ideas in- 
volved in the Scratch project, flexibility to the diversity of 
ideas, and originality to the distance between a project and 
typical projects [10]. In the following, we describe how we 
compute these scales for code and visual aspects, respec- 
tively. For both aspects, our strategy is to first define a 
distance between building components (e.g. code blocks and 
images) and then compute fluency, flexibility, and originality 
based on that distance. 


Code Creativity. We represent the Scratch code as a collection
of syntax trees, one representing the stage, and one
for each sprite. Each syntax tree, in turn, consists of code
blocks. Figure 2 shows a graph of the blocks in Scratch,
where blocks are connected to the semantic sub-category
they belong to (like move or events) and sub-categories
are connected to categories, namely basic blocks, extension
blocks, and custom blocks. Let $\delta$ be the shortest-path
distance between blocks in this graph. More specifically, we
define $\delta$ as 0 for equal blocks (e.g., two Move blocks), as 1
if the blocks are different but within the same sub-category
of blocks (e.g., Move, Turn), and as 2 if the blocks are different
and from different sub-categories (e.g., Move, When
Key Pressed). To explain, we add one unit of distance for
the transition between the sub-categories and another for
the blocks being different.


For different categories, the distance between the pre-defined
blocks and the extension blocks is defined as 3, and the distance
between the pre-defined blocks and the custom blocks
is defined as 4. To explain, we add one unit of distance
for the blocks being different, a second unit for
the different sub-categories, and a third for the category
change. Since the Scratch environment provides
the pre-defined blocks by default, while custom blocks require the user
to build something new, we add an additional unit of distance
for custom blocks.
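
The distance rules above can be written down compactly. The following is a minimal sketch, not the authors' implementation: the `Block` class, its field names, and the sub-category labels are hypothetical, and the distance between an extension block and a custom block is left undefined because the text does not specify it.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Block:
    """Hypothetical description of a Scratch block for this sketch."""
    name: str         # e.g. "Move"
    subcategory: str  # e.g. "Motion" (label chosen for illustration)
    category: str     # "basic", "extension", or "custom"


def block_distance(a: Block, b: Block) -> int:
    """Distance between two blocks, following the rules stated in the text."""
    if a == b:
        return 0
    if a.category == b.category:
        # Same category: 1 within a sub-category, 2 across sub-categories.
        return 1 if a.subcategory == b.subcategory else 2
    kinds = {a.category, b.category}
    if kinds == {"basic", "extension"}:
        return 3
    if kinds == {"basic", "custom"}:
        return 4
    # Assumption: the text gives no value for extension vs. custom blocks.
    raise ValueError("distance between extension and custom blocks is not specified")
```

For example, under these assumptions Move and Turn are at distance 1, Move and When Key Pressed at distance 2, and Move and Pen Down at distance 3.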


Based on the distance, we define code fluency of a Scratch
project as the sum $\sum_x \delta(x, 0)$, where $x$ ranges over the code blocks
in the project and 0 is the gray zero node in Figure 2. The
distance $\delta(x, 0)$ to basic blocks (e.g., Move) is defined as
3, the distance $\delta(x, 0)$ to extension blocks (e.g., Pen-Up,
Language) is 4, and the distance $\delta(x, 0)$ to custom blocks is 5.
In other words, we assign higher fluency for the production
of non-existent components or the use of custom blocks that
require additional user effort. For example, in the ‘Scratch
in Scratch!’ program the Show block presented in Figure 1
is a basic block from the sub-category ‘Looks’. The program
gets 3 points for this block, and an overall fluency score of
1052.
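
As a small illustration of the fluency sum, the sketch below adds up the distance to the zero node for every block in a project, duplicates included. It is a sketch under assumptions: the function name and the category labels are hypothetical, and the distances are used unsquared, so the numbers it produces are not comparable to the scores reported for the example project.

```python
# Distance from each block category to the zero node, as stated in the text.
ZERO_DISTANCE = {"basic": 3, "extension": 4, "custom": 5}


def code_fluency(block_categories):
    """Sum of zero-node distances over all blocks in a project.

    `block_categories` holds one category label per block occurrence,
    so repeated blocks are counted every time they appear.
    """
    return sum(ZERO_DISTANCE[category] for category in block_categories)


# Two basic blocks, one extension block, and one custom block.
example = ["basic", "basic", "extension", "custom"]
print(code_fluency(example))  # 3 + 3 + 4 + 5 = 15
```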


²For mathematical reasons we square the distances, thus
yielding large numbers.

Figure 1: A screenshot of an example Scratch project. Left: A selection of possible code blocks. Center: The code blocks related
to the currently selected sprite (the cat). Top right: The current graphical output. Bottom right: An overview of all sprites
and the stage (i.e. the background).

Figure 2: A graph of code blocks in Scratch. We compute the
distance between blocks by their shortest path distance in the
graph.

To compute code flexibility, we remove all duplicated blocks,
then compute the sum of all pairwise distances $\sum_x \sum_y \delta(x, y)$
between code blocks in the project and divide by the number
of unique code blocks. This measures how different the
code blocks are, capturing the idea of flexibility as variability
of concepts [10]. For example, the project ‘Scratch
in Scratch!’ uses the blocks Hide and Show for each sprite.
This increases fluency but not flexibility. Further, these two
pre-defined blocks belong to the same sub-category ‘Looks’
and so $\delta(\text{Hide}, \text{Show}) = 1$. Overall, after normalizing by the
number of unique blocks in the program (58), the project obtained
a flexibility score of 395.37.
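
The flexibility computation can be sketched generically for any distance function. The code below is an illustration, not the authors' implementation: it assumes the double sum runs over all ordered pairs of unique items, uses unsquared distances, and relies on a hypothetical toy distance with made-up sub-category labels.

```python
from itertools import product


def flexibility(items, distance):
    """Sum of pairwise distances over unique items, divided by their count."""
    unique = list(dict.fromkeys(items))  # drop duplicates, keep first-seen order
    if not unique:
        return 0.0
    total = sum(distance(x, y) for x, y in product(unique, repeat=2))
    return total / len(unique)


# Hypothetical toy distance: 0 for equal blocks, 1 within a sub-category, 2 otherwise.
SUBCATEGORY = {"Hide": "Looks", "Show": "Looks", "Move": "Motion"}


def toy_distance(a, b):
    if a == b:
        return 0
    return 1 if SUBCATEGORY[a] == SUBCATEGORY[b] else 2


# Repeating Hide and Show does not change the result, only the unique blocks count.
print(flexibility(["Hide", "Show", "Hide", "Show", "Move"], toy_distance))
```

Because duplicates are removed first, repeating Hide and Show for every sprite leaves the score unchanged, which mirrors the observation that such repetition raises fluency but not flexibility.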


We define code originality as the average distance of a Scratch 
project to a sample of typical projects in our data set, i.e. 
a project that is more distant from typical projects is con- 
sidered more original. To compute the distance between 
projects, we follow the approach of Price et al. [22]. In 
particular, we use a three-step algorithm to construct an 
alignment between two Scratch projects. First, we compute 
the tree edit distance between the stage syntax trees 
of both programs. Then, we compute all pairwise tree edit 
distances between sprites in both programs. Finally, we feed 
this result into the Hungarian algorithm [15] to obtain an 
alignment between the sprite trees. This is because sprites 
in a Scratch project do not have a clear ordering, making an 
unordered representation more natural. 
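
The alignment step can be illustrated on a small cost matrix. This sketch is a stand-in, not the method used in the paper: it searches all permutations instead of running the Hungarian algorithm (a library routine such as SciPy's `linear_sum_assignment` would be the usual choice for larger inputs), it takes the pairwise tree edit distances as given, it assumes both projects have the same number of sprites, and it assumes the project distance is the stage distance plus the matched sprite distances. The function names are hypothetical.

```python
from itertools import permutations


def best_alignment(cost):
    """Return (assignment, total_cost) minimizing the summed cost.

    `cost[i][j]` is the distance between sprite i of one project and
    sprite j of the other. Brute force, so only suitable for tiny inputs.
    """
    n = len(cost)
    if any(len(row) != n for row in cost):
        raise ValueError("this sketch expects a square cost matrix")
    best_perm, best_cost = None, float("inf")
    for perm in permutations(range(n)):
        total = sum(cost[i][perm[i]] for i in range(n))
        if total < best_cost:
            best_perm, best_cost = perm, total
    return list(best_perm), best_cost


def project_distance(stage_distance, sprite_costs):
    """Assumed combination: stage distance plus aligned sprite distances."""
    _, sprite_total = best_alignment(sprite_costs)
    return stage_distance + sprite_total


costs = [[4, 1, 3],
         [2, 0, 5],
         [3, 2, 2]]
print(best_alignment(costs))       # ([1, 0, 2], 5)
print(project_distance(2, costs))  # 7
```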


While fluency and flexibility are based only on the program
itself, originality requires a reference sample of projects,
i.e. we need a reference point with respect to which a project
is original or not [20]. To illustrate the effect of the
reference set, we note that the originality of ‘Scratch in
Scratch!’ with respect to reference sets drawn from 3 different project
groups in the user study was 4488.84, 4168.89, and 2759,
respectively.


Visual creativity. To represent visuals, we first collect all
images (i.e. sprite and stage images) in our training data set
and feed them into a ResNet50 neural network. ResNet50
has been shown to generalize to diverse image processing tasks
and classifications [16]. Accordingly, we hope that the
representation of ResNet50 also helps to capture the semantic
distance between images for our case. The output is a
set of vectors, one for each image. To measure the distance
$\delta$ between images, we use the cosine distance

$$\delta(x, y) = 1 - \frac{x^\top y}{\|x\| \, \|y\|}$$

because it is invariant against effects of scale/size,
which would otherwise be a confounder in our data.
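
A short sketch of the cosine distance shows the scale invariance mentioned above. It operates on plain Python lists standing in for the image vectors; the function name is an assumption, and extracting the vectors with ResNet50 is outside the scope of this example.

```python
import math


def cosine_distance(x, y):
    """1 minus the cosine similarity of two equal-length vectors."""
    dot = sum(a * b for a, b in zip(x, y))
    norm_x = math.sqrt(sum(a * a for a in x))
    norm_y = math.sqrt(sum(b * b for b in y))
    if norm_x == 0 or norm_y == 0:
        raise ValueError("cosine distance is undefined for a zero vector")
    return 1.0 - dot / (norm_x * norm_y)


u = [1.0, 2.0, 3.0]
v = [2.0, 1.0, 0.0]
scaled_u = [10.0 * a for a in u]

# Rescaling one vector leaves the distance unchanged (up to rounding).
print(cosine_distance(u, v), cosine_distance(scaled_u, v))
```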


We compute visual fluency as the number of images in the
Scratch project, which is equivalent to the fluency definition
of Torrance [23]. For example, in the project ‘Scratch in
Scratch!’ we have 47 images and that is its fluency score.


To measure visual flexibility, we use the same approach as
for code flexibility, i.e. we compute the sum $\sum_x \sum_y \delta(x, y)$
of all pairwise distances over images in the project and then
divide by the number of images. To illustrate, the program
‘Scratch in Scratch!’ contains 2 similar button images, ‘Save’
and ‘Load’ (see Figure 1). The cosine distance between
these images is relatively small, $\delta(\text{‘Save’}, \text{‘Load’}) = 0.19$,
based on the vectors created by the ResNet50 network. The
flexibility score of the visual aspect of the project is 12.42.


We compute visual originality as the average distance of 
each image in a project to images in typical projects. To 
illustrate, the project ‘Scratch in Scratch!’ received an orig- 
inality score of 0.57, 0.57 and 0.58 when using 3 different 
reference sets. 


4. HUMAN CREATIVITY ASSESSMENT 


In this section, we describe a user study to collect expert 
evaluations of creativity of Scratch projects. The experts 
were Scratch instructors without prior knowledge in creativ- 
ity theory. Each expert was assigned a set of pre-selected 
Scratch projects and asked to separately evaluate the cre- 
ativity of projects according to four different aspects: code, 
visual, audio and idea behind the project, which was iden- 
tified in past work as an important factor in the creative 
process in Scratch [18]. 


We designed an online application to facilitate the rating 
process and to allow the experts to play and review each 
project as they see fit. The application was divided into 
three main screens. The Home Screen displays all of the 
projects that are assigned to an expert. When clicking on 
a project in the Home Screen, experts were able to see ad- 
ditional information about the project (e.g., the number of 
views and likes that the project received) and information 
about the user (e.g., country, date of registration in Scratch, 
and age if available) and also a link to the editor environ- 
ment for the project’s code and visuals and the embedded 
playable project. 


For each project, experts were asked to answer questions
that relate to the creativity of the four aspects of the Scratch
project. Questions relating to the visual aspects asked, for example, whether
the project contained images provided by Scratch or originally
created by the user. Experts were also asked to rate
the novelty/quality and effort put into the visual aspects of
the project. Questions relating to the project code asked experts to
evaluate the code complexity, efficiency and novelty, and
to rate the effort put into the code. Questions about the
project idea asked for a short description of the idea
and ratings for how much novelty and effort were required
for developing the idea. If the project included sounds, experts
were asked whether these sounds were recorded by the user,
imported, or provided by Scratch. Additionally, the experts
rated the novelty of the sounds and the effort invested
in the audio aspect.


Figure 3: Overall creativity assessment.

Table 1: Experts grading statistics of code, visual and final
creativity scores

Expert  Statistic  Code   Visual  Final score
1       Mean       69.55  75.85   67.59
        SD         24.04  24.97   21.31
2       Mean       66.75  67.70   66.89
        SD         13.11  14.28   13.80
3       Mean       70.75  77.65   65.30
        SD         10.18  10.50   11.89
4       Mean       72.90  83.40   76.11
        SD         24.00  20.11   17.78
5       Mean       64.60  68.55   63.52
        SD         15.27  13.15   13.17

Experts were also asked to provide a creativity score for each
of the aspects (0-100), shown in Figure 3, as well as
a weight (between 0 and 1, summing to 1) for each aspect
according to its subjective importance in determining the
creativity of a project. The creativity score of a given project
for an expert is computed as the weighted sum of the
creativity ratings for each aspect. The creativity score for
each project is shown in the Home Screen, allowing experts
to compare the scores and revise them at will.
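
The weighted combination of aspect ratings can be sketched as follows. The function name, the dictionary layout, and the example numbers are hypothetical and only illustrate the arithmetic; they are not ratings from the study.

```python
def overall_creativity(scores, weights):
    """Weighted sum of per-aspect creativity ratings (each rating in 0-100).

    `scores` and `weights` map aspect names to values; the weights are
    expected to lie in [0, 1] and to sum to 1.
    """
    if set(scores) != set(weights):
        raise ValueError("scores and weights must cover the same aspects")
    if abs(sum(weights.values()) - 1.0) > 1e-9:
        raise ValueError("weights must sum to 1")
    return sum(scores[aspect] * weights[aspect] for aspect in scores)


# Illustrative values only.
scores = {"code": 80, "visual": 90, "audio": 40, "idea": 70}
weights = {"code": 0.2, "visual": 0.3, "audio": 0.1, "idea": 0.4}
print(overall_creativity(scores, weights))  # approximately 75
```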


4.1 User Study 


We recruited 5 experts from 4 countries: Cuba, Vietnam, India
and Israel. All experts had at least two years of experience
teaching Scratch to students of different ages in schools
and after-school activities. We selected 45 unique projects
of different types (games and stories), created by different
users (ages ranging from 9 to 18, from 25 different countries,
and with different levels of experience, from 4 to 258 projects).
We uniformly sampled projects for each of the experts from
this set, so that there is a sufficient spread of creativity assessments
across projects, while still having some projects
rated by several experts. Four of the experts evaluated
20 projects, while one evaluated 10 projects.
Table 1 presents the statistics of the scores for code, visuals, 
and the final creativity scores provided by the experts. We 
note that the highest scores were provided by expert 4 and 
that this expert as well as expert 1 had the highest standard 
deviation across all aspects. Experts 2 and 5, on the other 
hand, gave relatively lower scores with a lower standard de- 
viation. 


4.2 Agreement between experts 

Experts differed widely in the creativity scores they assigned 
to projects. For example, experts 1, 2, and 3 all evaluated 
the project ‘Scratch in Scratch!’. Expert 1 gave this project 
a creativity score of 91 for the code aspect, while expert 2 
gave it a score of 67, and expert 3 gave a score of 82. 


We note that the low agreement between raters should not 
signify a mistake or lack of expertise. It reflects the fact 
that creativity assessment is largely subjective, and that ex- 
perts can differ about which aspects are more or less im- 
portant when measuring creativity, as we show in this sec- 
tion. To compensate for this, we measure agreement using 
the Kendall Rank Correlation Coefficient [1]. This measure
ranges from -1 (complete disagreement between rankings)
to 1 (perfect match) and is determined based on overlapping 
projects for each pair of experts. 
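
The pairwise agreement computation can be sketched as below. This is an illustration under assumptions: it implements the simple tau-a variant, in which tied pairs count as neither concordant nor discordant, because the text does not state which variant was used. The function name and the example scores are hypothetical.

```python
from itertools import combinations


def kendall_tau(scores_a, scores_b):
    """Kendall rank correlation over the projects rated by both experts.

    `scores_a` and `scores_b` map project ids to scores. Returns None when
    fewer than two projects overlap, since no pair can be compared.
    """
    shared = sorted(set(scores_a) & set(scores_b))
    if len(shared) < 2:
        return None
    concordant = discordant = 0
    for p, q in combinations(shared, 2):
        sign = (scores_a[p] - scores_a[q]) * (scores_b[p] - scores_b[q])
        if sign > 0:
            concordant += 1
        elif sign < 0:
            discordant += 1
    n_pairs = len(shared) * (len(shared) - 1) // 2
    return (concordant - discordant) / n_pairs


expert_a = {"p1": 90, "p2": 70, "p3": 50}
expert_b = {"p1": 80, "p2": 85, "p3": 60, "p4": 10}
print(kendall_tau(expert_a, expert_b))  # 2 concordant, 1 discordant: 1/3
```

With a single overlapping project the function returns `None`, which corresponds to the case where the coefficient cannot be calculated.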


Figure 4 displays for each pair of experts the number of overlapping
projects (in parentheses) and the Kendall-τ score.
Note that experts 2 and 4 had only one overlapping project,
therefore the Kendall-τ score cannot be calculated for them.
As shown in the figure, the highest agreement (Kendall-τ = 0.67)
is between experts 2 and 5. The other positive
agreement scores are much lower and vary between 0.2 and
0.33. Moreover, we see four pairs of experts with negative
Kendall-τ scores, with 2 of them including expert 4.


The experts with the highest agreement score (experts 2 and
5) also exhibited similar scores for code and visual aspects
(see Table 1), suggesting that they interpret creativity for
these aspects in similar ways. However, the same expert
5 commonly disagreed with expert 1 (Kendall-τ = -0.6).
Their scores and rankings of overlapping projects differed
substantially, suggesting that they differ in their interpretation
of creativity. For example, for the code aspect, the same
project was ranked 1st by expert 5 and 16th by expert 1.


We observe that most experts found the visual aspect more 
significant than the code aspect when evaluating creativity. 
For the majority of experts, the project idea was the most 
important aspect. By contrast, experts assigned low weights 
to audio aspects. We note that the project idea is very 
difficult to model computationally. This is an interesting 
avenue to explore in future work. 


5. PREDICTING CREATIVITY SCORES 


In this section we report on the design and evaluation of 
a computational model to predict the creativity scores of 
Scratch projects. We build an automatic tool to support 
teachers (and students) in Scratch that can be trained on 
examples taken from individual or multiple experts. 


Figure 4: Kendall-τ agreement between pairs of experts with
overlapping projects on the final creativity score. (The number
of overlapping projects is shown in parentheses.)

We use an XGBoost Regressor [4] to predict the expert creativity
scores for each project. As input features we used the
originality, flexibility, and fluency measures for both visual
and code aspects, as described in Section 3. This provided
us with 6 features for each instance (project). The reference
sample for computing originality included all of the projects
that the expert rated.


We created 2 types of XGBoost models: (1) a single rater
model trained on projects for each expert separately and (2)
a combined model trained on the projects from all experts
together. Note that for the combined model, projects that
were evaluated by more than one rater were treated as different
instances. For each type of model, we created three
different prediction models: (a) predicting the code creativity
score, (b) predicting the visual creativity score, and (c) predicting
the overall project score given by the weighted combination
of the visual and code scores. The features consisted of the originality,
flexibility and fluency for the code aspect (model a),
the visual aspect (model b), or both (model c).


The combined model and the single rater models were developed
using the official implementation of XGBoost. We
selected the hyperparameters based on the structure of our
data. We set the upper complexity limit of the model to 6
trees for the rater with 10 projects, 14 trees for the rest of the
raters, and 29 trees for the combined raters, and set the maximum
tree depth based on the number of features. The combined
model was evaluated using 10-fold cross-validation.
The single rater models were evaluated using 5 folds to ensure
that the test set contained at least two projects.
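
The fold sizes implied by this setup follow from simple arithmetic, which the sketch below makes explicit. It only partitions indices into contiguous folds and does not shuffle or train anything; the function name is hypothetical, and the figure of 90 instances for the combined model is an inference from four experts rating 20 projects and one rating 10, with each rating treated as a separate instance.

```python
def kfold_indices(n_items, n_folds):
    """Split range(n_items) into n_folds contiguous test folds of near-equal size."""
    base, extra = divmod(n_items, n_folds)
    folds, start = [], 0
    for fold in range(n_folds):
        size = base + (1 if fold < extra else 0)  # spread the remainder over the first folds
        folds.append(list(range(start, start + size)))
        start += size
    return folds


# Test-set sizes per fold for the settings described in the text.
print([len(f) for f in kfold_indices(20, 5)])   # expert with 20 projects: 4 per fold
print([len(f) for f in kfold_indices(10, 5)])   # expert with 10 projects: 2 per fold
print([len(f) for f in kfold_indices(90, 10)])  # combined model (assumed 90 instances): 9 per fold
```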


Table 2: Kendall-τ score between XGBoost Regressor and
expert scores: code creativity score, visual creativity score,
and the weighted visual and code score

          Kendall-τ
Expert    Code  Visual  Weighted visual and code
1         0.52  0.52    0.42
2         0.51  0.42    0.42
3         0.53  0.58    0.36
4         0.46  0.52    0.57
5         0.52  0.42    0.50
Combined  0.43  0.44    0.42

Because of the high degree of variance between the raters, we
do not seek to minimize error with respect to the predicted
creativity score. Instead, we compare rankings. Ideally, we
would compare the rankings of the projects in the test set
with the true rankings for each expert. However, the size
of the test set for some folds and some experts was
small (4 projects for most experts). To increase the number
of comparisons for Kendall-τ, we built a complete ranking
over all projects for an expert, by combining the predicted
scores of projects in the test set with the scores of projects
in the training set. However, we compute the Kendall-τ
agreement only for project pairs with at least one project in
the test set. We make a similar computation
for the Kendall-τ of the combined model.
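
The restricted agreement computation described above can be sketched as follows. It is an illustration under assumptions: the function and argument names are hypothetical, ties are counted as neither concordant nor discordant, and the example scores are invented. The mixed ranking uses the predicted score for test projects and the expert's own score for training projects.

```python
from itertools import combinations


def kendall_tau_test_pairs(true_scores, mixed_scores, test_ids):
    """Kendall agreement over pairs that contain at least one test project.

    `true_scores` holds the expert's scores for all projects.
    `mixed_scores` holds predictions for test projects and the expert's
    scores for training projects.
    """
    test_ids = set(test_ids)
    concordant = discordant = n_pairs = 0
    for p, q in combinations(sorted(true_scores), 2):
        if p not in test_ids and q not in test_ids:
            continue  # both projects are training projects, skip the pair
        n_pairs += 1
        sign = (true_scores[p] - true_scores[q]) * (mixed_scores[p] - mixed_scores[q])
        if sign > 0:
            concordant += 1
        elif sign < 0:
            discordant += 1
    return (concordant - discordant) / n_pairs if n_pairs else None


true_scores = {"a": 90, "b": 70, "c": 50, "d": 30}
mixed_scores = {"a": 90, "b": 70, "c": 50, "d": 60}  # "d" is predicted
print(kendall_tau_test_pairs(true_scores, mixed_scores, {"d"}))  # (2 - 1) / 3
```

Skipping pairs of two training projects avoids inflating the agreement, since those pairs would trivially agree with the expert's own scores.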


Table 2 presents the Kendall-τ performance, computed as
described above. As seen from the table, when predicting
the creativity score for visual and code aspects, we achieve
a Kendall-τ score of 0.51 and above for 3 out of 5 experts.
When predicting the weighted creativity score, we achieve
a Kendall-τ score of 0.42 or above for 4 out of 5 experts. Overall,
the agreement measure is higher than the
agreement between the experts themselves that is reported
in Figure 4 (except for the pair 2 and 5).


The bottom row in Table 2 presents the results for the XGBoost
model that is trained over the combined set for all experts.
In all cases we achieve a Kendall-τ score of 0.42 or above,
which is higher than the agreement scores for most
pairs of experts. For the code aspect, the combined model is
less successful than the individual models. In contrast, for
visual creativity, the combined model is better than the
single rater model for two of the experts (experts 2 and 5);
for the weighted visual and code creativity score, the combined
model is at least as good as the single rater model for 3
experts (experts 1, 2 and 3). This suggests that our models
can define useful rules for aggregating creativity rankings by
different experts despite the disagreement between them.


6. DISCUSSION AND CONCLUSION 


In this paper we presented a formalization of creativity in 
terms of fluency, flexibility, and originality. We automatically
computed creativity for both the code and the visual aspects
of Scratch projects, and we intend to cover the other
modalities of that environment in future work.
Further, we set up a web application to rate the creativity 
in Scratch projects independently of our formalization. Fi- 
nally, we recorded the ratings of five human experts on 45 
Scratch projects via this application. 


We observed that human raters tend to disagree on which
projects are creative and which are not. Still, we were able
to train regression forests, which achieved a higher ranking
agreement with the human raters than they achieved with
each other, and which only used the automatically generated
ratings as input. We observed that the regression forest
model could further improve its accuracy when being applied
to individual experts instead of their shared data.


Our approach makes a step towards supporting teachers’
abilities to detect and support creative outcomes in students’
work. Ample future work is still to be done. In particular,
we plan to analyze creativity ratings over time,
thus tracking students’ creative learning process. Future
work will also need to address how an automatic assessment
of creativity can support creativity at scale in technological
environments, taking into account different subjective
interpretations.
ABSTRACT 


Peer assessment has been widely applied across diverse aca- 
demic fields over the last few decades, and has demonstrated 
its effectiveness. However, the advantages of peer assess- 
ment can only be achieved with high-quality peer reviews. 
Previous studies have found that high-quality review com- 
ments usually comprise several features (e.g., contain sug- 
gestions, mention problems, use a positive tone). Thus, re- 
searchers have attempted to evaluate peer-review comments 
by detecting different features using various machine learn- 
ing and deep learning models. However, there is no single 
study that investigates using a multi-task learning (MTL) 
model to detect multiple features simultaneously. This pa- 
per presents two MTL models for evaluating peer-review 
comments by leveraging the state-of-the-art pre-trained lan- 
guage representation models BERT and DistilBERT. Our 
results demonstrate that BERT-based models significantly 
outperform previous GloVe-based methods by around 6% in 
F1-score on tasks of detecting a single feature, and MTL 
further improves performance while reducing model size. 


Keywords 
Peer assessment, peer feedback, automated peer-assessment 
evaluation, text analytics, educational data mining 


1. INTRODUCTION 


Peer assessment is a process by which students give feedback 
on other students’ work based on a rubric provided by the 
instructor [20, 24]. This assessment strategy has been widely 
applied across diverse academic fields, such as computer sci- 
ence [28], medicine [27], and business [1]. Furthermore, mas- 
sive open online courses (MOOCs) commonly use peer as- 
sessment to provide feedback to students and assign grades. 
There is abundant literature [7, 24, 25, 11] demonstrating 
the efficacy of peer assessment. For example, Doubling et al. 
[7] conducted a meta-analysis of 54 controlled experiments 
for evaluating the effect of peer assessment across subjects 
and domains. The results indicate that peer assessment is 
more effective than teacher assessment, and also remarkably 
robust across a wide range of contexts [7]. 


However, low-quality peer reviews are a persistent problem 
in peer assessment, and considerably weaken the learning 
effect [17, 22]. The advantages of peer assessment can only 
be achieved with high-quality peer reviews [14]. This sug- 
gests that peer reviews should not be simply transmitted 
to other students but rather should be vetted in some way. 
Course staff could check the quality of each review comment¹
and assess its credibility manually, but this is not efficient. 
Sometimes (e.g., for MOOCs), this is not remotely possi- 
ble. Therefore, to ensure the quality of peer reviews and 
the efficiency of evaluating their quality, the peer-assessment 
platform should be capable of assessing peer reviews auto- 
matically. We call this Automated Peer-Review Evaluation. 


Previous research has determined that high-quality review 
comments usually comprise several features [14, 2, 25]. Examples
of such features are “contains suggestions”, “mentions
problems”, “uses a positive tone”, “is helpful”, and “is localized”
[14]. Thus, one feasible and promising way to evaluate
peer reviews automatically is to adjudicate the quality
of each review comment based on whether it comprises the 
predetermined features, by treating this task as a text classi- 
fication problem. If a peer-review comment does not contain 
some of the features, the peer-assessment platform could 
suggest that the reviewer should revise the review comment 
to add missing features. Additionally, containing sugges- 
tions, mentioning problems, and using a positive tone, are 
among the most essential features. Thus, we use them for 
this study. 


Previous work for automatically evaluating review comments 
has focused on tasks that detect a single feature. For exam- 
ple, Xiong and Litman [33] designed sophisticated features 
and used traditional machine-learning methods for identify- 
ing peer-review helpfulness. Zingle et al. [37] utilized differ- 
ent rule-based, machine-learning, and deep-learning meth- 
ods for detecting suggestions in peer-review comments. How- 
ever, to the best of our knowledge, no single study exists 
that investigates using a multi-task learning (MTL) model 
to detect multiple features simultaneously (as illustrated in 
Figure 1), although extensive research has been carried out on
the topic of automated peer-review evaluation (e.g., [34, 32,
33, 31, 30, 37, 13, 19, 8]).

¹In some peer-assessment systems, reviews are “holistic”. In
others, including the systems we are studying, each review
contains a set of review comments, and each comment gives a
response to a different criterion in the rubric.

Figure 1: Illustration of the single-task and multi-task learning settings


There are at least two motivations for using multi-task learning
(MTL) to detect features simultaneously. Firstly, the
problem naturally lends itself to MTL, because multiple
features usually need to be employed for a comprehensive
and precise evaluation of peer-review comments. If we treat
this MTL problem as multiple independent single tasks, to- 
tal model size and prediction time will increase by a factor of 
the number of features used for evaluating review comments. 
Secondly, MTL can increase data efficiency. This implies 
that learning tasks jointly can lead to performance improve- 
ment compared with learning them individually, especially 
when training samples are limited [5, 36]. More specifically, 
MTL can be viewed as a form of inductive transfer learn- 
ing, which can help improve the performance of each jointly 
learned task by introducing an inductive bias [3]. 


Additionally, the pre-trained language model, BERT (Bidi- 
rectional Encoder Representations from Transformers) [6], 
has become a standard tool for reaching the state of the art 
in many natural language processing (NLP) tasks. BERT 
can significantly reduce the need for labeled data. There- 
fore, we propose multi-task learning (MTL) models for eval- 
uating review comments by leveraging the state-of-the-art 
pre-trained language representation models BERT and Dis- 
tilBERT. We first compare a BERT-based single-task learn- 
ing (STL) model with the previous GloVe-based STL model. 
We then propose BERT- and DistilBERT-based MTL models
that learn the different tasks jointly.


The rest of the paper is organized as follows: Section 2 
presents related work. Section 3 describes the dataset used 
for this study. The proposed single-task and multi-task text 
classification models are elaborated in Section 4. Section 5 
details the experimental setting and results. In Section 6, we 
conclude the paper, mention the limitations of our research, 
and discuss future work. 


2. RELATED WORK 


2.1 Automated Peer-Review Evaluation 

The earliest study on automated peer-review evaluation was
performed by Cho in 2008 [4]. Cho manually broke down
every peer-review comment into review units (self-contained
messages in each review comment) and then coded them
as praise, criticism, problem detection, or solution suggestion.
Cho [4] then used traditional machine-learning methods,
including naive Bayes, support vector machines (SVM), and
decision trees, to classify the review units.


Xiong et al. attempted to use features (e.g., counts of nouns, 
verbs) derived from regular expressions and dependency parse 
trees and rule-based methods to detect localization in the 
review units [32]. Then, they designed more sophisticated 
features by combining generic linguistic features mined from 
review comments and specialized features, and used SVM to 
identify peer-review helpfulness [33]. After that, Xiong et al.
upgraded their models to the comment level, using the whole
review comment instead of review units as the input [15, 16].


Researchers then started to use deep neural networks for
automated peer-review evaluation to improve accuracy.
Zingle et al. compared rule-based, machine-learning,
and deep neural-network methods for detecting suggestions
in peer assessments, and the results showed that deep-learning
methods outperformed the traditional methods [37]. Xiao
et al. collected around 20,000 peer-review comments and 
leveraged different neural networks to detect problems in 
peer assessments [31]. 


2.2 Multi-Task Learning 


Multi-task learning (MTL) is an important subfield of
machine learning in which multiple tasks are learned
simultaneously [35, 5, 3] to help improve the generalization
performance of all the tasks. A task is defined as {p(x), p(y|x), L},
where p(x) is the input distribution, p(y|x) is the distribution
over the labels given the inputs, and L is the loss function.
For the MTL setting in this paper, all tasks have the
same input distribution p(x) and loss function L, but different
distributions over the labels given the inputs, p(y|x).


In the context of deep learning, all methods of MTL can 
be partitioned into two groups: hard-parameter sharing and 
soft-parameter sharing [3]. For hard-parameter sharing, the 
hidden layers are shared between all tasks while keeping sev- 
eral task-specific output layers. For soft-parameter sharing, 
each task has its independent model, but the distance be- 
tween the different models’ parameters is regularized. For 
this study, we use the hard-parameter sharing approach. 
Table 1: Sample Rubric Criteria


Does the design incorporate all of the functionality required? 
Have the authors converted all the cases discussed in the test plan into automated tests? 
Does the design appear to be sound, following appropriate principles and using appropriate patterns? 


Table 2: Sample Data


Peer-Review Comment (lower-cased) | Sugg. | Prob. | Tone
lots of good background details is given but the testing and implementation sections are missing. | 0 | 1 | 1
the explanation is clear to follow but it could also include some explanation of the use cases. | 1 | 0 | 1
only problem statement is explained and nothing about design. please add design and diagrams. | 1 | 1 | 0


3. DATA 


3.1 Data Source: Expertiza

The data in this study is collected from an NSF-funded 
peer-assessment platform, Expertiza. In this flexible peer- 
assessment system, students can submit their work and peer- 
review the learning objects (such as articles, code, and web- 
sites) of other students [9]. This platform supports multi- 
round peer review. In the assignments that provided the 
review comments for this study, two rounds of peer review 
(and one round of meta-review) were used: 


1. The formative-feedback phase: For the first round of 
review, students upload substantially complete projects. 
The system then assigns each student to review a set 
number of these submissions, based on a rubric pro- 
vided. Sample rubric criteria are provided in Table 
1. 


2. The summative-feedback phase: After students have 
had an opportunity to revise their work based on feed- 
back from their peers, final deliverables are submit- 
ted and peer-reviewed using a summative rubric. The 
rubric may include criteria such as “How well has the 
team has addressed the feedback given in the first re- 
view round?”. Many criteria in the rubric ask review- 
ers to provide a numeric rating as well as a textual 
comment. 


3. The meta-review phase: After the grading period is 
over, course staff typically assess and grade the reviews 
provided by students. 


For this study, all textual responses to the rubric crite- 
ria from the formative-feedback phase and the summative- 
feedback phase of a graduate-level software-engineering course 
are extracted to constitute the dataset. Each response to a 
rubric criterion constitutes a peer-review comment. All re- 
sponses from one student to a set of criteria in a single rubric 
are called a peer review or a review. In this study, we fo- 
cus on evaluating each peer-review comment. After filtering 
out review comments that only contain symbols and special 
characters, the dataset consists of 12,053 review comments. 
In the future, we will update the platform so that such
comments are rejected by the system directly.


3.2 Annotation Process 

One annotator who is a fluent English speaker and familiar
with the course context annotated the dataset. For quality
control, 100 reviews were randomly sampled from the
dataset and labeled by a second annotator who is also a
fluent English speaker and familiar with the course context.
The inter-annotator agreement between the two annotators
was measured by Cohen’s κ coefficient, which is generally
thought to be a more robust measure than a simple
percent-agreement calculation [12]. Cohen’s κ coefficient for
each label is shown in Table 3. The result suggests that the
two annotators had almost perfect agreement (>0.81) [12].
Sample annotated comments are provided in Table 2.


https://github.com/expertiza/expertiza


Table 3: Inter-Annotator Agreement (Cohen’s κ)

Label | Suggestion | Problem | Tone | Average
Cohen’s κ | 0.92 | 0.84 | 0.87 | 0.88
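As an illustration of the agreement statistic reported in Table 3, the following minimal sketch computes Cohen’s κ from two annotators’ labels. The function name and the ten example labels are hypothetical and are not taken from the dataset; the sketch only shows how observed agreement is corrected for chance agreement.

```python
from collections import Counter

def cohens_kappa(labels_a, labels_b):
    """Cohen's kappa for two annotators labelling the same items."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("need two non-empty label lists of equal length")
    n = len(labels_a)
    # Observed agreement: fraction of items given the same label by both.
    p_o = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Chance agreement: sum over labels of the product of marginal rates.
    count_a, count_b = Counter(labels_a), Counter(labels_b)
    p_e = sum(count_a[k] * count_b[k] for k in count_a) / (n * n)
    if p_e == 1.0:
        return 1.0  # both annotators used one identical label throughout
    return (p_o - p_e) / (1.0 - p_e)

# Made-up binary "suggestion" labels for ten comments (not from the dataset).
annotator_1 = [1, 0, 0, 1, 1, 0, 0, 0, 1, 0]
annotator_2 = [1, 0, 0, 1, 0, 0, 0, 0, 1, 0]
kappa = cohens_kappa(annotator_1, annotator_2)
```

In this toy case the annotators agree on 9 of 10 items, but κ is lower than 0.9 because part of that agreement is expected by chance.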


We define each feature (label) in the context of automated 
peer-review evaluation as follows: 


Suggestion: A comment is said to contain a suggestion if it 
mentions how to correct a problem or make improvements. 


Problem: A comment is said to detect problems if it points 
out something that is going wrong in peers’ work. 


Positive Tone: A comment is said to use a positive tone if it 
has an overall positive semantic orientation. 


3.3 Statistics on the Dataset 


The minority class for each label includes more than 20% 
of samples, and thus the dataset is mildly imbalanced. It 
consists of 12,053 peer-review comments, and the average 
number of words for each peer-review comment is 29. We 
found that most students (over three-quarters) use a pos- 
itive tone in their peer-review comments. Around half of 
the review comments mention problems with their peers’ 
work, but only one-fifth of review comments give suggestions.
Characteristics of the dataset are shown in Table 4
below.


Table 4: Statistics on the Dataset


Label | Class | %samples | avg. #words | max #words
Sugg. | 0 | 79.2% | 22 | 922
 | 1 | 20.8% | 58 | 1076
Prob. | 0 | 56.7% | 22 | 479
 | 1 | 43.3% | 38 | 1076
Pos. Tone | 0 | 22.2% | 28 | 1040
 | 1 | 77.8% | 29 | 1076
4. METHODOLOGY

In this section, we first briefly introduce Transformer [26], 
BERT [6], and DistilBERT [21]. Then we describe BERT 
and DistilBERT based single-task and multi-task models. 


4.1 Transformer 

In 2017, Vaswani et al. published a groundbreaking paper, 
“Attention is all you need,” and proposed an architecture 
called Transformer, which significantly improved the perfor- 
mance of sequence-to-sequence tasks (e.g., machine trans- 
lation) [26]. The Transformer is entirely built upon self- 
attention mechanisms without using any recurrent or con- 
volutional layers. As shown in Figure 2, the Transformer 
consists of two parts: the left part is an encoder, and the 
right part is a decoder. The encoder block takes a batch of 
sentences represented as sequences of word IDs. Then the 
sequences pass through an embedding layer, and the posi- 
tional embedding adds positional information of each word. 


Figure 2: Architecture of the Transformer [26]


We briefly describe the encoder block here, since BERT
reuses it. Each encoder block consists of two layers: a multi-head
attention layer and a feed-forward layer. The multi-head 
attention layer uses the self-attention mechanism, which en- 
codes each word’s relationship with every other word in the 
same sequence, paying more attention to the most relevant 
ones. For example, the output of this layer for the word 
“like” in the sentence, “we like the Educational Data Mining 
conference 2021!” will depend on all the words in the sen- 
tence. However, it will probably pay more attention to the 
word “we” than to the words “data” or “mining.” 
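To make the self-attention idea concrete, the following is a minimal single-head sketch of scaled dot-product attention in plain Python. It is a simplification under stated assumptions: the token vectors are made-up 2-d values, and the same vectors are used as queries, keys, and values, whereas the Transformer applies learned linear projections and several heads in parallel.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of scores."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def self_attention(queries, keys, values):
    """Single-head scaled dot-product attention over one sequence."""
    d_k = len(keys[0])
    outputs, weights = [], []
    for q in queries:
        # Similarity of this token's query to every token's key.
        scores = [dot(q, k) / math.sqrt(d_k) for k in keys]
        w = softmax(scores)
        weights.append(w)
        # Output is the attention-weighted average of the value vectors.
        outputs.append([sum(wi * v[j] for wi, v in zip(w, values))
                        for j in range(len(values[0]))])
    return outputs, weights

# Toy 2-d vectors for three tokens (hypothetical values, not learned embeddings).
tokens = ["we", "like", "data"]
x = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0]]
out, attn = self_attention(x, x, x)  # queries = keys = values for simplicity
```

Each row of `attn` sums to one, and with these toy vectors the row for "like" puts more weight on "we" than on "data", mirroring the example in the text.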


4.2 BERT 


BERT is a state-of-the-art pre-trained language representa- 
tion model proposed by Devlin et al. [6]. It has advanced 
the state-of-the-art results in many NLP tasks and signif- 
icantly reduced the need for labeled data by pre-training 
on unlabeled data over different pre-training tasks. The
BERT-base model used in this study consists of 12 encoder
blocks of the Transformer model. The input representation is
constructed by summing the corresponding token and positional
embeddings. The length of the output sequence is the same as
the input length, and each input token has a corresponding
representation in the output. The output of the first token,
[CLS] (a special token added to the sequence), is used
as the aggregate representation of the input sequence for
classification tasks [6].


The BERT framework consists of two steps: pre-training 
and fine-tuning. During pre-training, the model is trained 
on unlabeled data, BooksCorpus (800M words) and En- 
glish Wikipedia (2,500M words), over two pre-training tasks, 
Masked language model (MLM) and Next sentence predic- 
tion (NSP). For fine-tuning, the BERT model is first ini- 
tialized with the pre-trained parameters, and then all of 
the parameters are fine-tuned using labeled data from the 
downstream tasks (e.g., text classification). For this study, 
we use the HuggingFace pre-trained BERT to initialize the
models and then fine-tune them with annotated peer-review
comments for automated peer-review evaluation tasks. 


4.3 DistilBERT


Although BERT has shown remarkable improvements across
various NLP tasks and can be easily fine-tuned for downstream
tasks, one main drawback of BERT is that it is very
compute-intensive (it has roughly 110M parameters).
Therefore, researchers have applied different methods for
compressing BERT, including pruning, quantization, and
knowledge distillation [10]. One of the compressed BERT
models is DistilBERT [21], which is compressed from BERT by
leveraging the knowledge distillation technique during the
pre-training phase. The authors [21] demonstrated that
DistilBERT has 40% fewer parameters and is 60% faster
than the original BERT while retaining 97% of its
language-understanding capabilities. We investigate whether
DistilBERT lets us reduce model size while retaining
performance on our task.


4.4 Input Preparation 

Text Preprocessing: First, URL links in peer-review comments
are removed. Then, we lowercase all comments and
use a spellchecker API to correct typos and misspellings.
Finally, two special tokens ([CLS], [SEP]) are added to each
review comment, as required for BERT. The [CLS] token
is added to the beginning of each review comment for
classification tasks. The [SEP] token is added at the end of
each review comment.
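The preprocessing steps above can be sketched as follows. This is an illustrative approximation, not the pipeline used in the study: the regular expression and the `preprocess` helper are assumptions, the example comment is made up, and the spell-correction step is left out so that the sketch has no third-party dependencies.

```python
import re

# Assumed URL pattern: anything starting with http://, https:// or www.
URL_PATTERN = re.compile(r"(?:https?://|www\.)\S+")

def preprocess(comment):
    """Strip URLs, lower-case, and wrap the comment in BERT's special tokens."""
    text = URL_PATTERN.sub(" ", comment)
    text = " ".join(text.lower().split())  # collapse leftover whitespace
    # Spelling correction would be applied here; it is omitted in this sketch.
    return f"[CLS] {text} [SEP]"

example = preprocess("Please add Design diagrams, see https://example.com/uml for ideas.")
```

In practice, a BERT tokenizer usually inserts [CLS] and [SEP] itself; they are added by hand here only to mirror the description.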


Subword Tokenization: The tokenizer used for BERT is a 
subword tokenizer called “WordPiece” [29]. Traditional word
tokenizers suffer from the out-of-vocabulary (OOV) word
problem, which a subword tokenizer can alleviate: it splits
a text into subwords, which are then converted to token IDs.
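The following sketch illustrates how a subword tokenizer avoids out-of-vocabulary words, using the greedy longest-match-first strategy commonly associated with WordPiece. The seven-entry vocabulary and the `wordpiece` helper are hypothetical; the vocabulary of bert-base-uncased has roughly 30,000 entries.

```python
def wordpiece(word, vocab, unk="[UNK]"):
    """Greedy longest-match-first split of one word into subword tokens."""
    pieces, start = [], 0
    while start < len(word):
        end = len(word)
        match = None
        while start < end:
            piece = word[start:end]
            if start > 0:
                piece = "##" + piece  # continuation pieces carry a '##' prefix
            if piece in vocab:
                match = piece
                break
            end -= 1
        if match is None:
            return [unk]  # no piece fits: the whole word maps to unknown
        pieces.append(match)
        start = end
    return pieces

# Tiny hypothetical vocabulary mapping subwords to token IDs.
vocab = {"[UNK]": 0, "test": 1, "##ing": 2, "##s": 3, "un": 4, "##test": 5, "##ed": 6}
tokens = wordpiece("untested", vocab)
token_ids = [vocab[t] for t in tokens]
```

Although "untested" is not in the vocabulary as a whole word, it is still represented by three known subwords instead of a single unknown token.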


Input Representation: The token IDs are padded or truncated
to a length of 100 for each sequence and then passed through a
trainable embedding layer to be converted to token embeddings.
The input representation for BERT is constructed by
summing the token embeddings and positional embeddings.
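As a small illustration of the padding and truncation step, the sketch below brings a list of token IDs to a fixed length and builds the matching attention mask. The helper name, the example IDs, and the use of 0 as the padding ID are assumptions made for the example.

```python
def pad_or_truncate(token_ids, max_len=100, pad_id=0):
    """Return (ids, mask), both of length max_len; mask is 1 for real tokens."""
    ids = list(token_ids[:max_len])
    n_pad = max_len - len(ids)
    mask = [1] * len(ids) + [0] * n_pad
    return ids + [pad_id] * n_pad, mask

# A short made-up sequence is padded; a long one is cut to max_len.
short_ids, short_mask = pad_or_truncate([11, 12, 13, 14], max_len=6)
long_ids, long_mask = pad_or_truncate(list(range(1, 151)), max_len=100)
```

The mask lets the model ignore padded positions when computing attention.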


https://huggingface.co/bert-base-uncased
https://huggingface.co/distilbert-base-uncased
https://pypi.org/project/pyspellchecker/
Figure 3: BERT and DistilBERT based single-task and multi-task learning architectures


4.5 Single-Task and Multi-Task Models 

As mentioned in the BERT paper [6] and other studies [23], 
the pre-trained BERT model can be fine-tuned with just one 
additional output layer to create state-of-the-art models for 
a wide range of tasks, including text classification. Therefore,
only one dense layer is added on top of the original
BERT or DistilBERT model and used as a binary classifier
for the single-task learning models. Three dense layers are
added to the multi-task learning models, one for each label.
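A hard-parameter-sharing head layout of this kind can be sketched as follows. This is a toy illustration, not the implementation used in the study: the shared encoder is replaced by a random placeholder vector, the hidden size is 8 instead of 768, and each head is assumed to be a single logit followed by a sigmoid.

```python
import math
import random

TASKS = ("suggestion", "problem", "positive_tone")

def make_head(hidden_size, rng):
    """One task-specific dense layer: a weight vector plus a bias."""
    return {"w": [rng.uniform(-0.1, 0.1) for _ in range(hidden_size)], "b": 0.0}

def head_probability(head, cls_vector):
    """Dense layer followed by a sigmoid, giving P(label = 1)."""
    logit = sum(w * x for w, x in zip(head["w"], cls_vector)) + head["b"]
    return 1.0 / (1.0 + math.exp(-logit))

def mtl_forward(cls_vector, heads):
    """Every head reads the same shared [CLS] vector (hard-parameter sharing)."""
    return {task: head_probability(heads[task], cls_vector) for task in TASKS}

rng = random.Random(0)
hidden_size = 8  # stand-in for the encoder's hidden size of 768
heads = {task: make_head(hidden_size, rng) for task in TASKS}
# Placeholder for the encoder's [CLS] output; a real model would compute it.
cls_vector = [rng.uniform(-1, 1) for _ in range(hidden_size)]
probs = mtl_forward(cls_vector, heads)
```

The encoder is run once per comment and only the small heads are task-specific, which is why the multi-task model is roughly the size of a single single-task model.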


5. EXPERIMENTS AND RESULTS 


In this section, we first introduce training details and eval- 
uation metrics and then show experimental results. 


5.1 Training 

Train/Test split: We find by experiments that increasing 
training size does not help the classifier when the number of 
training samples is over 5000. Therefore, 5000/2053/5000 
data samples are used for training/validation/testing. 


Loss Functions: For the BERT- and DistilBERT-based single-task
learning (STL) models, the cross-entropy loss is used.
For the BERT- and DistilBERT-based multi-task learning (MTL)
models, the cross-entropy loss is used for each task, and the
total loss is the sum of the cross-entropy losses of all
tasks.


Cost-Sensitive method: As mentioned in Section 3.3, the 
dataset is mildly imbalanced (minority class > 20%). Thus, 
a cost-sensitive method is used in this study for alleviating 
the problem of class imbalance and improving performance, 
by weighting the cross-entropy loss function during training 
based on the frequency of each class in the training set. 
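The summed multi-task loss and the cost-sensitive weighting can be illustrated with the sketch below. The exact weighting formula is not given in the text, so the inverse-frequency scheme n / (2 × count) is an assumption, as are the helper names and the toy labels; the sketch also assumes every class appears at least once in the training labels.

```python
import math

def class_weights(labels):
    """Inverse-frequency weights, n / (num_classes * count), for binary labels."""
    n = len(labels)
    counts = {c: labels.count(c) for c in (0, 1)}
    return {c: n / (2 * counts[c]) for c in (0, 1)}

def weighted_bce(prob_positive, label, weights):
    """Cross-entropy for one example, scaled by the weight of its true class."""
    p = prob_positive if label == 1 else 1.0 - prob_positive
    return -weights[label] * math.log(p)

def multitask_loss(probs, labels, weights):
    """Total loss for one comment: the sum of the per-task losses."""
    return sum(weighted_bce(probs[t], labels[t], weights[t]) for t in probs)

# Toy training labels for two tasks (hypothetical, not from the dataset).
train_labels = {"suggestion": [0, 0, 0, 0, 1], "problem": [0, 1, 0, 1, 1]}
weights = {t: class_weights(y) for t, y in train_labels.items()}
loss = multitask_loss({"suggestion": 0.8, "problem": 0.6},
                      {"suggestion": 1, "problem": 0}, weights)
```

With this scheme, errors on the rarer class (here, comments that contain a suggestion) contribute more to the loss than errors on the majority class.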


Hyperparameters: As we mentioned in Section 4.2 and Sec- 
tion 4.3, we use HuggingFace pre-trained BERT and Distil- 
BERT to initialize models. The hidden size for BERT and 
DistilBERT is 768. We then fine-tune the BERT and Dis- 
tilBERT based single-task learning and multi-task learning 
models with a batch size of 32, a maximum sequence length of 100,
a learning rate of 2e-5, 3e-5, or 5e-5, 2 or 3 epochs, a dropout
rate of 0.1, and the Adam optimizer with β1=0.9 and β2=0.99.


5.2 Evaluation Metrics 

We use accuracy, macro-F1 score (averaged over the classes of
each label rather than over labels), and AUC (Area Under
the ROC Curve) to evaluate models. Since the dataset is only
mildly imbalanced, accuracy can still be a useful metric.
Macro-F1 is used instead of the F1-score for the positive class,
since both the positive class and the negative class of each
label are important for our task. For this study, we mainly use
accuracy and macro-F1 to compare different models.
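As a worked example of the difference between accuracy and macro-F1, the sketch below computes both for one binary label. The helper names and the ten example predictions are made up for illustration.

```python
def f1_for_class(y_true, y_pred, cls):
    """F1-score that treats `cls` as the positive class."""
    tp = sum(t == cls and p == cls for t, p in zip(y_true, y_pred))
    fp = sum(t != cls and p == cls for t, p in zip(y_true, y_pred))
    fn = sum(t == cls and p != cls for t, p in zip(y_true, y_pred))
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0

def macro_f1(y_true, y_pred, classes=(0, 1)):
    """Unweighted mean of the per-class F1-scores."""
    return sum(f1_for_class(y_true, y_pred, c) for c in classes) / len(classes)

# Made-up labels: two positives, eight negatives.
y_true = [1, 1, 0, 0, 0, 0, 0, 0, 0, 0]
y_pred = [1, 0, 0, 0, 0, 0, 0, 0, 0, 1]
score = macro_f1(y_true, y_pred)
accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
```

Here the F1-score is 0.5 for class 1 and 0.875 for class 0, so macro-F1 is 0.6875 while accuracy is 0.8; macro-F1 penalizes the weak minority-class performance that accuracy partly hides.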


5.3 Results 


Table 5 shows the performance of all models when train- 
ing with a different number of training samples (1K, 3K, 
and 5K). The first column indicates the models (GloVe, 
BERT, DistilBERT) and training settings (single-task learn- 
ing (STL), multi-task learning (MTL)). 


RQ1 Does BERT outperform previous methods? 

We first implemented a baseline single-task learning model 
by leveraging pre-trained GloVe (Global Vectors for Word 
Representation) [18] word embeddings. We added a
BatchNormalization layer on top of GloVe, and the aggregate
representation of the input sequence for classification was
obtained by average-pooling the output of the
BatchNormalization layer. A dense layer was added on top for
performing classification.


We compared GloVe and BERT on every single task. As
shown in Table 5, the BERT-based STL model yields
substantial improvements over the previous GloVe-based
method. The STL-BERT model trained with 1000 data samples
outperformed the STL-GloVe model trained with 5000 data
samples on all tasks. This suggests that the need for labeled
data can be significantly reduced by leveraging a pre-trained
language model such as BERT.


RQ2 How does multi-task learning perform? 

By comparing MTL-BERT with STL-BERT and MTL-DistilBERT
with STL-DistilBERT when trained with a different
number of training samples, we found that jointly learning
related tasks improves the performance of the suggestion-
detection task and the positive-tone detection task,
especially when we have limited training samples (i.e., when
training with 1K and 3K data samples). This suggests that
MTL can increase data efficiency. However, for the
problem-detection task, there is no significant difference between
the performance of the STL and MTL settings.


Table 5: Performance evaluation (average performance of 5 independent runs)

Model | Suggestion: Acc. | Macro-F1 | AUC | Problem: Acc. | Macro-F1 | AUC | Pos. Tone: Acc. | Macro-F1 | AUC
Training with 1000 labeled data samples
1 STL-GloVe (Baseline) | 82.0% | .744 | .865 | 80.2% | .790 | .879 | 76.0% | .700 | .823
2 STL-BERT | 90.0% | .868 | .975 | 89.2% | .892 | .955 | 87.0% | .828 | .940
3 MTL-BERT | 94.0% | .904 | .974 | 89.0% | .890 | .955 | 89.4% | .846 | .941
4 STL-DistilBERT | 92.4% | .890 | .970 | 88.0% | .880 | .950 | 86.2% | .822 | .933
5 MTL-DistilBERT | 93.8% | .910 | .971 | 89.0% | .886 | .951 | 88.6% | .824 | .939
Training with 3000 labeled data samples
1 STL-GloVe (Baseline) | 88.4% | .836 | .929 | 83.0% | .830 | .898 | 82.4% | .770 | .872
2 STL-BERT | 93.8% | .910 | .980 | 90.6% | .904 | .964 | 89.6% | .858 | .948
3 MTL-BERT | 94.6% | .916 | .981 | 91.0% | .906 | .964 | 90.0% | .854 | .947
4 STL-DistilBERT | 94.0% | .910 | .979 | 89.8% | .900 | .962 | 89.0% | .850 | .942
5 MTL-DistilBERT | 94.2% | .916 | .978 | 89.6% | .892 | .960 | 90.2% | .850 | .945
Training with 5000 labeled data samples
1 STL-GloVe (Baseline) | 89.9% | .852 | .947 | 84.2% | .832 | .908 | 85.0% | .794 | .883
2 STL-BERT | 94.4% | .916 | .980 | 91.2% | .912 | .968 | 89.4% | .852 | .950
3 MTL-BERT | 94.8% | .922 | .982 | 91.0% | .908 | .966 | 90.8% | .854 | .951
4 STL-DistilBERT | 94.2% | .912 | .978 | 90.4% | .902 | .964 | 89.8% | .860 | .944
5 MTL-DistilBERT | 94.2% | .914 | .980 | 90.4% | .902 | .964 | 90.6% | .852 | .951


Table 6: The # of parameters for each setting

Setting | # of parameters
STL-BERT * 3 | 328M
STL-DistilBERT * 3 | 199M
MTL-BERT | 109M
MTL-DistilBERT | 66M


https://nlp.stanford.edu/projects/glove/


Additionally, MTL can considerably reduce the model size. 
As shown in Table 6, three BERT-based STL models would 
have more than 328M parameters, and this number would 
be 199M for the DistilBERT-based models. However, if we
employ the MTL models for evaluating peer-review com- 
ments, the number of parameters would be reduced to 109M 
and 66M, respectively. This demonstrates that using MTL 
to evaluate reviews can save considerable memory resources 
and reduce the response time of peer-review platforms. 
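The size reduction can be checked with simple arithmetic on the parameter counts in Table 6. The sketch below only restates those four numbers; the dictionary keys and the `reduction` helper are illustrative names.

```python
# Parameter counts in millions, copied from Table 6.
params_m = {
    "STL-BERT x 3": 328,
    "MTL-BERT": 109,
    "STL-DistilBERT x 3": 199,
    "MTL-DistilBERT": 66,
}

def reduction(stl_total, mtl_total):
    """Return (size ratio, fraction of parameters saved) when MTL replaces three STL models."""
    return stl_total / mtl_total, 1.0 - mtl_total / stl_total

bert_ratio, bert_saved = reduction(params_m["STL-BERT x 3"], params_m["MTL-BERT"])
distil_ratio, distil_saved = reduction(params_m["STL-DistilBERT x 3"],
                                       params_m["MTL-DistilBERT"])
```

In both cases the multi-task model is about one third the size of three separate single-task models, a saving of roughly two thirds of the parameters.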


RQ3 How does DistilBERT perform?

By comparing DistilBERT and BERT on both STL and 
MTL settings, we found that BERT-based models slightly 
outperformed DistilBERT-based models. This result implies
a trade-off between performance and model size when
selecting the model to be deployed on peer-review platforms.
If accuracy matters more than the memory usage and
response time of the platform, the MTL-BERT model is the
better choice. Otherwise, the MTL-DistilBERT model should
be deployed.
6. CONCLUSIONS


In this study, we implemented single-task and multi-task 
models for evaluating peer-review comments based on the 
state-of-the-art language representation models BERT and 
DistilBERT. Overall, the results showed that BERT-based
STL models yield significant improvements over the previous 
GloVe-based method on tasks of detecting a single feature. 
Jointly learning different tasks simultaneously further im- 
proves performance and saves considerable memory usage 
and response time for peer-review platforms. The MTL- 
BERT model should be deployed on peer-review platforms, 
if our focus is on high accuracy instead of memory resource 
usage and response time of the platforms. Otherwise, the 
MTL-DistilBERT model is preferred. 


There are three limitations to this study. Firstly, we em- 
ployed three features of high-quality peer reviews to eval- 
uate a peer-review comment. However, it is still unclear 
how MTL will perform if we learn more tasks simultane- 
ously. Secondly, we mainly focused on a hard-parameter 
sharing approach for constructing MTL models. However, 
some studies have found that the soft-parameter sharing ap- 
proach might be a more effective method for constructing 
multi-task learning models. Thirdly, the performance of the 
model has not been evaluated in actual classes. We intend to 
deploy the model on the peer-review platform and evaluate 
the model extrinsically in real-world circumstances. 


These preliminary results serve as a basis for our ongoing 
work, in which we are building a more complex all-in-one 
model for comprehensively and automatically evaluating the 
quality of peer review comments to improve peer assessment. 
In the future, we will attempt to evaluate peer reviews based 
on more predetermined features and use fine-grained labels 
(e.g., instead of evaluating whether a peer-review comment 
contains suggestions, we will evaluate how many suggestions 
are contained in a review comment). 
ABSTRACT


The prevalence of online education systems provides oppor- 
tunities to deliver personalized learning at scale. Educa- 
tional systems need to assess students so that they can pro- 
vide better curricula tailored to each student’s unique needs. 
Since there is a limited amount of time for quizzing a stu- 
dent, we need to test each student using those questions 
that capture the most information about their level of un- 
derstanding of various concepts. In this paper, we formally 
pose the problem and present multiple approaches for learn- 
ing a quizzing policy to determine a personalized sequence of 
questions for each student that best predicts their knowledge 
state. We first introduce simple heuristics including random 
selection and an uncertainty sampling approach inspired by 
an active learning framework. We then develop a reinforce- 
ment learning (RL) approach for designing a quizzing policy. 
Using simulations of students’ knowledge states, we provide 
initial evidence that an RL-based approach can improve over 
simple heuristics. We further demonstrate the effectiveness 
of our approaches using a real-world dataset consisting of 
over 1.5 million examples of students’ answers to mathe- 
matics questions from Eedi, an online educational platform. 


Keywords 


reinforcement learning, knowledge state, quizzing policy 


1. INTRODUCTION 


Online education systems are making high-quality educa- 
tion more accessible for students across the globe. These 
systems provide various educational resources such as in- 
structional videos and exercises. To provide personalized 
curricula for improving the learning outcomes of students, 
an online education system needs to accurately infer each 
student’s knowledge state (i.e., their level of understanding 
of various concepts) by quizzing them. This is a challeng- 
ing task because the quizzing time is limited. To make the 
most efficient use of each student’s time, it is important to 
prioritize those questions that reveal the most information 
about the student’s knowledge.


We focus on a specific goal for student assessment: given a
limit on the number of questions we are allowed to ask each
student, how can we determine a sequence of questions for each
student that best predicts their knowledge state? Specifically,
when an education system needs to assess a student to infer
their knowledge, the system selects a personalized
question for the student and records their response.
Based on the student’s response history
(i.e., a sequence of question-response pairs), the system
selects another question, and it repeats this until it has
exhausted its query budget (i.e., the maximum number of
queries allowed). We refer to the function that provides the
next question to query based on students’ response histories
as a quizzing policy (QP).
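The interaction described above can be sketched as a simple loop in which a quizzing policy maps the response history to the next question until the budget is spent. All names here (`run_quiz`, `random_policy`, the simulated student) are hypothetical, and the random policy is only the simplest possible baseline.

```python
import random

def run_quiz(policy, answer_fn, questions, budget):
    """Ask up to `budget` questions, each chosen by `policy` from the history."""
    history = []  # (question, response) pairs collected so far
    for _ in range(budget):
        asked = {q for q, _ in history}
        remaining = [q for q in questions if q not in asked]
        if not remaining:
            break
        question = policy(history, remaining)
        history.append((question, answer_fn(question)))
    return history

def random_policy(history, remaining, rng=random.Random(0)):
    """Baseline QP: ignore the history and pick any unasked question."""
    return rng.choice(remaining)

questions = [f"q{i}" for i in range(10)]
known = {"q1", "q4", "q7"}  # hypothetical simulated student
history = run_quiz(random_policy, lambda q: int(q in known), questions, budget=4)
```

A learned policy would replace `random_policy` and use `history` to decide which question is most informative to ask next.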


We define the task of learning a QP in the context of the 
NeurIPS 2020 Education Challenge launched by Eedi 
(6), an online educational platform with thousands of ac- 
tive users daily around the globe. We consider a set of 948 
multiple-choice mathematics questions that correspond to 
57 different concepts. Specifically, the task is to obtain a 
limited set of answers from each student for inferring the 
student’s knowledge on the 57 concepts and then predict 
the student’s performance on unseen questions based on the 
inferred knowledge state. 


The key challenge in designing a QP is related to a cru- 
cial task in machine learning: active learning (AL). For 
many learning tasks (e.g., image classification, text
classification), obtaining sufficient labeled data for training
high-performance models is costly. AL aims to reduce
the amount of annotated data needed by having the model 
carefully select which data points should be labeled. 


Existing methods for AL include heuristics such as selecting
the data points about which the model is most uncertain
(i.e., uncertainty sampling) [24], picking
the instances about which a set of possible different models
disagree the most (i.e., query by committee) [10], or
choosing the example that can lead to the most immediate
improvement in model performance (i.e., estimated error
reduction) [12].
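As a minimal illustration of uncertainty sampling for a binary prediction task, the sketch below picks the item whose predicted probability is closest to 0.5. The function name and the probabilities are made up; in a quizzing setting the values would be a model’s predicted probabilities that the student answers each candidate question correctly.

```python
def uncertainty_sampling(predicted_probs):
    """Return the key whose predicted probability is closest to 0.5."""
    return min(predicted_probs, key=lambda k: abs(predicted_probs[k] - 0.5))

# Hypothetical predicted probabilities of a correct answer per question.
probs = {"q1": 0.95, "q2": 0.48, "q3": 0.10, "q4": 0.70}
next_question = uncertainty_sampling(probs)
```

The model is almost sure about q1 and q3, so asking them would reveal little; q2 is the most uncertain and is selected.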


In addition to these heuristics for AL, recent studies
[19] [9] have explored how to use reinforcement learning (RL)
to learn the AL strategy itself. RL is a powerful
framework where an agent learns how to make good decisions
(actions) in different situations (states) through trial
and error. In the RL terminology, the action space provides 
the set of actions that can be taken by the RL agent at a 
given point in time; the state space defines the “state of the 
world” that is visible to the RL agent; and the reward func- 
tion assigns a value to the outcome of each action taken by 
the RL agent. In this case, the set of possible instances to 
be labeled defines the action space; the state space is a rep- 
resentation of the sequence of instances that have already 
been annotated; and the gain in prediction accuracy as a re- 
sult of an action defines the reward. The RL agent learns to 
improve its decision-making over time based on the reward 
signals it receives. Inspired by these studies, we investigate 
using RL to learn a QP for personalized student assessment. 


1.1 Our approach and contributions 

In this paper, we formalize the problem of learning a QP for 
inferring the student knowledge state and present several 
different approaches including simple heuristics and an RL- 
based approach. Our contributions are: 


• We formulate the problem of learning a QP to infer
student knowledge.


• We propose simple heuristics (i.e., random selection,
uncertainty sampling) and an RL-based approach for
learning a QP.


• We evaluate the performance of different QPs on a
synthetic dataset and a publicly available dataset consisting
of over 1.5 million examples of students’ answers to
mathematics questions from Eedi.


To support the reproducibility of the experimental results and
facilitate research in this area, the code and dataset are
publicly available.


1.2 Related work


AL is a popular methodology in machine learning that aims 
to reduce the amount of annotated data needed by hav- 
ing the model carefully select which data points should be 
labeled. The task of designing a QP is closely related to 
AL because the goal is to optimally select a set of ques- 
tions to ask students to gain the most information about 
their knowledge states. Uncertainty sampling 
is one of the most popular heuristics for AL because it is 
straightforward and computationally efficient. Specifically, 
it suggests labeling instances that are closest to the model’s 
decision boundary (i.e., the most uncertain). Woodward and 
Finn propose the first application of RL to the task of 
AL for image classification. Other studies [9] explore
how to train an AL policy that can generalize across diverse 
datasets. 


RL has also been applied to various tasks in education such 
as learning an instructional policy [28], 
learning a hint policy for helping students solve multi-step 
problems (7, and generating new educational tasks fi]. We 
introduce a different policy, a quizzing policy for inferring 
the student knowledge state, which has not been designed 
using RL in previous literature. 


https://github.com/joyheyueya/quizzing-policy 
Figure 1: Graphical representation of knowledge. This is
an example of an undirected graph where each node (circle)
represents a concept, and each edge connects a pair of similar
concepts: $c_1$ is an independent concept, $c_2$ and $c_3$ are
similar, and $c_4$, $c_5$, and $c_6$ are similar.


There is prior work on the efficient assessment of knowledge 
ig). Our student knowledge model is inspired by the knowl- 
edge components (i.e., concepts / skills) used in Bayesian 
Knowledge Tracing (BKT) [4], which represents the state 
for each knowledge component as a binary variable: 1 if the 
knowledge component is known, 0 otherwise. 


2. PROBLEM FORMULATION 


In this section, we formalize the problem of learning a quizzing 
policy (QP) for inferring the student knowledge state. 


2.1 Student knowledge state 


Our goal is to infer student knowledge on a set of $n$ concepts
$C = \{c_1, \dots, c_n\}$ associated with a set of $m$ questions
$X = \{x_1, \dots, x_m\}$. For simplicity, each question corresponds
to a single concept, but each concept might be associated
with more than one question ($m \gg n$). A student’s knowledge
state $h$ is defined as $h = [v_1, \dots, v_n]$, where
$v_1, \dots, v_n$ are binary variables that indicate whether or not
the student knows each concept in $C$: $v_i = 1$ if $c_i$ is known,
and $v_i = 0$ otherwise. Formally, we define a hypothesis space
$\mathcal{H}$ for all possible knowledge states:
$\mathcal{H} = \{0,1\}^n$. We assume $h$ is
fixed during the assessment.
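To give a sense of the size of the hypothesis space, the sketch below enumerates all knowledge states for a small number of concepts. The function name is illustrative; the count of 57 concepts is taken from the Eedi task described in the introduction.

```python
from itertools import product

def hypothesis_space(n):
    """All binary knowledge states (v_1, ..., v_n) for n concepts."""
    return list(product((0, 1), repeat=n))

H = hypothesis_space(3)       # 8 states, from (0, 0, 0) to (1, 1, 1)
eedi_state_count = 2 ** 57    # number of states for 57 concepts, far too many to enumerate
```

The number of states doubles with every added concept, which is why a structured model of the knowledge state is needed rather than a table over all states.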


2.2 Graphical representation of knowledge 
We consider two assumptions that are useful for inferring 
the student knowledge state: 1) difficult concepts are more 
likely to be unknown, and easy concepts are more likely to 
be known; 2) similar concepts are more likely to have the 
same value (i.e., a student who knows one concept is also 
likely to know the other concepts that are similar to the one 
that is already known). These influences can be represented 
by an undirected graph where each node corresponds to a 
concept, and each edge connects a pair of concepts that are 
similar (see Figure 1). In the Eedi dataset (described in
Section 4.2.1), we consider every pair of concepts that share
the same super-concept to be similar (e.g., there is an edge 
between “Rearranging Formula and Equations” and “Substi- 
tution into Formula” because they are both under the same 
super-concept “Formula”). Based on this graphical struc- 
ture, we model a student’s knowledge state using a Markov 
Random Field (MRF). 


An MRF is a probability distribution over a set of variables
that satisfy certain properties defined by an undirected
graph. In our case, we define a probability distribution
$p$ over binary variables $v_1, \ldots, v_n$ defined by an undirected
graph $G = (V \cup F, E)$ where $V$ is the set of nodes (concepts),
$F$ is the set of factors that define a set of functions
over the variables that they are connected with, and $E$ is
the set of edges (see Figure 2).


An MRF allows us to calculate the probability of each way of
assigning values to binary variables $v_1, \ldots, v_n$, which represent
the knowledge state of the corresponding concepts $c_1, \ldots, c_n$.
The probability $p$ has the form:

$$p(v_1, \ldots, v_n) = \frac{1}{Z} \prod_{a \in F} \psi_a(v_a) \tag{1}$$

where $a$ represents a subgraph of $G$, and $\psi_a$ denotes a factor
that defines a non-negative function over the set of variables
$v_a$ in $a$. $Z$ is a normalizing constant that ensures the
distribution sums to one:

$$Z = \sum_{v_1, \ldots, v_n} \prod_{a \in F} \psi_a(v_a) \tag{2}$$


We specify factors for an MRF based on two assumptions
about variables $v_1, \ldots, v_n$. For our first assumption that
difficult concepts have a higher probability of being unknown,
we define unary factors:

$$\psi_i(v_i) = \begin{cases} 1 - \text{difficulty}_{c_i} & \text{if } v_i = 1 \\ \text{difficulty}_{c_i} & \text{otherwise} \end{cases}$$

where $\text{difficulty}_{c_i}$ is a real number that represents the
difficulty of the concept $c_i$ that $v_i$ corresponds to, and
$0 < \text{difficulty}_{c_i} < 1$. A higher $\text{difficulty}_{c_i}$ value
means $c_i$ is more difficult.


For our second assumption that variables corresponding to
similar concepts are more likely to have the same values,
we define binary factors between every pair of nodes $(v_i, v_j)$
that are connected by an edge in graph $G$:

$$\psi_{i,j}(v_i, v_j) = \begin{cases} \text{influence} & \text{if } v_i = v_j \\ 1 - \text{influence} & \text{otherwise} \end{cases}$$

where influence represents a constant that satisfies
$0.5 < \text{influence} < 1$. A greater influence value means we want to
assign a higher probability to an assignment that gives the
same values to variables corresponding to similar concepts.
In our work, we fix influence to be 0.7. We also tried similar
values, and they led to similar results.
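To make the factor definitions concrete, the following is a minimal sketch that evaluates Equations 1 and 2 by brute-force enumeration on a toy graph. It is only an illustration under stated assumptions: the three-concept graph, the difficulty values, and all function names are hypothetical, and enumeration is feasible only for a handful of concepts, whereas the paper relies on Loopy Belief Propagation for the 57-concept graph.

```python
from itertools import product

INFLUENCE = 0.7  # value fixed in the text


def unary(v, difficulty):
    """Unary factor: 1 - difficulty if the concept is known, else difficulty."""
    return 1.0 - difficulty if v == 1 else difficulty


def pairwise(vi, vj, influence=INFLUENCE):
    """Binary factor: favors equal values on similar (connected) concepts."""
    return influence if vi == vj else 1.0 - influence


def unnormalized(state, difficulties, edges):
    """Product of all factors for one assignment (numerator of Eq. 1)."""
    score = 1.0
    for v, d in zip(state, difficulties):
        score *= unary(v, d)
    for i, j in edges:
        score *= pairwise(state[i], state[j])
    return score


def joint_distribution(difficulties, edges):
    """Enumerate all 2^n assignments and normalize by Z (Eq. 2)."""
    states = list(product([0, 1], repeat=len(difficulties)))
    scores = [unnormalized(s, difficulties, edges) for s in states]
    z = sum(scores)
    return {s: sc / z for s, sc in zip(states, scores)}


def marginals(dist, n):
    """P(v_i = 1) for each concept, obtained by summing the joint."""
    return [sum(p for s, p in dist.items() if s[i] == 1) for i in range(n)]


# Hypothetical toy graph: concept 0 is isolated; concepts 1 and 2 are similar.
difficulties = [0.5, 0.2, 0.8]
edges = [(1, 2)]
dist = joint_distribution(difficulties, edges)
probs = marginals(dist, 3)
```

In this toy example the easy concept 1 ends up with a marginal below 0.8 and the hard concept 2 with a marginal above 0.2, because the binary factor pulls the two connected concepts toward the same value.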


2.3 Quizzing policy for knowledge inference 
Since there is a cost associated with each question we ask
students (e.g., time, student’s energy), we need to select a
limited number of questions that reveal the most about their
knowledge state. Thus, student knowledge prediction can
be framed as a pool-based active learning (AL) task with a
given query budget $T$. For simplicity, we assume querying
each question incurs the same cost and define $T$ to be the
total number of queries we are allowed.


We describe the AL framework in detail in Algorithm 1.
At a given time step $t$, we have a labelled set $L$ that consists
of all the questions we have asked the student and their
responses. Formally, $L = \{(x^\tau, y^\tau)\}_{\tau=1}^{t}$ where $x^\tau \in X$, and
$y^\tau \in \{0,1\}$ is the student’s response to $x^\tau$ ($y^\tau = 0$ if the
response is incorrect, $y^\tau = 1$ if the response is correct). We
also have an unlabelled set $U$ consisting of all the questions
that we have not asked. Based on $L$, we have a belief $B_h$
about the student’s knowledge $h$. Formally, $B_h = [b_1, \ldots, b_n]$
where $b_i$ is the probability of knowing the concept $c_i$ (i.e.,
$v_i = 1$ with a probability of $b_i$). We define $\text{Binary}(B_h)$
as a function that converts probabilities into binary values
using a threshold of 0.5 (1 if $b_i > 0.5$ and 0 otherwise).
$\text{Binary}(B_h)$ gives the inferred binary knowledge state. We
update $B_h$ based on $L$ by running the Loopy Belief Propagation
algorithm (LBP) on our graph defined in Section 2.2.
LBP takes $L$ as input and outputs the probabilities
$b_1, \ldots, b_n$ ($0 < b_i < 1$). Additionally, we have a QP that
takes $B_h$ as input and outputs the next question to ask the
student. Specifically, a policy $\pi(\cdot \mid B_h)$ provides a probability
distribution with support over all questions in $U$ given $B_h$.
We can then sample a question from $\pi(\cdot \mid B_h)$.

Figure 2: Modeling graphical student knowledge using
MRF. This models the knowledge representation in Figure
1 as a factor graph. Each node $v_i$ is a binary variable that
represents the knowledge state of the corresponding concept
$c_i$. Factors are represented by rectangles. There is a unary
factor for every node and a binary factor between every pair
of nodes connected by an edge to model the dependency
between variables.


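Algorithm 1: Active learning for inferring knowledge
Input: budget $T$, quizzing policy $\pi$
Output: $h$
Initialize $L_0 \leftarrow \emptyset$, $U_0 \leftarrow \{x_i\}_{i=1}^{m}$
for $t = 1, 2, 3, \ldots, T$ do
    $B_{h_t} = \text{LBP}(L_{t-1})$
    $x^t \sim \pi(\cdot \mid B_{h_t})$
    $L_t \leftarrow L_{t-1} \cup (x^t, y^t)$
    $U_t \leftarrow U_{t-1} \setminus x^t$
end
$h \leftarrow \text{Binary}(B_{h_T})$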


Algorithm 1 runs as follows: at each time step $t$, we first
get our current belief $B_{h_t}$ based on the previously labelled
set $L_{t-1}$ (i.e., the set of all the questions we have asked the
student before time step $t$ and their responses). We then
select a question $x^t$ from the previously unlabelled set $U_{t-1}$
to ask the student by sampling from $\pi(\cdot \mid B_{h_t})$, which defines
a probability distribution with support over all questions in
$U_{t-1}$ given $B_{h_t}$. Then, we update $U_{t-1}$ to $U_t$ by removing $x^t$
from $U_{t-1}$ and update $L_{t-1}$ to $L_t$ by adding $x^t$ and its label
$y^t$ to $L_{t-1}$. The quizzing process terminates when the query
budget is exhausted. In this work, we fix $T = 10$ as required
by the NeurIPS 2020 Education Challenge [27]. The final
output of the algorithm $h$ is the student’s knowledge state
at time step $T$, which is inferred based on $B_{h_T}$.
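The loop of Algorithm 1 can be sketched as follows. This is an illustration rather than the authors' implementation: the belief update is a simple stand-in for LBP that does not propagate information between concepts, the simulated student is noise-free, the question-to-concept map is hypothetical, and the sketch assumes the belief is refreshed once more after the last response before it is binarized.

```python
import random


def run_quiz(questions, answer, update_belief, policy, budget=10):
    """Active-learning loop with a pluggable belief update and quizzing policy."""
    labelled = []                      # L: (question, response) pairs
    unlabelled = list(questions)       # U: questions not yet asked
    for _ in range(budget):
        belief = update_belief(labelled)      # plays the role of LBP(L_{t-1})
        x = policy(belief, unlabelled)        # next question chosen by the policy
        labelled.append((x, answer(x)))       # add (x^t, y^t) to L
        unlabelled.remove(x)                  # remove x^t from U
    final_belief = update_belief(labelled)    # assumption: use all responses
    return [1 if b > 0.5 else 0 for b in final_belief]   # Binary(.)


# Hypothetical setup: 12 questions spread evenly over 4 concepts.
n_concepts = 4
concept_of = {q: q % n_concepts for q in range(12)}
truth = [1, 0, 1, 0]


def answer(q):
    """Noise-free simulated student: correct iff the concept is known."""
    return truth[concept_of[q]]


def update_belief(labelled):
    """Stand-in for LBP: only the queried concept's belief changes."""
    belief = [0.5] * n_concepts
    for q, y in labelled:
        belief[concept_of[q]] = 0.9 if y == 1 else 0.1
    return belief


def random_policy(belief, unlabelled):
    """QP-RANDOM: uniform choice over the unlabelled questions."""
    return random.choice(unlabelled)


random.seed(0)
estimate = run_quiz(range(12), answer, update_belief, random_policy, budget=10)
```

With 10 of the 12 questions asked, every concept is queried at least once in this toy setup, so the estimate matches the true state; with a real belief update and noisy responses this would not be guaranteed.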


2.4 Evaluation 

We evaluate our QPs using two methods. First, we create a
synthetic dataset consisting of simulated students (see Section
4.1). We predict each student’s knowledge state using
Algorithm 1. Given a prediction result $h$, we calculate the
prediction accuracy using the following equation:

$$\text{Acc}(h) = \frac{1}{n} \sum_{i=1}^{n} \mathbb{1}(h[i] = h^*[i]) \tag{3}$$

where $h^*$ is the actual knowledge state.


Second, we apply our QPs to the NeurIPS 2020 Education
Challenge (see Section 4.2). The challenge is to obtain a
limited set of answers from each student for predicting the
correctness of their answers to the remaining questions. Our
approach to this challenge is to first infer a student’s knowledge
state using Algorithm 1 and then predict the student’s
responses to the remaining questions based on the inferred
knowledge state. Specifically, we design an additional model
(see Section 4.2.2) that takes in our belief about the student’s
knowledge state at the final time step $B_{h_T}$ and outputs
the student’s response to each of the $m$ questions. Formally,
the vector $y \in \mathbb{R}^m$ denotes the output of the model.
We calculate the prediction accuracy as:

$$\text{Acc}(y) = \frac{1}{|U_T|} \sum_{x_i \in U_T} \mathbb{1}(y[i] = y^*[i]) \tag{4}$$

where $U_T$ is the set of unlabelled questions at the final time
step (unseen by the model), $y^*[i]$ is the student’s actual
response to $x_i$, and $y[i]$ is the predicted response.
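The two accuracy measures can be sketched in a few lines. The function names and example vectors are hypothetical, and the second function assumes the sum over unasked questions is normalized by the number of unasked questions.

```python
def knowledge_accuracy(h_pred, h_true):
    """Fraction of concepts whose inferred state matches the true state (Eq. 3)."""
    return sum(p == t for p, t in zip(h_pred, h_true)) / len(h_true)


def response_accuracy(y_pred, y_true, unlabelled):
    """Accuracy over the questions that were never asked (Eq. 4).

    `unlabelled` holds the indices of the unasked questions.
    """
    return sum(y_pred[i] == y_true[i] for i in unlabelled) / len(unlabelled)


# Hypothetical example: 3 of 4 concepts are inferred correctly.
acc_h = knowledge_accuracy([1, 0, 1, 1], [1, 0, 0, 1])

# Hypothetical example: questions 1, 2 and 4 were not asked; 2 of 3 are predicted correctly.
acc_y = response_accuracy([1, 1, 0, 0, 1], [1, 0, 0, 1, 1], [1, 2, 4])
```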


3. DESIGNING QUIZZING POLICIES 


In this section, we present heuristics-based approaches and 
a reinforcement learning (RL)-based approach to designing 
a quizzing policy (QP) that takes in a belief $B_h$ about a
student’s knowledge state and outputs the next question to 
ask the student. 


3.1 Heuristic approaches 
We present two simple heuristics for designing a QP: random
selection (QP-RANDOM) and uncertainty sampling (QP-UNCERTAIN).
QP-RANDOM is straightforward: we always
randomly select a question from the unlabelled set $U$ (i.e.,
$\pi(a \mid B_h) = \frac{1}{|U|}$ for each $a \in U$). QP-UNCERTAIN suggests
picking a question corresponding to a concept that our current
model is most uncertain about (i.e., the concept with a
probability of being known that is closest to 0.5). Formally,
we define:

$$b^* = \arg\min_{b_i \in B_h} |b_i - 0.5|$$

We first pick a concept $c^*$ with a probability of being known
that is equal to $b^*$. We break ties randomly. We define $U_{c^*}$
as the set of questions that have not been asked and are
associated with $c^*$. We then define the policy:

$$\pi(a \mid B_h) = \begin{cases} \frac{1}{|U_{c^*}|} & \text{if } a \in U_{c^*} \\ 0 & \text{otherwise} \end{cases}$$
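A minimal sketch of QP-UNCERTAIN is given below. The function name and the `concept_of` mapping are hypothetical. The sketch also assumes that concepts with no remaining unasked questions are skipped, a case the text does not discuss.

```python
import random


def qp_uncertain(belief, unlabelled, concept_of, rng=random):
    """Pick an unasked question on the concept whose belief is closest to 0.5."""
    # Assumption: only concepts that still have unasked questions are eligible.
    available = {concept_of[q] for q in unlabelled}
    best_gap = min(abs(belief[c] - 0.5) for c in available)
    tied = [c for c in sorted(available) if abs(belief[c] - 0.5) == best_gap]
    concept = rng.choice(tied)                       # break ties randomly
    candidates = [q for q in unlabelled if concept_of[q] == concept]
    return rng.choice(candidates)                    # uniform over U_{c*}


# Hypothetical example: concept 1 (belief 0.55) is the most uncertain.
belief = [0.9, 0.55, 0.1]
concept_of = {0: 0, 1: 1, 2: 1, 3: 2}
picked = qp_uncertain(belief, [0, 1, 2, 3], concept_of)

# Once concept 1 has no unasked questions left, another concept is chosen.
fallback = qp_uncertain(belief, [0, 3], concept_of)
```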


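[Figure 3 diagram: LBP produces the belief $B_{h_t}$, which the
policy network with parameters $\theta$ maps to $\pi_\theta(a \mid s_t)$; the
selected action $a_t = x^t$ is posed to the student, and the
resulting pair $(x^t, y^t)$ is passed back to LBP.]

Figure 3: QP-RL approach.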


3.2 RL-based approach 


We now propose an RL-based approach (QP-RL) for learning
a QP. An RL agent learns how to make good decisions
over time by interacting with an environment that is typically
modeled as a Markov Decision Process (MDP). In our
problem setting, we define the MDP $M = (S, A, P, R, s_0)$ as
follows:

• The state space $S$ is the set of beliefs $B_h$ about student
knowledge (i.e., $S = \{[b_1, \ldots, b_n] \mid 0 < b_i < 1\}$);

• The action space $A$ is the set of questions that have not
been asked;

• The transition dynamics $P : S \times A \times S \to \mathbb{R}$ define the
probability of transitioning from one state to another by
taking a particular action. In our case, we transition to
state $s_{t+1}$ from $s_t$ based on the student’s response $y^t$;

• The reward function $R : S \times A \times S \to \mathbb{R}$ is defined as the
difference in prediction accuracy between the current time
step and previous step: for predicting student knowledge,
given the inferred knowledge state $h_{t+1}$ after taking action
$a_t$, we calculate the reward for time step $t$ as
$\text{Acc}(h_{t+1}) - \text{Acc}(h_t)$;

• The initial state $s_0$ corresponds to the initial belief about
student knowledge: each concept has a 0.5 probability of
being known.


Figure 3 shows an overview of the QP-RL approach. For
training the RL agent, we consider an episodic, finite-horizon
setting. During each episode, we train on one student’s data,
and the length of the episode is the query budget $T$. At each
time step $t$, we run the LBP algorithm that takes in the
student’s response history $L_{t-1} = \{(x^\tau, y^\tau)\}_{\tau=1}^{t-1}$ to update
our belief about the student’s knowledge state $B_{h_t}$. Then,
the RL model, which is a neural network with parameters
$\theta$, takes $B_{h_t}$ as input (i.e., $s_t = B_{h_t}$) and outputs a vector
$p_c \in \mathbb{R}^n$ which represents the probability of selecting a question
corresponding to each of the $n$ concepts. We first select
a concept $c_i$ by sampling based on $p_c$ and then randomly
select one question from $U_{c_i}$ (the set of questions that have not
been asked and are associated with $c_i$). We then define the
final policy parametrized by $\theta$:
$\pi_\theta(a \mid B_{h_t}) = \frac{p_c[i]}{|U_{c_i}|}$ for $c_i \in C$
and $a \in A$ associated with $c_i$. Our policy $\pi_\theta(a \mid B_{h_t})$ allows us to select the
next question to query and add the next question-response
pair $(x^t, y^t)$ to the response history. We then update the belief to $B_{h_{t+1}}$
based on the updated response history using the LBP algorithm.
We calculate the reward for the current time step as
$r_t = \text{Acc}(\text{Binary}(B_{h_{t+1}})) - \text{Acc}(\text{Binary}(B_{h_t}))$.

We use the REINFORCE policy gradient method [29] to
learn our policy $\pi_\theta$ parametrized by $\theta$. In each episode
corresponding to a single student, the RL agent performs an
update as follows. First, an initial state $s_0$ (the initial belief
that each concept has a 0.5 probability of being known by
the student) is generated. Then, the policy $\pi_\theta$ is executed
until the episode ends, generating a sequence of experience
given by $(s_t, a_t, r_t)_{t=1,2,\ldots,T}$. Then, in this episode, for each
$t \in \{1, 2, \ldots, T\}$, we use the following gradient update with
$\eta$ as learning rate:

$$\theta \leftarrow \theta + \eta \cdot \underbrace{\Big(\sum_{\tau=t}^{T} r_\tau\Big) \cdot \nabla_\theta \log \pi_\theta(a_t \mid s_t)}_{\text{gradient at time step } t \text{ in an episode}} \tag{5}$$


In experiments, we use the architecture used in [7]. Specifically,
the policy network is a 3-layer fully connected neural
network with the following architecture: the input layer has
$n = 57$ units for $B_{h_t}$; the first and second hidden layers
each have 128 hidden units; and the output is a vector $p_c \in \mathbb{R}^n$
where $n = 57$ to produce a probability of selecting each of
the 57 concepts. The first two hidden layers use ReLU activations,
and the final layer uses the softmax function to
ensure probabilities sum to 1. We use the Adam optimizer
for training.
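The update in Equation 5 can be illustrated with a deliberately simplified policy. In the sketch below the policy is a softmax over a plain vector of logits $\theta$ that ignores the state, which is an assumption made only to keep the example short; the paper's policy is the neural network described above, trained with Adam rather than plain gradient ascent. All function names are hypothetical.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def rewards_to_go(rewards):
    """For each step t, the sum of rewards from t to the end of the episode."""
    out, running = [], 0.0
    for r in reversed(rewards):
        running += r
        out.append(running)
    return out[::-1]


def reinforce_update(theta, actions, rewards, lr=0.1):
    """Apply the Eq. 5 update once per time step of a finished episode."""
    theta = list(theta)
    for a, g in zip(actions, rewards_to_go(rewards)):
        probs = softmax(theta)
        for k in range(len(theta)):
            # Gradient of log softmax w.r.t. logit k: 1[k == a] - probs[k].
            grad_log = (1.0 if k == a else 0.0) - probs[k]
            theta[k] += lr * g * grad_log
    return theta


# Hypothetical episode: one step, action 1 taken, reward 1.
returns = rewards_to_go([1, 0, 2])
new_theta = reinforce_update([0.0, 0.0, 0.0], actions=[1], rewards=[1.0])
```

After this single update the logit of the rewarded action rises and the others fall by the same total amount, so the policy shifts probability toward the action that preceded the accuracy gain.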


4. EXPERIMENTAL EVALUATION 


We first evaluate and compare our quizzing policies (QPs) 
using a synthetic dataset. We then apply our QPs to the 
Eedi dataset from the NeurIPS 2020 Education Challenge. 


4.1 Simulations 

We simulate virtual students taking the assessment quiz and 
test how well we can predict students’ knowledge states in 
a controlled setting using different QPs. 


4.1.1 The synthetic dataset 

We generate a dataset consisting of 24,000 simulated student
knowledge states. To do so, we first construct a graph
for representing the student knowledge state that we aim
to infer (see Section 2.2) and then get a probability distribution
over the binary variables in the knowledge state that
satisfies a set of assumptions about the student’s knowledge.
We then sample ground-truth student knowledge state values
from the probability distribution. In this simulation, we
use the same 57 concepts in the Eedi dataset (described in
Section 4.2.1) for constructing the graph. We assume some
of these concepts have different levels of difficulty, and similar
concepts are more likely to be assigned the same knowledge
state values.² Based on these assumptions, we assign
a value of difficulty to each of the 57 concepts. We define
$\text{difficulty}_{c_i} = 1 -$ the average correctness of the concept $c_i$
across all students’ answers in the Eedi dataset.³ We run
the LBP algorithm on the constructed graph to get a probability
distribution from which we sample student knowledge
states. Specifically, the output of the LBP algorithm gives
the probability of knowing each concept, and we sample values
of 0 or 1 for each concept to generate the ground-truth
student knowledge states in our synthetic dataset.

² Although our assumptions might not hold in a real-world
setting, the goal of this experiment is to compare different
QPs and investigate the potential of QP-RL for learning
a strategy tailored to a pre-defined knowledge structure.
For instance, compared to the heuristic approach QP-UNCERTAIN,
QP-RL should learn to select the questions
that are not only uncertain but can also give more information
about other questions that are not selected (e.g.,
selecting questions corresponding to concepts that are connected
with a lot of the other concepts).

Table 1: Test performance of different QPs on the synthetic
dataset. QP-UNCERTAIN achieves a better performance
than QP-RANDOM, and QP-RL improves over QP-UNCERTAIN
significantly.

QP            | Accuracy
QP-RL         | 0.721 ± 0.004
QP-UNCERTAIN  | 0.700 ± 0.002
QP-RANDOM     | 0.675 ± 0.003

[Figure 4 plot: cumulative average accuracy (y-axis) for QP-RL
and the heuristic baselines.]

Figure 4: Training performance of QP-RL on the synthetic
dataset compared to heuristics. QP-RL improves over QP-RANDOM
and QP-UNCERTAIN after about 6,000 episodes of
training. The cumulative average accuracy at each episode
is calculated as the average accuracy across all previous
episodes. Note that QP-RANDOM and
QP-UNCERTAIN are fixed policies that are not being trained.
The cumulative average accuracy for the first few episodes
might seem noisy due to small sample size.
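The sampling step of Section 4.1.1 could be sketched as follows, assuming each concept is drawn independently from its per-concept probability of being known. The function name, the seed, and the example probabilities are hypothetical; in the paper these probabilities come from running LBP on the 57-concept graph.

```python
import random


def sample_knowledge_states(marginals, n_students, seed=0):
    """Draw binary knowledge states, one Bernoulli sample per concept."""
    rng = random.Random(seed)
    return [
        [1 if rng.random() < p else 0 for p in marginals]
        for _ in range(n_students)
    ]


# Hypothetical per-concept probabilities for three concepts.
states = sample_knowledge_states([1.0, 0.0, 0.5], n_students=100)
```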


4.1.2 Results 

We split the dataset into 23,000 students as the training
set and 1,000 students as the test set. We train QP-RL
until the cumulative average accuracy converges. Figure 4
shows the training performance of QP-RL compared to
fixed heuristics. After training, we run each QP 10 times
on the test set to calculate the average accuracy and standard
deviation across these 10 trials (see Table 1). Although
QP-RL leads to a gain of about 2 percentage points in accuracy
compared to QP-UNCERTAIN (0.721 vs. 0.700), it requires a
moderate amount of training data
(> 6,000 students in this case). QP-UNCERTAIN is a less
optimal strategy but can achieve reasonably good performance
without any training data. These results provide
initial evidence that QP-RL can learn an effective QP, and
the performance can be improved further with more data.


³ For simulations, one could also try other difficulty values,
but it does not matter which specific difficulty value we assign
to each concept because the goal is to model a setting
where we have concepts of varying levels of difficulty.

[Figure 5 image: a sample multiple-choice question, “What is the
size of the obtuse angle AOC?”]

Figure 5: An example of a question in the Eedi dataset
[27]. For each multiple-choice question, exactly one choice
is correct.


4.2 NeurIPS 2020 Education Challenge 

We then apply our QPs to one of the tasks in the NeurIPS
2020 Education Challenge (see Section 4.2), which is to obtain
a limited set of answers from each student for predicting
the correctness of their answers to the remaining questions.
Our approach to this challenge is to first infer a student’s
knowledge state using Algorithm 1 and then predict the student’s
responses to the remaining questions based on the
inferred knowledge state.


4.2.1 The Eedi dataset 


The Eedi dataset contains student responses to multiple-choice
questions (see Figure 5) on various math topics,
collected between September 2018 and May 2020. It
contains 948 questions and a total of 1,508,917 responses
to these questions from 6,148 students. The dataset
is split into the training set (4,918 students), the validation
set (615 students), and the test set (615 students).


Each question in the dataset is associated with a list of sub- 
jects. Each subject covers an area of mathematics. These 
subjects are arranged in a tree structure by experts based 
on the generality of the subjects. For instance, “Fractions” 
is the parent subject of “Multiplying Fractions” and “Simpli- 
fying Fractions”. For simplicity, we only consider the most 
granular subject (i.e., the leaves in the tree) as the concept 
that each question corresponds to. The 948 questions correspond
to 57 unique concepts. We consider concepts that
share the same super-concept (i.e., parent) to be similar (see
Figure 1).


4.2.2 Student performance prediction 

To predict a student’s responses to unseen questions based
on the inferred knowledge state, we propose a neural network-based
model that takes in the belief about the student’s
knowledge $B_{h_T}$ at time $T = 10$ (our belief about their
knowledge after we have asked 10 questions) and outputs
the probability of answering each of the 948 questions in
the dataset correctly. The student performance prediction
model is a 3-layer fully connected neural network with the
following architecture: the input layer has $n = 57$ units for
$B_{h_T}$; the first hidden layer has 256 hidden units; the second
hidden layer has 512 units; and the output is a vector
$y \in \mathbb{R}^m$ where $m = 948$ to represent the probability of correctness
for each of the 948 questions. The first two hidden
layers use ReLU activations, and the final layer uses the sigmoid
function to ensure the output values are between 0 and
1. We use the Adam optimizer for training. We convert
the output probabilities into binary values of 0 or 1 (0 if the
probability is less than 0.5, 1 otherwise) and calculate the
prediction accuracy using Equation 4. We train the model
using randomly selected queries until the validation accuracy
converges. The model parameters are updated based
on binary cross-entropy loss.

Table 2: Test performance of different QPs on the Eedi dataset.
QP-RL improves slightly over QP-UNCERTAIN.

QP            | Accuracy
QP-RL         | 0.690 ± 0.005
QP-UNCERTAIN  | 0.680 ± 0.003
QP-RANDOM     | 0.684 ± 0.003


4.2.3 Results 

Given a trained performance prediction model from Section
4.2.2, we then train QP-RL using the difference in final
prediction accuracy between time steps as reward signals:
$r_t = \text{Acc}(\text{Binary}(y_t)) - \text{Acc}(\text{Binary}(y_{t-1}))$. After training,
we run each QP 10 times on the test set to calculate the
average accuracy and standard deviation across these 10 trials.
Table 2 shows that QP-RL improves slightly over QP-UNCERTAIN,
but the difference between QP-RL and QP-RANDOM
is not significant. Results in Section 4.1.2 show
that in a more controlled setting, QP-RL already requires a 
moderate amount of training data (> 6,000 students) to im- 
prove over heuristics. However, we only have training data 
from about 5,000 students in this experiment. Learning a 
QP from real students’ data that are noisy is more challeng- 
ing, and it may be the case that improving QP-RL further 
would require a much larger dataset. Even though QP-RL 
seems to require a substantial amount of training data, this 
is a one-time training, and the learned policy can be applied 
to future students. 


5. CONCLUSION 


Student assessment is a crucial component of many online 
education systems for improving student learning outcomes. 
Inferring student knowledge state by quizzing poses a tech- 
nical challenge: maximizing accuracy while minimizing the 
quizzing cost. In this paper, we show initial evidence that 
reinforcement learning (RL) provides a potential solution, 
improving over heuristics given sufficient training data. 


There are several research directions for future work. Fur- 
ther gains in accuracy could be achieved by exploring more 
powerful RL techniques and more complex student knowl- 
edge modeling techniques. In this work, we model all con- 
cepts that share the same super-concept as having the same 
relationship; however, there could be prerequisites as well 
as weaker and stronger relationships in reality. It would be 
important to study whether varying the influence values between
concepts would lead to gains in model performance.


ABSTRACT


The distributed practice effect suggests that students retain
learning content better when they pace their practice
over time. The key factors are practice dosage (intensity)
and timing (when to practice and how long to wait in between).
Inspired by the thriving development of image recognition,
this study adopts one of the successful techniques, multires- 
olution analysis (MRA), to model distributed and spaced 
practice (SP). We consider a sequence of practice sessions 
as a signal of the student’s learning strategy. Then, we 
apply the stationary wavelet transform (SWT) to extract 
practice patterns spaced by three periods: small, medium, 
large. The results reveal a positive correlation between
small-spaced practice and the exam grade. The benchmark
against baseline feature models shows that the SP patterns
significantly improve the goodness-of-fit and complement
the baseline models. This work successfully demonstrates
1) the use of MRA in modeling sequential patterns by event
intensity and event timing, and 2) that the MRA approach can be
used as an alternative method to improve existing student
models of practice effort.


Keywords 

distributed practice effect, testing effect, stationary wavelet 
transforms, signal multiresolution analysis, feature extrac- 
tion 


1. INTRODUCTION 


In the midst of blended and distance learning environments, 
it is increasingly important for students to manage their 
time efficiently. Numerous researchers have proposed and 
developed various student models to capture how students 
utilize their time during the learning process. The results 
have shown that distributed practice is a simple but effective 
time-management strategy for learning [5]. Essentially, dis- 
tributed practice comprises the testing and spacing effects, 
which suggest that the retention of information increases 
when the learner practices retrieving it in multiple spaced-out
practice sessions [3].


Optimal distributed practice requires a combination of both 
the intensity and the timing of the practice events. In other 
words, an expressive student model must capture the in- 
tensity of practice sessions spaced by different periods. Al- 
though the two features appear to be straightforward, it is 
not easy to incorporate them in a sequential behavior model. 
For example, typical sequence analysis or sequential pattern 
mining would expect discrete input data and extract com- 
mon patterns in the data according to the sequence sup- 
port (the number of occurrences). Finding a meaningful 
and interpretable threshold is usually an ad-hoc process and 
particularly challenging [4]. A large threshold value may
increase the chance of losing detail, and a small value may
introduce more noise and miss the context. In the case of
distributed practice, when the practice sessions are far apart, 
such a frequency-based approach will require more data to 
ensure sufficient within- and between-sequences support for 
a pattern of interest. To address this modeling challenge, 
we are motivated to explore an alternative computational 
method to capture the detail as well as the context, which 
can capture both the intensity and the timing of events at 
the same time. 


We reason that a student’s practice sessions distributed
over a timeline resemble a signal of their learning process,
where the strength of learning is quantified by increases
or decreases in the occurrence of the underlying
events. With this definition, we can utilize a signal processing
tool to extract the structural variation which approximates
distributed practice patterns. In this work, we adopt
the stationary wavelet transform (SWT) algorithm for this
purpose. SWT is a widely used signal processing tool in
applications such as image pattern recognition. The algorithm
decomposes an input signal into multiple components
resolutions. With the emphasis on the structure, we believe 
that SWT will allow us to overcome the challenge where the 
amount of sequential data may not be big enough to main- 
tain the sequence support. Additionally, applying SWT as 
a feature extraction method also allows us to examine struc- 
tural nuances in behavior sequences. 


2. RELATED WORK 


2.1 Sequence Analysis in Educational Data Mining

A behavior sequence is a chronicle of an activity. It describes
a collection of events whose order is meaningful.
We can choose different features to characterize such a sequence,
e.g., types of events, arrangements of events, time
gaps between events. The features directly affect what we
can find out from the analysis. Sequence analysis, in general,
can refer to any data model that involves some kind of
behavior sequence and its characterization. Extensive research
in EDM has used behavior sequence analysis to model
students’ development of knowledge or skills.


The most intuitive approach is sequential pattern mining, 
which aims to discover repeated string patterns, alignments, 
or the very next possible items [11]. For example, Gitin- 
abard et al. characterize behavior sequences by students’ 
interactions with online tools [7]. They map the interaction 
sequences to study habits and use sequence patterns to dif- 
ferentiate the high-performing and low-performing students. 
Dermy and Brun argue that the time interval is the key to
modeling students’ activities [4]. They characterize behavior
sequences by time intervals between events and formalize the
temporal information in sequential pattern mining. Their 
experiment suggests a strong correlation between the stu- 
dents’ activities and the time information. 


One research gap we notice is that most of the reviewed
works focus on behavior sequences at a single time scale.
For example, consider a behavior sequence $e_1, e_2, \ldots, e_t$ where
$e_i$ is an event that occurs at time $i$. A typical sequence analysis
focuses on the relationship of adjacent events $e_{j-1}, e_j, e_{j+1}$
where $j \in 1, \ldots, t$. Since the step size is 1, such
a sequence is sometimes called a 1-sequence. Following this setting, a
pattern must consist of consecutive 1-sequences that meet predefined
criteria, e.g., the support. One limitation of 1-sequences
is that they cannot capture non-consecutive events.
Such non-consecutive events can provide a coarser view of
the behavior sequence and therefore of the context. Indeed, we can
try to increase the step size to have 2-sequences, 3-sequences,
or $k$-sequences where $k \in \mathbb{Z}$. Nonetheless, increasing the
step size inevitably reduces the number of $k$-sequences we
can find in a dataset. This situation may exclude potential
sequences of interest due to the threshold of the support
or the shortage of data. To tackle this challenge, we investigate
an alternative model that focuses on the structural
information of behavior sequences.


3. MULTIRESOLUTION SIGNAL ANALYSIS


In pattern recognition, the information of a given object
is usually determined by the variations of signal intensity.
For example, we can recognize a building as a building in
an image because its distinct contours and shapes are formed
by unique sequences of signal values that differ
from those of other objects. Such signal features are essentially
sequences of values (sets of numbers) where a variation of
intensity could suggest a potential event of interest, e.g., a
change of shapes or colors. However, because the objects
to analyze may have different shapes and sizes, the feature
extraction must consider “how far away” an event is from its
neighborhood to recognize the objects’ structures at multiple
resolutions. The fields of computer vision and signal
processing have developed various methods to address this
challenge, including multiresolution analysis and
wavelet transforms, which fit in the scope of this research.

[Figure 1 diagram: the input $f(x)$ passes through filter H to give
the detail signal $D_{2^j} f(x)$ and through filter G to give the
approximation signal $A_{2^j} f(x)$.]

Figure 1: The Decomposition of Multiresolution Analysis.
The process consists of two filters: the high-pass filter H and
the low-pass filter G. They iteratively extract the detail signal
and the approximation signal at the resolution $2^j$ from the
input signal $f(x)$ until a maximum level $L$. We can associate
the interpretation of the detail signal to the underlying time
scale. For example, say the sampling rate of the input signal
is 1. The detail signal at level 1 (a coarser level) denotes the
information from the frequency band [1/2, 1/4].


The multiresolution analysis (MRA) is a hierarchical frame- 
work that describes how to decompose a signal from fine 
to coarse levels [12]. The decomposition consists of a high- 
pass filter (H) and a low-pass filter (G). They are a pair of 
quadrature mirror filters and have the following relationship: 
g(n) = (—1)'~"A(1 — n) [12]. The high-pass filter extracts 
impulses, and meanwhile, the low-pass one retains the other 
information. This process is also known as Discrete Wavelet 
Transform (DWT). By convolution (*), the filtering process 
iteratively produces series of detail signals (D.; f(x)) and 
approximation signals (A.; f(x)) for the input signal f(a): 


$$D_{2^j} f = \big(f(u) * \psi_{2^j}(-u)\big)(2^{-j} n) \qquad (1)$$
$$A_{2^j} f = \big(f(u) * \phi_{2^j}(-u)\big)(2^{-j} n) \qquad (2)$$


where n ∈ Z. The high-pass and low-pass filters rely on a
wavelet function (ψ) and a scaling function (φ) that translate
and scale the input signal at different resolutions,
respectively. We illustrate the whole filtering process in
Figure 1 for reference. See [2] and [12] for more details about
the mathematical properties of the wavelet function and the
scaling function.
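As an illustration of the decomposition in Equations (1) and (2), the following minimal Python sketch applies the decimated Haar transform level by level. It is not the authors' implementation: the function names (haar_dwt_step, haar_dwt), the example signal, and the sign convention of the detail coefficients are assumptions made for illustration. Note that each level halves the length of its outputs.

```python
import math


def haar_dwt_step(signal):
    """One level of the decimated Haar DWT: returns (approximation, detail)."""
    if len(signal) % 2 != 0:
        raise ValueError("signal length must be even")
    s = math.sqrt(2.0)
    # Low-pass branch: scaled pairwise sums (approximation signal).
    approx = [(signal[i] + signal[i + 1]) / s for i in range(0, len(signal), 2)]
    # High-pass branch: scaled pairwise differences (detail signal).
    detail = [(signal[i] - signal[i + 1]) / s for i in range(0, len(signal), 2)]
    return approx, detail


def haar_dwt(signal, max_level):
    """Repeat the step on the approximation signal up to max_level."""
    details = []
    approx = list(signal)
    for _ in range(max_level):
        approx, detail = haar_dwt_step(approx)
        details.append(detail)
    return approx, details


# Hypothetical input: a step-like signal of length 8.
signal = [4, 4, 0, 0, 2, 2, 2, 2]
approx, details = haar_dwt(signal, max_level=2)
print([len(d) for d in details])
```

In this toy input the level-1 detail signal is flat because neighbouring samples are equal, while the level-2 detail signal reacts to the change between the first and second pair of samples.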


3.1 Analyzing Distributed Practice via Signals 
In this study, we focus on the practical implication of MRA 
and illustrate how it can help identify students’ distributed 
practice patterns. Students, especially those in an online 
learning or a blended learning environment, usually have 
greater flexibility in self-pacing their studies. In other words, 
they can watch the lecture videos and practice quiz ques- 
tions anytime at their convenience. This nature makes it 
challenging to analyze their behaviors on the timeline. 


For example, in scenario A, when a semester is two to three 
months long, we may find out that the students’ practice 
sessions are sparse and do not follow one unified schedule. 
This makes the time of sessions less discriminating in find- 
ing common behavioral patterns. Thus, the researcher may 
choose to ignore the time feature. An alternative approach 
(scenario B) is aggregating the practice sessions by an a priori
assumption (e.g., students always study on a week-by-week
basis or right before a deadline). However, all the above ap- 
proaches may inevitably lose some detail about how exactly 
the students utilize their schedules, either missing discrim- 
inating patterns over time (scenario A) or limited to those 
strictly abiding by the class-paced schedule (scenario B). 


To model a student’s distributed practice behavior over time,
we denote the behavior as a sequence of events f with T
discrete time steps: f = {e_1, e_2, ..., e_T}, where e_i
for 1 ≤ i ≤ T can be any activity event of interest.
A student distributes his/her practice sessions at different
rates or frequencies according to his/her preferences, path,
or pace. This representation resembles a signal and makes it
feasible to apply a signal processing algorithm.


Similar to pattern recognition in computer vision, a
student’s practice sessions are like the shapes and colors that
may evolve according to the sequences of signals. The
sessions may have different sizes, i.e., time gaps between any
two sessions. In other words, we aim to extract distributed
spaced-practice (SP) patterns that are subsets of the input
behavior sequence: SP_k ⊂ f, where k ∈ N and any two consecutive
event items {e_i, e_j} ⊂ SP_k are spaced by k time steps.
Following this idea, MRA is used to extract such a “feature”
from sequences of practice sessions, and the output is
interpreted as distributed practice patterns. For a practice
signal at a sampling rate of 1/day, the output signals can
represent the information at coarser rates, e.g., 1 per 4 days
and 1 per 8 days.


3.2 Stationary Wavelet Transform 

The outputs of DWT are signals that represent information
at different resolutions (or frequencies). The typical
implementation of DWT keeps downsampling the input signal to
obtain the detail signal and the approximation signal at each
resolution [12]. Therefore, the transform is time-variant.
The detail signal at one level is half as long as the one at
the previous level. This property may cause a misalignment
in time/frequency and makes the decomposition generate
fewer feature values for analysis. In this study, we follow
an alternative implementation of MRA, the Stationary Wavelet
Transform (SWT), which is time-invariant. SWT replaces
the downsampling by upsampling of the filters at each step [6].
Research has shown that SWT can improve the approximation and is
a preferred approach for applications like breakdown point
detection and denoising [1].
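To make the contrast with the decimated transform concrete, here is a minimal sketch of an undecimated (stationary) Haar transform. It is an illustration under stated assumptions rather than the implementation used in the study: the function name haar_swt, the periodic boundary handling, and the example signal are all hypothetical choices. Instead of dropping samples, the gap between the two filter taps doubles at each level, so every output keeps the input length.

```python
import math


def haar_swt(signal, max_level):
    """Undecimated Haar transform with periodic boundaries (an assumption).

    Returns the final approximation and a list of detail signals,
    each with the same length as the input.
    """
    n = len(signal)
    s = math.sqrt(2.0)
    approx = [float(v) for v in signal]
    details = []
    for level in range(1, max_level + 1):
        gap = 2 ** (level - 1)  # filter taps are spread apart instead of downsampling
        shifted = [approx[(i + gap) % n] for i in range(n)]
        details.append([(a - b) / s for a, b in zip(approx, shifted)])
        approx = [(a + b) / s for a, b in zip(approx, shifted)]
    return approx, details


# Hypothetical daily intensity: a single practice day in an 8-day window.
signal = [0, 0, 0, 0, 3, 0, 0, 0]
approx, details = haar_swt(signal, max_level=3)
print([len(d) for d in details])
```

Shifting the input by one day shifts every output by one day, which is the time-invariance property that motivates using SWT for aligned feature values.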


4. DATASETS 


To evaluate the method, we use two semesters’ datasets from 
the same undergraduate class offered in a four-year univer- 
sity in the United States: Spring 2018 (SP18) and Fall 2018 
(FA18). Both semesters lasted about three months. The class was
a typical lecture-style in-person class with weekly assignments
and monthly exams. The two semesters were practically identical,
with the same syllabus, instructor, and teaching assistants,
except for minor adjustments to the exam questions. Note that
most students in FA18 shared a similar background in engineering
because the class was required for first-year engineering students.
In SP18, there were more students from non-engineering
schools, which resulted in a much more diverse student background.


An online practice platform was introduced to the students
at the beginning and was available throughout the semester. On
the platform, students could take multiple-choice questions 
to practice and review the class content. For any given prac- 
tice question, the students had unlimited chances to retry; 
for any attempt, the corrective feedback (correct answer) 
would be provided upon submission. The questions served 
like so-called “tasks” in the context of tutoring systems [16]. 
Each of the tasks aims to help the student master some 
knowledge (or embedded knowledge components). However, 
the practice activity is different from working with assign- 
ments: there is no “hard deadline” by which the students 
must complete the practice questions. The students can 
practice on the platform as a kind of self-assessment [13]. 
In other words, the activity is “self-paced” [18] and aligned 
with the actions of reviewing slides, taking quizzes, or other 
practices that students can do for their benefit whenever 
they want. 


The students’ practice activities were logged as transactions 
of events, including the timestamps, the questions, and the 
correctness of the attempts. We processed and transformed 
the data into sequences of daily practice intensity. Here, the 
term “intensity” refers to the number of unique questions 
solved by a student. Each day is assumed to be a complete 
practice session. The sequence of daily intensity thereby 
resembles a discrete-time signal sampled at a constant rate 
equal to 1 sample per day. We excluded some students’ data
from the analysis due to low usage (those who had only
one practice session throughout the semester). An overview
of the datasets is given in Table 1.
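The preprocessing described above can be sketched as follows. This is a hedged illustration, not the authors' pipeline: the function names, the (date, question_id) event format, and the example dates are hypothetical, and the sketch counts unique questions attempted per day as a stand-in for "questions solved".

```python
from datetime import date


def daily_intensity(events, semester_start, n_days):
    """Turn (date, question_id) events into a daily intensity sequence.

    Intensity is the number of unique questions seen on each day.
    """
    seen = [set() for _ in range(n_days)]
    for day, question_id in events:
        offset = (day - semester_start).days
        if 0 <= offset < n_days:
            seen[offset].add(question_id)
    return [len(questions) for questions in seen]


def has_enough_usage(intensity):
    """Keep students who practiced on more than one day."""
    return sum(1 for value in intensity if value > 0) > 1


# Hypothetical log: repeated attempts on q1 count once for that day.
events = [
    (date(2018, 1, 8), "q1"),
    (date(2018, 1, 8), "q1"),
    (date(2018, 1, 8), "q2"),
    (date(2018, 1, 10), "q3"),
]
sequence = daily_intensity(events, date(2018, 1, 8), n_days=4)
print(sequence, has_enough_usage(sequence))
```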


In this study, the exam letter grade is used as the students’
learning performance index. The exam letter grade is
A (M ≥ 90), B (80 ≤ M < 90), or C/D/F (M < 80),
where M is the raw average of three exam scores.


5. REPRESENTING DISTRIBUTED PRACTICE BY SWT SIGNALS


There are several parameters required for our model pipeline: 
the wavelet for SWT, the padding scheme, the maximum 
decomposition level, and the penalty of change point detection.
The Haar wavelet is adopted in the SWT algorithm
implementation because it is the simplest form of wavelet [14].
It has the shape of a step function that takes the values 1, 0,
and -1, following the formula


$$\psi(x) = \begin{cases} 1 & \text{if } 0 \le x < 1/2 \\ -1 & \text{if } 1/2 \le x < 1 \\ 0 & \text{otherwise} \end{cases} \qquad (3)$$


This property makes it a good option for detecting edges 
(e.g., sudden signal transitions or changes) [17] in discrete 
signals like the datasets in this study. The implementation 
of SWT used in this study requires the length of the input to be
a multiple of 2^L, where L is the maximum number of levels to
decompose [9]. To meet this requirement, we preprocessed
all input sequences by adding a prefix of zeros. In our
experiment, we found that the SWT signals at L > 3 did not
work, likely because the input sequences were short. Therefore,

Dataset | # of Students | # of Included | Max Length of Sequence (Days) | # of Questions | M (SD) of Intensity
SP18 121 76 (63%) 96 0.26 (0.36) 
FA18 200 67 (34%) 95 0.32 (0.49) 


Table 1: Statistics of the Two Datasets 


Figure 2: The SWT Signals in the SP18 Dataset. From left to right, the SWT signals D1, D2, D3 capture practice sessions in
the period bands [2d, 4d], [4d, 8d], and [8d, 16d], i.e., small-, medium-, and largely-spaced practice, respectively.
From D1 (more detail) to D3 (more context), we can see the focus gradually spreads out
when the level increases. For readability, the plot excludes practice sequences not having any change points in the SWT signals.


we set L = 3 in our experiment. Once the SWT signals were
computed, we applied the change point detection algorithm with
a penalty of 0.5 to search for sudden changes in the SWT
signals [8]. We chose this penalty value by maximizing
the group difference (Section 5.2) and the goodness-of-fit of
the regression (Section 6.2). The experiment program and data
are available online for future work.¹
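The zero-prefix padding mentioned above can be sketched in a few lines. The function name pad_with_zero_prefix is hypothetical, and the sketch only assumes the stated requirement that the input length be a multiple of 2^L.

```python
def pad_with_zero_prefix(sequence, max_level):
    """Prepend zeros so that the length becomes a multiple of 2**max_level."""
    block = 2 ** max_level
    remainder = len(sequence) % block
    n_pad = 0 if remainder == 0 else block - remainder
    return [0] * n_pad + list(sequence)


# With L = 3 the length must be a multiple of 8.
padded = pad_with_zero_prefix([1, 2, 3, 4, 5], max_level=3)
print(padded)
```

Padding at the front rather than the back keeps the end of the semester aligned across students, at the cost of shifting day indices by the number of prepended zeros.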


5.1 Characteristics of SWT Signals 


The SWT algorithm decomposes the input signal by multilevel
filtering. Filtering at level k extracts information in
the frequency band [f/2^k, f/2^{k+1}], where f is the sampling
rate of the input. In our datasets, because the sampling 
rate is 1 sample per day (1 cycle per day), the three de- 
composition levels (Dk where k = 1,2,3) filter the input in 
the frequency bands [1/2, 1/4], [1/4, 1/8], and [1/8, 1/16]. In 
other words, the algorithm filters the input into the period 
(the duration of time of one cycle) bands D1=[2(d)ays, 4d], 
D2=[4d, 8d], and D3=[8d, 16d]. We map these three bands 
to small-spaced, medium-spaced, and largely-spaced prac- 
tice patterns, respectively. Following this interpretation, we 
expect the SWT signals to identify students’ practice ses- 
sions spaced by different periods. For example, D1 can iden- 
tify sessions spaced by 2 to 4 days, which are small-spaced 
practice. 
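The mapping from decomposition level to frequency and period bands is simple arithmetic, sketched below. The function names are hypothetical; the sampling rate of one sample per day follows the text.

```python
from fractions import Fraction


def frequency_band(level, sampling_rate=1):
    """Frequency band (cycles per day) covered by detail level `level`."""
    return (Fraction(sampling_rate, 2 ** level), Fraction(sampling_rate, 2 ** (level + 1)))


def period_band(level, sampling_period_days=1):
    """Period band (days per cycle), the reciprocal of the frequency band."""
    return (2 ** level * sampling_period_days, 2 ** (level + 1) * sampling_period_days)


for k in (1, 2, 3):
    print(f"D{k}: frequency {frequency_band(k)}, period {period_band(k)} days")
```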


To further illustrate this characteristic, Figure 2 demon- 
strates what the algorithm found in the SP18 dataset. The 
visualization shows the SWT signals at the three levels. We 
can see that D1 highlights small-spaced practice sessions. 
The D2 and D3 signals spread their focus and “blur” the se- 
quences not fitting their period bands. Note, there may be 
redundancy in the information captured by different compo- 
nents. For example, an input sequence having meaningful 
change points in D3 can also have some in D1. Overall,
the information about practice sessions at different levels 
provides an insight into how the students distribute their 
practice over time. In our analysis of distributed practice
patterns, we use the number of change points as the feature
to represent the information from the three SWT signals.


¹ https://github.com/rickchung/edm21-msa


For readability, we use the lower bound of the period
band to denote the spaced practice patterns. We call the
practice patterns found in the D1, D2, D3 signals 2SP (2-day
spaced practice), 4SP, and 8SP patterns, respectively. For
reference, the input daily practice sessions are called 1SP. In
the SP18 dataset, the means (M) / standard deviations (SD)
of the three levels are 2SP = 1.83/2.68, 4SP = 1.79/2.63,
and 8SP = 1.75/2.63. In the FA18 dataset, the values are
2SP = 1.12/2.29, 4SP = 1.27/2.14, and 8SP = 1.37/2.30.


5.2 Marginal Relationship of Spaced Patterns with Exam Grades

We analyzed the relationships between the practice patterns
and student grades by the marginal distribution. The
Kruskal-Wallis H-test was applied to test whether the groups had
the same population median (Figure 3). The method was
selected because the sample size was small, and therefore
the sample might not follow the normal distribution. The
results showed that only 2SP was
significant for both datasets (SP18: H=8.89, p=0.01;
FA18: H=7.95, p=0.02). The visualization of the distribution
showed that in SP18 the A students had a higher 2SP
(M = 3.12, SD = 3.11) than C (M = 1.50, SD = 2.32) and
B (M = 0.72, SD = 1.79); in FA18, the B students had a
higher value (M = 2.17, SD = 3.05) than A (M = 0.62, SD
= 1.50) and C (M = 0.41, SD = 1.14).
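For readers who want to see how the H statistic is obtained, the following is a minimal pure-Python sketch of the Kruskal-Wallis test with the standard tie correction. The function names and the toy groups are hypothetical, and the p-value helper is only valid for three groups (two degrees of freedom), where the chi-square survival function reduces to exp(-H/2). In practice a statistics library would normally be used.

```python
import math


def kruskal_wallis_h(*groups):
    """Kruskal-Wallis H statistic with tie correction."""
    pooled = sorted(v for g in groups for v in g)
    n_total = len(pooled)
    rank_of = {}
    tie_term = 0
    i = 0
    while i < n_total:
        j = i
        while j < n_total and pooled[j] == pooled[i]:
            j += 1
        t = j - i  # size of this group of tied values
        rank_of[pooled[i]] = (i + 1 + j) / 2  # average of ranks i+1 .. j
        tie_term += t ** 3 - t
        i = j
    correction = 1 - tie_term / (n_total ** 3 - n_total)
    if correction == 0:
        raise ValueError("all values are identical")
    total = sum(sum(rank_of[v] for v in g) ** 2 / len(g) for g in groups)
    h = 12.0 / (n_total * (n_total + 1)) * total - 3 * (n_total + 1)
    return h / correction


def p_value_three_groups(h):
    """Chi-square approximation for exactly three groups (df = 2)."""
    return math.exp(-h / 2.0)


# Hypothetical counts of 2SP patterns in three grade groups.
h = kruskal_wallis_h([1, 2, 3], [4, 5, 6], [7, 8, 9])
print(h, p_value_three_groups(h))
```

As a consistency check on the reported values, exp(-8.89/2) and exp(-7.95/2) round to 0.01 and 0.02, matching the p-values stated for the three grade groups.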


More spaced patterns were discovered for the B students
than for the A students in FA18, which suggests that other
factors could be involved in the correlation between practice
and even higher exam grades. For example, engineering and
non-engineering students may have or need different practice
strategies adapted to their learning conditions. Despite this
slight difference across the two semesters, if we focus on the
difference between the higher-performing students (A/B)
and the C/D/F ones, the result consistently suggests a positive
correlation between exam grades and small-spaced practice.


[Figure 3 panels: SP18 2SP (H=8.89, p=0.01) and FA18 2SP (H=7.95, p=0.02); x-axis: grade group (A, B, C/D/F); y-axis: N of Patterns]


Figure 3: Marginal Distribution of the SP patterns from the
Three Grade Groups. The Kruskal-Wallis H-test found that
only 2SP was significant in both datasets. The result consistently
showed that higher-performing students (A in SP18 or
B in FA18) had more small-spaced practice than the C/D/F
ones in SP18 and FA18.


5.3 Quantifying the Schedule of Distributed Practice

We have seen how the SP patterns can help identify practice 
spaced by different periods. However, this feature alone does 
not depict the entire picture of distributed practice strate- 
gies. Another key factor in the distributed practice effect is 
the timing of practice. We develop an index to quantify the 
skewness in practice schedules and investigate its correla- 
tion to exam grades. One simple measure of the skewness is 
the lag time. We can use the lag time between the occasion 
of a practice session and a specific event of interest (e.g., 
exam dates, assignment deadlines) to model the schedule 
skewness. Because programming is inherently cumulative and
a later exam covers the content of all the previous exams,
we cannot assert that a practice session only affects
the upcoming exam. Considering this case, we focus on the
time lag since the beginning of the semester. Specifically,
for an SWT signal at level i, D_i, we can compute the lag in
days between the start of the semester and the occurrences
of change points. Then, we can transform a practice sequence
into a sequence of lags {l_1^{D_i}, l_2^{D_i}, ..., l_n^{D_i}}. To
know where on the timeline the student has more practice,
we compute the sample mean, \bar{L}^{D_i}. The number, therefore,
represents how far the schedule is from the beginning
of the semester. We further divide the number by the total
number of days (N_{days}) in the semester for interpretation.
The schedule skewness is defined as


$$SS = \frac{\bar{L}^{D_i}}{N_{days}} \qquad (4)$$


When a student has all his/her practice sessions early in the
semester, SS will be close to zero. If s/he has more practice
sessions after the middle of the semester, SS will be
above 0.5. We can apply the formula to the input signal
(1SS) and the SWT signals (2SS, 4SS, 8SS). The result
indicates the schedule of the different spaced-practice patterns.
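The schedule skewness in Equation (4) amounts to a normalized mean lag, as the following minimal sketch shows. The function name, the choice to return None when there are no change points, and the example lags are assumptions for illustration.

```python
def schedule_skewness(change_point_days, n_days):
    """Mean lag (in days since the semester start) divided by semester length.

    Returns None when there are no change points (an assumption of this sketch).
    """
    if not change_point_days:
        return None
    mean_lag = sum(change_point_days) / len(change_point_days)
    return mean_lag / n_days


# Hypothetical change points on days 10, 20, and 30 of a 100-day semester.
ss = schedule_skewness([10, 20, 30], n_days=100)
print(ss)
```

A value near 0 indicates practice concentrated at the start of the semester, and a value above 0.5 indicates practice concentrated in the second half.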


6. MIXED PRACTICE EFFECTS IN MULTIVARIATE ANALYSIS


A distributed practice strategy is multifaceted. The univari- 
ate analysis is insufficient because it does not consider the 
confounding variables. Two cases remain unclear.


Figure 4: The Standardized Values (Means) and the Interactions
of the Basic and Experiment Features. The plot
groups the features into intensity and timing according to
their functions. The y-axis shows the standardized values (by
(X - M_X)/SD_X in each feature category X) for between-features
comparison. The vertical dashed lines separate the
basic and experiment features. The A students in SP18 have
the highest TotalIntensity. In contrast, in FA18 the B students
have the highest TotalIntensity. Although the basic
features correlate to some extent with the experiment features, we
can find more discriminating differences in the experiment
spaced-practice patterns.


First, the C/D/F students from SP18 do not have better
performance, even though they put as much effort into practice
as the better-performing A students do. Second,
in FA18, the analysis does not explain the practice strategy 
of the A students. They achieve a good grade but do not 
show significantly more SP. These cases suggest that there 
could be other factors in the distributed practice effects. 


Following this idea, we use a multivariate analysis that
includes the experiment SWT features and commonly used basic
features. The basic set comprises the following. For
the practice intensity, we use the total number of questions 
solved (TotalIntensity) and the total number of daily prac- 
tice sessions (1SP). For the practice schedule, we use the 
standard deviation of the daily intensity (SDIntensity) and 
the SS of daily practice sessions (1SS). For reference, we plot 
the standardized values of the features in Figure 4. The fig- 
ure shows that although the basic features correlate with the 
experiment features, we can potentially find more discrim- 
inating difference in the spaced practice patterns between 
the grade groups. 


6.1 Assessing the Marginal Effects by Multinomial Logit Regression
To understand the relationship between multiple features and
exam grades, we use multinomial logistic regression and
investigate the marginal effects of the feature values.
Multinomial logistic regression (MLogit) is a generalized
version of logistic regression for multiclass classification
problems [15]. We can use MLogit when the dependent variable
is nominal (categorical) and has more than
two possible categories. The setup of MLogit is similar to
logistic regression. We assume a linear relationship between
the independent variables (predictors), X, and the
dependent variable (response), Y, and model the probability
of Y ∈ {y_1, ..., y_K} by the logistic function (sigmoid)
and K-1 sets of weights (w_k for the label y_k):


$$P(Y = y_k \mid X = X_1, \ldots, X_n) = \frac{\exp\left(w_{k0} + \sum_{i=1}^{n} w_{ki} X_i\right)}{1 + \sum_{j=1}^{K-1} \exp\left(w_{j0} + \sum_{i=1}^{n} w_{ji} X_i\right)} \qquad (5)$$


We can obtain the prediction by picking the class with
the highest probability. The main advantage of MLogit over
other classification techniques is its interpretability. We can
explain the contribution of individual features to the output
probability (dy/dx), similar to linear regression [15]. In
the analysis, we set Y as the grade groups (A, B, C/D/F)
and examine the marginal effects with respect to X when
the model fits different sets of predictors.
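The class probabilities of an MLogit model with K-1 weight vectors and one reference class can be sketched as below. This is an illustrative sketch only: the function names, the weight layout (intercept first), and the finite-difference marginal effect at a single point are assumptions, and statistical packages typically report marginal effects averaged over observations instead.

```python
import math


def mlogit_probabilities(x, weights):
    """Class probabilities for K classes from K-1 weight vectors.

    Each weight vector is [w_k0, w_k1, ..., w_kn]; the last class is the
    reference class whose score is fixed at zero.
    """
    scores = [w[0] + sum(wi * xi for wi, xi in zip(w[1:], x)) for w in weights]
    exp_scores = [math.exp(s) for s in scores]
    denom = 1.0 + sum(exp_scores)
    probs = [e / denom for e in exp_scores]
    probs.append(1.0 / denom)  # reference class
    return probs


def marginal_effect(x, weights, feature_index, class_index, eps=1e-6):
    """Finite-difference estimate of dP(class)/dx_feature at the point x."""
    bumped = list(x)
    bumped[feature_index] += eps
    before = mlogit_probabilities(x, weights)[class_index]
    after = mlogit_probabilities(bumped, weights)[class_index]
    return (after - before) / eps


# Hypothetical model: one feature, three classes (the third is the reference).
weights = [[0.0, 1.0], [0.0, 0.0]]
x = [math.log(2.0)]
probs = mlogit_probabilities(x, weights)
effect = marginal_effect(x, weights, feature_index=0, class_index=0)
print(probs, effect)
```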


6.2 Comparing Alternative Models 

To understand the capability and limitation of the SWT 
model of distributed practice, we use MLogit to fit various 
baseline and experiment feature sets. Afterward, we benchmark
the quality of these models by the goodness-of-fit. Because
MLogit does not have a standard R^2, we measure
the goodness-of-fit by McFadden’s pseudo R^2 [10],
which uses the formula


$$R^2 = 1 - \frac{\ln L(M_{full})}{\ln L(M_{intercept})} \qquad (6)$$


where L is the estimated likelihood. A small ratio of the
two log-likelihoods (or a large McFadden’s pseudo R^2) suggests
that the full model is better than the intercept model.
We can use this measure to benchmark one model against
another if they fit the same data.
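Equation (6) can be computed directly from the two log-likelihoods, as in the following sketch. The helper names and the toy labels and predicted probabilities are hypothetical; the intercept-only model is taken to predict the observed class frequencies.

```python
import math
from collections import Counter


def log_likelihood(predicted_probs, labels):
    """Sum of log-probabilities assigned to the observed labels."""
    return sum(math.log(p[y]) for p, y in zip(predicted_probs, labels))


def intercept_only_log_likelihood(labels):
    """Log-likelihood of a model that predicts the class frequencies."""
    counts = Counter(labels)
    n = len(labels)
    return sum(c * math.log(c / n) for c in counts.values())


def mcfadden_pseudo_r2(loglik_full, loglik_intercept):
    """McFadden's pseudo R^2 = 1 - ln L(full) / ln L(intercept)."""
    return 1.0 - loglik_full / loglik_intercept


# Hypothetical two-class example with four observations.
labels = [0, 0, 1, 1]
predicted = [[0.9, 0.1], [0.8, 0.2], [0.3, 0.7], [0.4, 0.6]]
r2 = mcfadden_pseudo_r2(log_likelihood(predicted, labels),
                        intercept_only_log_likelihood(labels))
print(r2)
```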


We compared the experiment and alternative baseline models.
The result showed that none of the baseline models
were competitive with even the simplest SWT model (using
only 2SP, 4SP, 8SP). The best baseline model (M_BaseAll)
used all the baseline variables and achieved R^2 = 0.04 in
SP18 and R^2 = 0.07 in FA18. The simplest SWT model
(M_ExpDose) achieved R^2 = 0.07 in SP18 and R^2 = 0.09 in
FA18. The best experiment model (M_ExpAll) used all the
SWT variables and achieved R^2 = 0.12 in both SP18 and
FA18. Using all the baseline and experiment variables, the
ensemble model (M_Ensemble) unsurprisingly outperformed
all the other models and achieved R^2 = 0.13 and R^2 = 0.23
in SP18 and FA18, respectively.


6.3 Marginal Effects in the Regression Models

In SP18, M_ExpDose found that 2SP was a significantly positive
predictor for the A students (dy/dx = 0.07, p = 0.01).
M_ExpAll also found that 2SP was a significant predictor
for the A students (dy/dx = 0.10, p = 0.00). In addition, it
found that 4SS and 8SS were significant for the C/D/F students
(dy/dx = 2.55, p = 0.02; dy/dx = -2.58, p = 0.03). In FA18,
M_ExpDose found that 2SP was a significantly positive predictor
for the B students (dy/dx = 0.12, p = 0.00). M_ExpAll, however,
did not find any significant predictor.


Part of the result is similar to the analysis of the marginal
distribution. In SP18, an increase in small-spaced practice adds
to the likelihood of A. In FA18, the same effect holds for
B. It is worth noting an additional finding in M_Ensemble
from SP18. When we control for the intensity and SS, the
model shows two extra significant predictors for the grade
C/D/F: 4SS and 8SS. The marginal effect suggests that an
increase/decrease in 4SS/8SS adds to/reduces the likelihood
of C. Since an increase in SS means the schedule becomes
later in the semester, these two findings somewhat suggest
the same thing: students who practice early and space the
practice largely are less likely to obtain C/D/F.


It is also worth noting that the ensemble model in FA18 improves
the most over the best experiment model and reaches R^2 =
0.23. When predicting the A students, the model shows that
1SS (dy/dx = 1.01, p = 0.00) and the total intensity (dy/dx
= -0.02, p = 0.04) are significant predictors; when the model
predicts the C students, 1SS is the only significantly negative
predictor (dy/dx = -0.90, p = 0.01). We do not find the
same effect in any of the baseline models. The result
complements a missing part of our analysis of the A and C
students’ practice strategies in FA18. It suggests that an
increase in 1SS adds to the likelihood of A. Conversely, the
same increase reduces the likelihood of C/D/F. In other words,
more early or late practices in the semester may reduce or
improve the probability of C/D/F or A, respectively.


7. CONCLUSIONS 


Students’ practice behavior is challenging to model because 
they can practice anytime and do not necessarily follow a 
unified schedule. This study aims to build such a feature 
model that can help researchers describe the distributed 
practice behavior. We adopted the method of multiresolution
analysis to extract patterns of distributed practice,
focusing on two factors in the distributed practice effect:
intensity and timing. In the experiment, we applied
the MRA model and extracted features that could repre- 
sent practices spaced by different periods, including small 
(2-4 days), medium (4-8 days), and large (8-16 days). These 
three kinds of practice patterns were analyzed to explain 
their correlation to the exam grades. We found that stu- 
dents who practiced early and spaced the practice by the 
small and large periods were more likely to get a higher grade 
than C/D/F. Also, the students having more small-spaced 
practices throughout the semester (i.e., practicing more per- 
sistently) were more likely to get better exam grades. Addi- 
tionally, the MRA model was benchmarked against baseline 
models. The results showed that the MRA model not only
achieved a better goodness-of-fit than the baselines when
working alone but also complemented a baseline model
to achieve better performance.


ABSTRACT


Similar content has tremendous utility in classroom and on- 
line learning environments. For example, similar content 
can be used to combat cheating, track students’ learning over 
time, and model students’ latent knowledge. These different 
use cases for similar content all rely on different notions of
similarity, which makes it difficult to determine whether content
is similar. Crowdsourcing is an effective way to identify similar
content in a variety of situations by providing workers
with guidelines on how to identify similar content for a
particular use case. However, crowdsourced opinions are rarely
homogeneous and therefore must be aggregated into what is
most likely the truth. This work presents the Dynamically
Weighted Majority Vote method, a novel algorithm that
combines aggregating workers’ crowdsourced opinions with
estimating the reliability of each worker. This method was
compared to the traditional majority vote method in both 
a simulation study and an empirical study, in which opin- 
ions on seventh grade mathematics problems’ similarity were 
crowdsourced from middle school math teachers and college 
students. In both the simulation and the empirical study the 
Dynamically Weighted Majority Vote method outperformed 
the traditional majority vote method, suggesting that this 
method should be used instead of majority vote in future 
crowdsourcing endeavors. 


Keywords 
Crowdsourcing, Similarity, Community Detection, Hierar- 
chical Clustering 


1. INTRODUCTION 


Within online learning platforms and intelligent tutoring
systems, there is a tremendous opportunity to utilize knowledge
of content similarity. Similar problems can help prevent
cheating during exams by randomly selecting from multiple
similar problems when students receive the exam, measure
students’ learning gains by spreading out similar problems
between assignments, and measure the effects of
instructional interventions by comparing a student’s scores on
similar problems before and after the intervention. Similar
instructional material can be used to offer students choices 
in which instructional material they receive, which has been 
shown to increase engagement and achievement [7]. While it 
is possible to implement these methods with general knowl- 
edge of content similarity, such as similarity in prerequisite 
knowledge or difficulty, if a more informed definition of con- 
tent similarity is used, the success of these methods is likely 
to grow. 


Although there is a lot of value in knowing what content is 
similar to other content, what content should be considered 
similar is highly dependent on use case. This makes it a 
challenge for content creators to define the similarity in the 
content, as they don’t necessarily know what their content 
will be used for. While some content is obviously similar, 
for example, two mathematics problems that are identical 
except for the numbers used in the problems, in other situ- 
ations it is much more difficult, especially when content is 
being aggregated from multiple sources that may not even 
use the same metrics for prerequisite knowledge or difficulty. 


Crowdsourcing offers a way to derive which content is similar 
to other content for specific use cases. Crowdsourced opin- 
ions on similar content can be gathered each time a new use 
case for similar content arises. By informing the workers, 
whose opinions are being crowdsourced, of the specific use 
case and requirements for similarity, the methods that rely 
on content being similar are more likely to be successful. 
However, crowdsourcing opinions on similar content poses 
some challenges as well. Before an online learning platform 
or intelligent tutoring system uses crowdsourced assertions 
of similarity, steps must be taken to assess the trustworthi- 
ness of workers whose opinions are being crowdsourced and 
ensure the truthfulness of the final assertions of similarity. 


In this work we present a novel algorithm that both measures
the reliability of the workers whose opinions are being
crowdsourced and determines, from these individuals’
opinions, what content is most likely to be similar to other
content. To evaluate this method, we first simulated a wide
range of conditions in which assertions of similarity were
made and compared the performance of our algorithm to
the traditional alternative. We then performed a case study
in which teachers and college students were asked to identify
middle-school mathematics problems that evaluated a similar
skill set. The assertions of similarity collected from the
case study were used to identify groups of similar problems
and measure the reliability of each worker’s assertions.


Ultimately, this work seeks to answer the following three 
research questions: 


1. Can we exploit properties of community detection to 
more accurately form groups of content from crowd- 
sourced opinions? 


2. How does the resulting algorithm perform in a simula- 
tion study compared to the more traditional method? 


3. How does the resulting algorithm perform in a case 
study using workers of various expertise to determine 
which mathematics problems are similar to each other? 


2. BACKGROUND 


2.1 Ensembling Crowdsourced Opinions 
Identifying the truth from crowdsourced opinions is not a 
new problem. Most of the techniques employed to ensure 
the accuracy of crowdsourced opinions rely on ensuring that 
workers have sufficient knowledge of the subject matter. 
This can be done through testing workers before giving them 
tasks, tailoring tasks specific to their skill sets, recruiting 
high quality workers, and educating workers before assign- 
ing them tasks. This can also be done through encourage- 
ment with extrinsic motivators like money, promotions, or 
prizes, or intrinsic motivators like a sense of purpose, or by 
gamifying the crowdsourcing tasks [1]. 


While there are many methods to encourage individuals 
whose opinions are being crowdsourced to be accurate, this 
work is focused on how to validate the quality of individuals’ 
opinions after their task is complete. Current methods for 
accomplishing this place the burden of validation back onto 
the workers. Having workers rank the quality of other workers’
assertions is one method of validation. Another common
method for validation is to have multiple workers perform 
the same task and merge the output of each worker, either 
as an average or as a majority vote [1]. 


There are also more advanced ways of algorithmically val- 
idating crowdsourced opinions. Item response theory and 
latent factor analysis based models have out-performed ma- 
jority voting based validation methods on tasks related to 
identifying facial expressions and answering questions about 
geography [6, 10]. These models also determine the quality 
of individuals whose opinions are being crowdsourced, which 
can be used to refine the pool of individuals used for future 
crowdsourcing tasks [6, 10]. The novel algorithm in this 
work also aggregates crowdsourced opinions while evaluat- 
ing the quality of each worker. 


2.2 Community Detection 

The field of community detection focuses on determining
groups of similar items from a network of connected
items. This has many applications throughout mathematics,
physics, biology, computer science, and the social sciences.
Many things can be represented as a network; for example,
interstellar objects, neurons, city streets, and social media can
all be represented as networks of interconnected items [3].
Finding similar educational content can be framed as a com- 
munity detection problem by representing educational con- 
tent as a network in which items are connected by topic, 
difficulty, language, prerequisite knowledge, or, in the case 
of this work, opinions on similarity. Structuring the task of 
identifying similar educational content as a community de- 
tection problem allows for the use of various well-established 
community detection algorithms, such as hierarchical clustering.
In hierarchical clustering, each item begins in its
own cluster. Then, clusters are merged based on the merge 
strategy and distance between clusters [5]. Hierarchical clus- 
tering was used in both the simulation and empirical study. 


3. METHODOLOGY 


3.1 Dynamically Weighted Majority Vote 

The Dynamically Weighted Majority Vote (DWMV) method 
is our alternative to the traditional majority vote method for 
combining multiple crowdsourced opinions on tasks with bi- 
nary outputs. The DWMV method calculates the weighted 
majority opinion for each task, then determines the weight 
of each worker by how closely their opinion agreed with the 
majority opinion. The closeness of a worker’s opinion to 
the majority opinion can be determined with any function 
for comparing two vectors that results in a value greater 
than or equal to zero, for example, accuracy or the Dice
coefficient [2]. DWMV initializes all workers’ weights to be
equal at the beginning of the algorithm, and iteratively up- 
dates these weights until the weighted majority vote does 
not change between iterations. Once the weighted majority 
vote remains constant from one iteration to the next, the 
weights of the workers can be interpreted as a measure of 
confidence in each worker, and the final weighted majority 
vote can be used downstream in the same way the traditional 
majority vote would have been used. Algorithm 1 formally 
defines the DWMV algorithm. In Algorithm 1, the function
$s(x, y)$ determines the closeness of worker $i$’s opinion,
$(b_{ij}[a_{ij} = 1])_{j=1}^{t}$, to the majority opinion, $(u_j[a_{ij} = 1])_{j=1}^{t}$.
The algorithm requires a matrix $A$ of response indicators, in
which $a_{ij} = 1$ if worker $i$ completed task $j$, and $a_{ij} = 0$
otherwise, and a matrix $B$ of workers’ responses to tasks, in
which $b_{ij}$ contains the binary response of worker $i$ to task $j$.
In Algorithm 1, vector u contains the final weighted majority 
vote for each task, and vector c contains the final measure of 
confidence for each worker, based on the similarity between 
the weighted majority votes and the individual worker’s re- 
sponses. 
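The iteration described above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions rather than the authors' implementation: the names `dwmv` and `accuracy` are hypothetical, accuracy is used as the similarity function s(x, y), ties and unanswered tasks default to 0, and `max_iter` is a safety cap that the description does not mention.

```python
def accuracy(x, y):
    """Fraction of positions where two equal-length binary vectors agree."""
    if not x:
        return 0.0
    return sum(1 for p, q in zip(x, y) if p == q) / len(x)


def dwmv(A, B, s=accuracy, max_iter=100):
    """Sketch of a dynamically weighted majority vote.

    A[i][j] is 1 if worker i answered task j, else 0.
    B[i][j] is worker i's binary answer to task j (ignored when A[i][j] == 0).
    Returns (u, c): the weighted majority vote per task and a confidence
    value per worker.
    """
    w, t = len(A), len(A[0])
    c = [1.0] * w   # start with equal confidence in all workers
    u = [-1] * t    # sentinel, so the first convergence check fails
    for _ in range(max_iter):  # max_iter is an assumption of this sketch
        v = u
        u = []
        for j in range(t):
            yes = sum(c[i] for i in range(w) if A[i][j] and B[i][j] == 1)
            total = sum(c[i] for i in range(w) if A[i][j])
            # Assumption: ties and unanswered tasks resolve to 0.
            u.append(1 if total > 0 and yes > total / 2 else 0)
        if u == v:
            break  # the weighted vote did not change; c already matches u
        for i in range(w):
            answered = [j for j in range(t) if A[i][j]]
            c[i] = s([B[i][j] for j in answered], [u[j] for j in answered])
    return u, c


# Two workers who agree and one who always disagrees.
A = [[1, 1, 1, 1]] * 3
B = [[1, 0, 1, 0], [1, 0, 1, 0], [0, 1, 0, 1]]
votes, confidence = dwmv(A, B)
```

In this toy input the dissenting worker ends with a confidence of 0, so their answers no longer influence the weighted vote.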


3.2 Simulation Study 

To determine if DWMV had a positive impact on forming 
groups from crowdsourced opinion, a simulation study was 
performed to compare the DWMV method to the traditional 
majority vote method in a variety of conditions. Figure 1 
illustrates the simulation process. In the simulation study, 
hierarchical clustering was used to form groups from simu- 
lated workers’ opinions of item similarity aggregated using 
both the majority vote method and the DWMV method. 
Table 1 lists the different initial parameters and their values 
used in the simulation. Five trials of every possible combi- 
nation of the values in Table 1 were simulated for a total of 
37,500 simulation runs.


Algorithm 1 Dynamically Weighted Majority Vote

Require: s(x, y) : function for the similarity of two vectors
Require: w : number of workers
Require: t : number of tasks
Require: A = (a_{ij}) : matrix of response indicators
Require: B = (b_{ij}) : matrix of response values

v ← (0)_{j=1}^{t}     ▷ initialize with values different from u
u ← (−1)_{j=1}^{t}    ▷ initialize with values different from v
c ← (1)_{i=1}^{w}     ▷ start with equal confidence in all workers
while u ≠ v do


Figure 1: A flowchart of the simulation process. DWMV and
majority vote were compared to each other through their use
in community detection through hierarchical clustering.


Table 1: Simulation Parameters and Simulated Values 


Parameter | Values
$i$ | 50, 100, 150, 200
$g$ | 5, 10, 15, 20, 25
$w_{fp}$ | 0.1, 0.2, 0.3, 0.4, 0.5
$w_{fn}$ | 0.1, 0.2, 0.3, 0.4, 0.5
$p$ | 20, 40, 60, 80, 100
$d$ | 0.25, 0.5, 0.75


The simulation began by randomly placing $i$ items into $g$
groups, where $i$ and $g$ are initial parameters of the simulation.
Then the simulation created ten workers. Each worker
had a false positive rate and a false negative rate. These 
values were calculated separately to make the simulation 
more true to real life. In real life, it is not often that a 
worker would have an equal chance of incorrectly asserting 
that two items are or are not similar. The more likely case is 
that some workers think there is more similarity and other 
workers think there is less similarity between items than the 
actual similarity of items. The false positive and false neg- 
ative rates of the workers were sampled separately for each 
worker from a uniform distribution in the range $[0, w_{fp}]$ and
$[0, w_{fn}]$, respectively, where $w_{fp}$ and $w_{fn}$ are initial parameters
of the simulation. Once the items were randomly placed
in groups, and the error rates of the workers were randomly 
determined, a random p percent of all pairs of items were 
given to each worker, where p is an initial parameter of the 
simulation. Each worker then determined whether or not 
the items in each pair they received were similar to each 
other, taking into account their error rates. 
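The worker model described above can be sketched as follows. This is an illustrative sketch only: the function name `simulate_workers` and its arguments are hypothetical, each item is assigned to a group uniformly at random (so groups may be unequal in size or empty), and the random seed is an addition for reproducibility.

```python
import random
from itertools import combinations


def simulate_workers(n_items, g, w_fp, w_fn, p, n_workers=10, seed=0):
    """Simulate noisy pairwise similarity judgments.

    Returns (group, responses), where group[k] is the true group of item k
    and responses[m] maps an item pair (a, b) to worker m's 0/1 answer.
    """
    rng = random.Random(seed)
    group = [rng.randrange(g) for _ in range(n_items)]  # assumption: uniform assignment
    pairs = list(combinations(range(n_items), 2))
    n_seen = round(len(pairs) * p / 100)  # each worker sees p percent of all pairs
    responses = []
    for _ in range(n_workers):
        fp = rng.uniform(0, w_fp)  # this worker's false positive rate
        fn = rng.uniform(0, w_fn)  # this worker's false negative rate
        answers = {}
        for a, b in rng.sample(pairs, n_seen):
            if group[a] == group[b]:
                # Truly similar pair: missed with probability fn.
                answers[(a, b)] = 0 if rng.random() < fn else 1
            else:
                # Truly dissimilar pair: wrongly flagged with probability fp.
                answers[(a, b)] = 1 if rng.random() < fp else 0
        responses.append(answers)
    return group, responses


# With both maximum error rates at zero, every answer matches the true grouping.
group, responses = simulate_workers(n_items=20, g=4, w_fp=0.0, w_fn=0.0, p=50)
```

Sampling the two error rates independently is what lets some simulated workers over-report similarity while others under-report it.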


Once all workers asserted whether or not each item pair they 
were given contained similar items, the majority vote and
DWMV for the similarity of each item pair were calculated.
The majority votes and DWMVs of item similarity were then 
used to form a network of item similarity, where each item 
is connected to every other item it was voted to be similar 
to. The majority vote network and DWMV network were 
both used to form groups through hierarchical clustering 
with Jaccard Index as the distance metric. Jaccard Index 
was used as the distance metric because Jaccard Index does 
not take into account true negatives [8]. Most items are not 
similar to each other, so a metric that takes into account 
true negatives would be over-inflated and not as informative 
in this context. After forming groups from the majority vote 
and DWMV similarity networks, the difference in accuracy, 
precision, and recall between the groups formed from the 
majority vote and DWMV similarity networks were used to 
determine if the DWMV method improved upon traditional
majority vote.
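To make the clustering step concrete, the sketch below computes a Jaccard distance (one minus the Jaccard index) between the neighbor sets of two items and merges clusters bottom-up until the closest pair exceeds a threshold. The linkage strategy is not specified above, so average linkage is an assumption here, as are the function names and the choice to treat every item as similar to itself (a diagonal of ones in the adjacency matrix).

```python
def jaccard_distance(x, y):
    """1 - |intersection| / |union| for two binary vectors; ignores joint zeros."""
    both = sum(1 for p, q in zip(x, y) if p and q)
    either = sum(1 for p, q in zip(x, y) if p or q)
    return 1.0 - both / either if either else 0.0


def cluster(adjacency, threshold):
    """Agglomerative clustering with average linkage, stopped at `threshold`."""
    n = len(adjacency)
    dist = [[jaccard_distance(adjacency[i], adjacency[j]) for j in range(n)]
            for i in range(n)]
    clusters = [[i] for i in range(n)]  # every item starts in its own cluster
    while len(clusters) > 1:
        best = None
        for x in range(len(clusters)):
            for y in range(x + 1, len(clusters)):
                total = sum(dist[i][j] for i in clusters[x] for j in clusters[y])
                d = total / (len(clusters[x]) * len(clusters[y]))
                if best is None or d < best[0]:
                    best = (d, x, y)
        if best[0] > threshold:
            break  # the closest clusters are too far apart to merge
        _, x, y = best
        clusters[x] = clusters[x] + clusters[y]
        del clusters[y]
    return clusters


# Items 0 and 1 were voted similar, as were items 2 and 3.
adjacency = [
    [1, 1, 0, 0],
    [1, 1, 0, 0],
    [0, 0, 1, 1],
    [0, 0, 1, 1],
]
groups = cluster(adjacency, threshold=0.75)
```

Because positions where both vectors are zero never enter the computation, adding more unrelated items to the network does not change the distance between two items, which is the property the text relies on.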


3.3 Empirical Study: Similar Problems

In addition to a simulation, an empirical study was performed
to compare DWMV to majority vote on a real crowdsourcing
task. In this study, middle school mathematics teachers
and college students were given 50 seventh grade mathematics
problems from the Engage New York¹, Illustrative
Mathematics², and Utah Middle School Math Project³ curricula.
Each worker was told to identify problems that evaluate
similar mathematics skills. The workers’ crowdsourced
opinions of similarity were aggregated using both DWMV 
and majority vote, and then grouped using hierarchical clus- 
tering, with Jaccard Index as the distance metric with a 
threshold of 0.75. The resulting groups were then compared 
to a ground truth, provided by ASSISTments, an online 
learning platform [4], in the form of Common Core State 
Standards Mathematics Skill Codes⁴, which each problem
was tagged with. These ground truth skill tags were determined
by trained experts and the designers of the above
stated curricula. The differences in accuracy, precision, and
recall between groups formed with hierarchical clustering
from DWMV and majority vote were again used to evaluate
the quality of the DWMV algorithm.


¹ https://www.engageny.org/
² https://illustrativemathematics.org/
³ http://utahmiddleschoolmath.org/
⁴ http://www.corestandards.org/


4. RESULTS 


4.1 Simulation Study 

To compare the DWMV method to the traditional majority 
vote method, the difference in accuracy, precision, and recall 
as a function of $w_{fp}$, $w_{fn}$, $i$, $g$, and $p$, as described in Section
3.2, were calculated. The first positive takeaway from the 
simulation is that DWMV was almost always more accurate 
than majority vote, regardless of the simulation parameters. 
Only when the simulation had more than twenty groups or 
the maximum false negative rate of workers was 20% or less 
did DWMV not reliably outperform majority vote, but it
did not significantly underperform either. At most, DWMV 
was slightly less accurate than majority vote when workers 
had very low false negative rates. Interestingly this increase 
in performance was not shared by both precision and re- 
call. While recall followed the trend of accuracy and showed 
almost entirely positive improvements from using DWMV 
over majority vote, precision did not. 


Another interesting finding is that all three performance 
metrics increased as both the maximum false negative rate 
and fraction of links seen by workers increased. This implies 
that as workers answer more problems, and become worse 
at correctly identifying when items are similar, the benefit 
of using DWMV over majority vote increases. 


Overall, t-tests [9] showed that using DWMV led to a sta- 
tistically reliable (p < 0.001) 0.18% increase in accuracy, 
a statistically reliable 1.78% (p < 0.001) increase in recall, 
but no statistically reliable (p = 0.28) change in precision. 
While small, these reliable improvements in accuracy and 
recall over the traditional majority vote method are an in- 
dication of the potential positive effects of transitioning to 
using DWMV instead of majority vote when aggregating 
crowdsourced opinion. 


There were also some interesting differences in how different 
types of error affected the weights of workers as determined 
by the DWMV method. Figure 2 shows the average and 
95% confidence interval of the DWMV weights of workers 
as a function of the workers’ false positive and false nega- 
tive rates. The false positive rate of the workers seems to 
decrease their weight in the final weighted majority vote of 
the DWMV method much more quickly than their false neg- 
ative rate. A potential cause of this is that, in the simulated 
groups of similar items, there were far more pairs of items 
that were not similar to each other than there were pairs of 
items that were similar. For example, to have an equal num- 
ber of items that are similar and not similar to each other, 
each item would have to be similar to half the items. The 
only way to facilitate that in the context of this simulation 
would be to have only two equally sized groups of items. In 
the simulation there were always at least five groups, and
up to 25 groups of similar items, which caused most problems
not to be similar to each other. Therefore, when
a worker had a large false positive rate, there were more
opportunities for them to make a mistake compared to a
worker with a large false negative rate. Additionally, the
large number of dissimilar problem pairs compared to the
number of similar problem pairs caused workers with very
low false positive rates to have higher weights than workers
with equally low false negative rates, because workers
with low false positive rates, regardless of their false negative
rates, had far fewer opportunities to make a mistake.
These findings suggest that the distribution of correct responses
in crowdsourcing tasks affects which type of worker
error has a larger impact on workers’ weights in the DWMV
method.


Figure 2: The average and 95% confidence interval of the
DWMV weights of workers as a function of the workers’ false
positive and false negative rates.


4.2 Empirical Study: Similar Problems 

In total, six teachers and four students completed the crowd- 
sourcing task of grouping 50 seventh grade mathematics 
problems. Using each worker’s assertions of similarity, the 
DWMV method and traditional majority vote were used to 
aggregate the opinions of the workers into a final network 
of similarity, which was then used to create groups of simi- 
lar problems using hierarchical clustering. This is the exact 
same process that was used to form groups in the simu- 
lation study. Figure 3 shows the progressive iterations of 
DWMV. Iteration 1 shows the unweighted average of each
worker’s assertions. The DWMV method’s process of iterating
between calculating a weight for each worker and calculating
the weighted majority vote shifted the weighted average
of workers’ assertions toward the ground truth similarity
of problems. This convergence was present in the simulated 
example in Section 3.1 as well. The benefit of the DWMV 
method over traditional majority vote lies in this ability to 
converge towards ground truth. Figure 4 shows the weight 
of each worker as a function of their error rate. The cohort 
of middle school mathematics teachers performed much better
overall than the cohort of college students. The average
accuracy of the teachers was about 97% while the average
accuracy of the college students was only about 81%. Based on
these weights, it is clear that the DWMV method valued the
opinions of middle school mathematics teachers more than
the opinions of college students, which is expected given the
context and task. While, in this scenario, it might have been
easy for a human in the loop to recognize that the teachers’
opinions should be valued more, it will not always be the
case that one group of workers is clearly more qualified than
another group, and thus the DWMV method can help elucidate
which workers are the most reliable.


Figure 3: Progressive iterations of DWMV converging on
empirical data.


Table 2 shows the difference in accuracy, precision, and re- 
call between groups formed through hierarchical clustering 
from the assertions of similarity aggregated using DWMV 
and traditional majority vote. Similar to the simulation re- 
sults, DWMV had the largest positive impact on recall, the 
second largest positive impact on accuracy, but no impact 
on precision. In this empirical study, both the traditional 
majority vote method and the DWMV method led to perfect 
precision, meaning all problems that were placed in groups 
together were similar to each other. However, traditional 
majority vote led to worse recall than DWMV. When traditional
majority vote was used, three of the 50 problems were
not placed in a group with any other problems, which is why 
the recall was so low. However, when DWMV was used, only 
one problem was not placed in a group of similar problems. 
This outlier problem, that neither traditional majority vote 
nor DWMV was able to correctly identify as similar to other 
problems in its group, had the following text: 


22% of 65 is 14.3. What is 22.6% of 65? Round 
your answer to the nearest hundredths (second) 
decimal place. 


Figure 4: DWMV’s confidence in each worker after the
DWMV method converged.


Below are examples of problems in the same group as this
problem, which were all correctly identified as similar to each
other.


Josiah and Tillery have new jobs at YumYum’s 
Ice Cream Parlor. Josiah is Tillery’s manager. In 
their first year, Josiah will be paid $14 per hour, 
and Tillery will be paid $7 per hour. They have 
been told that after every year with the com- 
pany, they will each be given a raise of $2 per 
hour. Is the relationship between Josiah’s pay 
and Tillery’s pay rate proportional? 


To make a punch, Anna adds 8 ounces of apple 
juice for every 4 ounces of orange juice. If she 
uses 32 ounces of apple juice, which proportion 
can she use to find the number of ounces of or- 
ange juice x she should add to make the punch? 


A recent study claimed that in any given month, 
for every 5 text messages a boy sent or received, 
a girl sent or received 7 text messages. Is the re- 
lationship between the number of text messages 
sent or received by boys proportional to the num- 
ber of text messages sent or received by girls? 


Although all these problems are related to ratios and pro- 
portions, the other problems in the group with the outlier 
problem are longer word problems that do not explicitly 
use percentages. The teachers and students whose opin- 
ions were crowdsourced could have missed the connection 
due to the different wording in the problems, or they could 
believe that calculating percentages is a different skill than 
calculating proportions from word problems. Based on the 
differences between this single outlier problem and the other 
problems in its group, it is possible that the outlier problem 
was consciously excluded from its group and not simply an 
oversight. 


Table 2: A comparison of majority vote to DWMV used to
form groups of similar problems from crowdsourced assertions
of similarity.


Metric    | Majority Vote | DWMV  | % Increase
Accuracy  | 0.987         | 0.997 | 1.054
Precision | 1.000         | 1.000 | 0.000
Recall    | 0.903         | 0.977 | 8.228


The impact of using DWMV was larger in this empirical
study than it was in the simulation. In the simulation there
was a larger than average improvement in accuracy and 
recall when the workers had very low false positive rates. 
Given that in this empirical study both sets of groups of 
similar problems had perfect precision, it is likely that the 
workers in this study had very low false positive rates, which 
likely contributed to why the positive impact of using DWMV 
instead of majority vote was larger in this empirical study 
than in the simulation as a whole. The results of this em- 
pirical study suggest that not only can DWMV out-perform 
traditional majority vote in simulations, but can also im- 
prove the recall and accuracy of groups of similar problems 
formed from crowdsourced opinions on content similarity in 
real-life scenarios as well. 


5. CONCLUSION 


Within online learning platforms and intelligent tutors, there 
is tremendous utility to knowing what content is similar to 
other content within the platform, but each application of 
similar content is likely to have different criteria for what 
is considered similar. Crowdsourcing opinions on the sim- 
ilarity of content is an accessible way for new applications 
to recognize similar content. However, crowdsourcing poses 
some difficulties, namely, how to identify reliable workers 
and properly aggregate opinions from multiple workers. This 
work has demonstrated the ability of the Dynamically Weighted
Majority Vote method, a novel algorithm for aggregating
crowdsourced opinion while rating workers, to accomplish
those goals. DWMV has been shown, in both a simulation
study and an empirical study, to lead to higher accuracy
and recall than the traditional majority vote method
on crowdsourcing tasks related to identifying similar content.
In the simulation study, using DWMV before identifying
groups of similar items through hierarchical clustering
resulted in a statistically significant 0.18% increase in accuracy
and a 1.78% increase in recall over using majority vote.
The simulation study also revealed how the distribution of
correct responses in the crowdsourcing tasks affects how the
false positive and false negative rates of workers affect their
weight in the DWMV method. In the empirical study, using
DWMV before identifying groups of similar problems
through hierarchical clustering resulted in about a 1% in- 
crease in accuracy and an 8% increase in recall over using 
majority vote, and provided perspective on the differences 
in accuracy between the expert middle school math teach- 
ers and the novice college students. Moving forward, when 
faced with the need to aggregate crowdsourced opinions, the 
learning science community can look to the DWMV method 
as an alternative to the traditional majority vote method. 
The DWMV method is a promising tool for increasing the 
reliability of crowdsourced opinion and, when paired with 
hierarchical clustering, identifying groups of similar content.


ABSTRACT


Measuring similarity of educational items has several ap- 
plications in the development of adaptive learning systems, 
and previous research has already proposed a wide range of 
similarity measures. In this work, we provide an experimen- 
tal evaluation of selected similarity measures using a large 
dataset. The used items are alternate-choice questions for 
the practice of English grammar for second language learn- 
ers; the dataset contains thousands of items and over 10 
million student answers. Our results provide warnings about 
the generalizability of results presented in EDM works: 1) 
the results vary significantly between knowledge components 
and 2) the size of available data is an important factor. 


Keywords 


item similarity, evaluation, generalizability 


1. INTRODUCTION 


Learning environments often contain thousands of educa- 
tional items (questions, problems). A useful data mining 
contribution is to quantify the pairwise similarity of these 
items [9]. Such similarity measures have many applications. 
They are particularly useful for the management of the content,
e.g., adding and deleting new items, preparing and
revising explanations and hints, or deciding when to split 
knowledge components. Similarity measures can also be 
used in algorithms that guide the presentation of the con- 
tent, e.g., in the presentation of error explanations, it may be 
useful to group similar items together; in sequencing items, 
we may want to avoid giving students two very similar questions
in close succession. Item similarities may also be used
for student modeling [6, 12].


Item similarity can be computed in many ways [9]; the two
basic approaches are to use the item content data (e.g., the
text of the question) and student performance data (e.g., the
correctness of answers and response times). The content-based
measures are, to a large degree, dependent on the
specific type of data. The techniques based on student performance
data are content-agnostic and widely applicable;
the disadvantage is that they require a (potentially large)
amount of student data. Previous research has proposed several
specific measures.


In this work, we focus on the evaluation of previously proposed
measures on a large and interesting dataset. The items used
are alternate-choice questions for the practice of English
grammar for second language learners (see examples in
Table 1). The dataset contains thousands of items, which are
categorized into knowledge components and difficulty levels. 
The items are alternate-choice questions, i.e., they consist of 
a stem, correct answer, and a single distractor. Items also 
have explanations, which are written in the Czech language. 
The dataset contains approximately 10 million student an- 
swers. 


For this dataset, we evaluate various similarity measures and 
explore their relations. We focus particularly on the relation 
between performance-based measures and measures based 
on the text of explanations. We explore the issue of the suffi- 
cient size of data on student performance. In EDM research, 
this issue is often neglected; the performance of techniques is 
often studied using a fixed dataset (“all available data”). Our 
experiment shows that the studied methods are quite data- 
hungry; they require thousands of answers per item and the 
amount of available data seems to be more important than 
differences caused by choice of a measure (which is a type
of result common with other machine learning applications
[4]). Experiments also show large differences in results
between different knowledge components, even though all 
of these knowledge components come from a single domain 
(English grammar) and all the used items are of the same, 
simple format (alternate-choice questions). This result pro- 
vides a warning about the generalizability of research results 
in educational data mining. 


2. EXPERIMENTAL SETTING 


In this section, we describe the data we used for experiments 
and the specific similarity measures. 


2.1 Data 


For the evaluation, we use data from the adaptive learning 
system Umime anglicky, umimeanglicky.cz. The system 
contains various exercises for English grammar and vocabulary
learning for second language learners (for Czech native
speakers). We use only one type of exercise: alternate-choice
questions in fill-in-the-blank form with two options
(the correct answer and a distractor). The number of options
is not crucial and our analyses could also be applied
to questions with multiple distractors. The questions have
explanations (in the Czech language).


Table 1: Examples of items from the knowledge component Present simple vs. present continuous. For the sake of
readability, explanations are given here in English; in the used data, they are in the Czech language.


item stem | correct | distractor | explanation
I _ to the gym once a week. | go | am going | When talking about periodical events, we use present simple tense.
I _ the film that we’re watching. | hate | am hating | The verb to hate is not used in continuous form. We use present simple form instead.
I can’t hear you! Everybody _ so loudly. | is talking | talks | When the activity is still in progress, we use present continuous tense.


The questions are divided into item sets. Each item set
contains questions of similar difficulty from a single knowledge
component. The system uses three difficulty levels. An
example of an item set is Present simple vs. present continuous,
medium difficulty, for which examples of questions are
provided in Table 1.


Our dataset consists of 54 knowledge components divided
into 68 item sets that in total contain 4348 items. Some
item sets share the same knowledge component, and they
only differ in the difficulty of items. Concerning student
performance, we use the answer (correct or incorrect) and
response time (measured in milliseconds). We have 9 752 957
answers from 151 904 students.


Since details of data collection can often have a nontrivial
impact on the results of the evaluation, we provide a
basic description of the core aspects of system behavior that
influence the collected data:


• In the system, students answer a sequence of items
from a single item set in random order.


• The system uses mastery learning on the level of item
sets. Students are motivated to answer a sufficient
number of items correctly to satisfy the mastery criterion.


• The choice of an item set that a student solves can
be done in a variety of ways: student free choice, assignment
by a teacher (homework, assignment within
a class), or recommendation by the system (based on
past activity).


• The item sets differ widely in their difficulty. The
samples of solvers may differ significantly for individual
item sets (e.g., Second conditional, hard is solved
by more advanced students than Present simple tense,
easy).


• Items may move between difficulty levels (“design level
adaptivity”). This aspect may be important for
some measures.


2.2 Similarity Measures 
In our experiments, we use similarity measures that are variations
on previously studied measures [9].


2.2.1 Measures Based on Item Content 

One type of measure utilizes the available data about items.
One possibility is to utilize item statements, e.g., to measure 
the similarity of item texts or match on options (the correct 
answer and distractor). In the case of grammar learning, 
this approach is hard to use: two questions that practice 
the same grammar rule can have completely different texts, 
answers, and distractors. We have performed preliminary 
experiments with various measures based on item text; these 
experiments showed very weak results. Therefore, we do not 
discuss these measures in more detail. 


More applicable content data are the explanations. In the
used dataset, each item has an associated explanation shown 
as feedback to students (particularly when they make a mis- 
take). To quantify similarity based on explanations, we com- 
pute the text similarity of the explanations. To do so, we 
considered two common methods: Levenshtein edit distance 
and Jaccard index. 


Both methods compute the pairwise similarity of two expla- 
nations. Levenshtein edit distance operates at the character 
level and computes the minimal number of edits (character 
addition, removal, and substitution) required to transform 
one explanation into another explanation. Jaccard index 
only compares sets of words appearing in the two explana- 
tions regardless of their position. It is defined as 


$$\frac{|E_1 \cap E_2|}{|E_1 \cup E_2|}$$


where $E_1$ is a set of words in one explanation and $E_2$ is a
set of words in another explanation.
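Both text measures are short enough to write out directly. The sketch below is illustrative: the function names are hypothetical, and the tokenization for the Jaccard index (lowercasing and splitting on whitespace) is an assumption, since the preprocessing of the Czech explanations is not described here.

```python
def levenshtein(s, t):
    """Minimal number of character insertions, deletions, and substitutions."""
    prev = list(range(len(t) + 1))
    for i, cs in enumerate(s, start=1):
        cur = [i]
        for j, ct in enumerate(t, start=1):
            cur.append(min(
                prev[j] + 1,               # delete a character from s
                cur[j - 1] + 1,            # insert a character into s
                prev[j - 1] + (cs != ct),  # substitute, free if characters match
            ))
        prev = cur
    return prev[-1]


def jaccard_index(e1, e2):
    """Size of the shared word set divided by the size of the combined word set."""
    w1, w2 = set(e1.lower().split()), set(e2.lower().split())
    union = w1 | w2
    return len(w1 & w2) / len(union) if union else 1.0


first = "we use present simple tense"
second = "we use present continuous tense"
edit_distance = levenshtein(first, second)
word_overlap = jaccard_index(first, second)
```

Note that the two measures point in opposite directions: a small edit distance and a large Jaccard index both indicate similar explanations.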


2.2.2 Measures Based on Student Performance 

For computing similarity based on student performance, we 
consider two basic aspects: the correctness of answers and 
response times. These aspects are easy to collect and rele- 
vant for a vast range of items. In our experiments, we use 
similarity measures based on either of the two types of data 
and their combination. 


Answer Correctness. The correctness of a student’s answer
is a simple binary indication of whether the student
has answered an item correctly (selected the correct option
in our case). Similarity measures based on the answer correctness
then measure “agreement” between answers given
by the same students to different items. This is best illustrated
on an agreement matrix for two items i and j. There
are only four possible ways a student can make binary responses
to two items, as illustrated in Table 2. Similarity
measures then differ in how exactly they compute the agreement
from the individual components of the matrix. In our
experiments, we use Pearson correlation coefficient ($S_p$), Cohen’s
Kappa ($S_\kappa$), and Kappa Learning [7] ($S_{KL}$):

$$S_p = \frac{ad - bc}{\sqrt{(a + b)(a + c)(b + d)(c + d)}}$$

$$S_\kappa = \frac{P_o - P_e}{1 - P_e}, \qquad P_o = \frac{a + d}{n}, \qquad P_e = \frac{(a + b)(a + c) + (b + d)(c + d)}{n^2}$$

$$S_{KL} = \frac{ad - bc}{(a + c)(c + d)}$$


Table 2: Agreement matrix for items i and j. Values a, b, c,
and d are numbers of students that answered both items in
a particular way. For example, c is the number of students that
answered item i correctly but answered item j incorrectly.


n = a + b + c + d | item i correct | item i incorrect
item j correct    | a              | b
item j incorrect  | c              | d
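As a worked illustration, the Pearson coefficient and Cohen's Kappa can be computed directly from the four agreement counts. The sketch uses the standard 2×2 definitions of both measures; the function names and the zero fallbacks for degenerate tables are assumptions. Kappa Learning is left out because its definition should be taken from [7].

```python
from math import sqrt


def pearson_from_counts(a, b, c, d):
    """Pearson (phi) coefficient of two binary items from their agreement counts."""
    denom = sqrt((a + b) * (a + c) * (b + d) * (c + d))
    return (a * d - b * c) / denom if denom else 0.0  # assumption: 0 when undefined


def cohen_kappa_from_counts(a, b, c, d):
    """Cohen's Kappa: observed agreement corrected for chance agreement."""
    n = a + b + c + d
    p_o = (a + d) / n                                    # observed agreement
    p_e = ((a + b) * (a + c) + (b + d) * (c + d)) / n ** 2  # agreement expected by chance
    return (p_o - p_e) / (1 - p_e) if p_e != 1 else 0.0  # assumption: 0 when undefined


# 40 students got both items right, 40 got both wrong, 10 + 10 split.
phi = pearson_from_counts(40, 10, 10, 40)
kappa = cohen_kappa_from_counts(40, 10, 10, 40)
```

For this balanced table both measures come out at about 0.6; they diverge when the two items differ in how often they are answered correctly.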


Answer correctness measures can be extended by including a
“second step” [9], i.e., computing similarity of similarities. In
the first step, binary vectors of student answers for two items
are compared to obtain the two items’ similarity. The result
is a similarity matrix with real-valued elements $s_{i,j}$ equal to
the similarity of items $i$ and $j$. The second step compares
real-valued vectors $s_{i,*}$ and $s_{j,*}$ to obtain similarities of items
$i$ and $j$. In our experiments, we use Pearson-Pearson, which
is a Pearson correlation coefficient used in both the first and
second step.
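The second step can be sketched as correlating the rows of the first-step similarity matrix. The helper names are hypothetical, and whether the diagonal entries take part in the row comparison is a design choice that is not settled here; this sketch simply uses complete rows.

```python
from math import sqrt


def pearson(x, y):
    """Pearson correlation of two equal-length real-valued vectors."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((p - mx) * (q - my) for p, q in zip(x, y))
    vx = sum((p - mx) ** 2 for p in x)
    vy = sum((q - my) ** 2 for q in y)
    return cov / sqrt(vx * vy) if vx and vy else 0.0  # assumption: 0 for constant rows


def second_step(S):
    """Similarity of similarities: correlate every pair of rows of S."""
    n = len(S)
    return [[pearson(S[i], S[j]) for j in range(n)] for i in range(n)]


# First-step similarities for three items; items 0 and 1 behave alike.
S = [
    [1.0, 0.8, 0.1],
    [0.8, 1.0, 0.2],
    [0.1, 0.2, 1.0],
]
T = second_step(S)
```

Two items end up similar in the second step when they relate to all the other items in the same way, even if their direct first-step similarity is noisy.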


Response Time. Response time is measured as the time it 
takes a student to answer the item (read the item statement 
and click on one of the options in our case). Student re- 
sponse times can vary due to external distractions during 
answering or even technical reasons like unreliable internet 
connection. To make the measure more robust, we opted 
to bin each item’s response times into percentiles. The similarity
of two items i and j is then measured as the Pearson
correlation coefficient of the student response time percentile
vectors for items i and j.
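One plausible reading of the percentile binning is a percentile rank: each response time is replaced by the percentage of responses to the same item that were at least as fast. The sketch below follows that reading, which is an assumption, as is the function name.

```python
from bisect import bisect_right


def percentile_ranks(times):
    """Replace each response time by the percentage of times that are <= it."""
    ordered = sorted(times)
    n = len(times)
    return [100.0 * bisect_right(ordered, t) / n for t in times]


# Response times in milliseconds for one item; the last one is an outlier,
# e.g., a student who was interrupted while answering.
ranks = percentile_ranks([1000, 3000, 2000, 60000])
```

The outlier is mapped to 100 regardless of how extreme it is, which is what makes the resulting vectors less sensitive to distractions and connection problems than raw times.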


Combined. Both correctness and response time can be combined
to extract more bits of information. There are multiple ways
to combine correctness and response time into a single score
[9]. In our experiments, we use a linear time transformation
for correct answers as the combined score, defined as
$r = c \cdot \max(1 - t / (2\tau), 0)$, where $c \in \{0, 1\}$
is correctness, $t \in \mathbb{R}^+$ is response time, and
$\tau$ is the median response time for a given item. The
similarity of items i and j is then the Pearson correlation
coefficient of the score vectors for items i and j.
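
The combined score can be written down directly from its definition. The following sketch uses hypothetical function names and toy values. It shows that a correct answer given instantly scores 1, a correct answer at the median time scores 0.5, and any answer that is incorrect or takes at least twice the median time scores 0.

```python
from statistics import median

def combined_score(correct, time, median_time):
    """Linear time transformation: r = c * max(1 - t / (2 * tau), 0)."""
    return correct * max(1.0 - time / (2.0 * median_time), 0.0)

def item_scores(correct, times):
    """Scores of all answers to one item, using the item's median time as tau."""
    tau = median(times)
    return [combined_score(c, t, tau) for c, t in zip(correct, times)]

# Three answers to one item (hypothetical data): correctness and seconds.
scores = item_scores([1, 0, 1], [4.0, 6.0, 8.0])
print(scores)
```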


Table 3: Overview of all item similarity measures used in this
study.

name                       measure type  data used
Levenshtein edit distance  content       explanations
Jaccard index              content       explanations
Pearson corr. coef.        performance   correctness
Cohen’s Kappa              performance   correctness
Kappa Learning             performance   correctness
Pearson-Pearson            performance   correctness
Response time percentile   performance   response time
Response time score        performance   correctness + response time


3. RESULTS 


In this section, we present our findings. We use the
explanations as “ground truth” for item similarity. The
reasoning is that an explanation describes the aspect of the
knowledge component that the item practices, and similar
aspects are described in a similar way (e.g., the same tense
or conditional). This approach has its limitations, and it
depends heavily on the quality of the explanations. Not all
explanations are necessarily ideal (different granularity
between knowledge components, human errors), but they are a
reasonable proxy.


For intuition behind the performed evaluation, Figure 1
provides an illustration using two knowledge components. The
figure shows a PCA projection of items onto a plane based on
the Pearson similarity measure, which uses only the
correctness of answers. The color of points is based on the
explanations provided in the system. As we can see, these two
approaches to measuring item similarity agree to a large
degree: points with the same color (similar with respect to
explanations) are close to each other (similar with respect
to performance). We now explore these relations in a more
quantitative manner.


3.1 Relations Among Measures 

Table 3 provides an overview of the measures introduced above.
Other measures can be defined in a similar fashion. An obvious
question is whether they differ in any significant way or
measure the same thing. To explore relations among measures,
we first look at how much they are correlated. The correlation
of two measures is computed as the Pearson correlation
coefficient of their item similarity matrices, each produced
by applying an item similarity measure to all pairs of items.
A high correlation of two measures means that they generally
agree on which pairs of items are similar.
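
The correlation of two measures can be sketched as follows. The example assumes that each unordered pair of distinct items is counted once, i.e. only the upper triangle of each similarity matrix is used. Whether the diagonal and both triangles enter the computation is not stated in the text, so this is an assumption, as are the toy matrices.

```python
import math

def pearson(u, v):
    """Pearson correlation of two equal-length numeric lists (0.0 if constant)."""
    n = len(u)
    mean_u, mean_v = sum(u) / n, sum(v) / n
    cov = sum((a - mean_u) * (b - mean_v) for a, b in zip(u, v))
    var_u = sum((a - mean_u) ** 2 for a in u)
    var_v = sum((b - mean_v) ** 2 for b in v)
    if var_u == 0 or var_v == 0:
        return 0.0
    return cov / math.sqrt(var_u * var_v)

def upper_triangle(matrix):
    """Similarities of all unordered pairs of distinct items."""
    n = len(matrix)
    return [matrix[i][j] for i in range(n) for j in range(i + 1, n)]

def measure_correlation(sim_a, sim_b):
    """Agreement of two similarity measures over all item pairs."""
    return pearson(upper_triangle(sim_a), upper_triangle(sim_b))

# Similarity matrices of three items under two measures (hypothetical data).
sim_a = [[1.0, 0.8, 0.1], [0.8, 1.0, 0.2], [0.1, 0.2, 1.0]]
sim_b = [[1.0, 0.7, 0.2], [0.7, 1.0, 0.3], [0.2, 0.3, 1.0]]
print(measure_correlation(sim_a, sim_b))
```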


Figure 2 shows correlations among measures based on
performance and explanation averaged across all item sets.
Both explanation-based item similarity measures are strongly
correlated, and they also have comparable correlations with
all performance-based measures. Therefore, it is not important
which one we choose as the ground truth for later experiments.
This result is not surprising as both measures quantify text
similarity, albeit in a different way.


Figure 1 panels: First conditional; Past simple tense (regular
verbs).

Figure 1: PCA projections based on measures using performance
data (Pearson correlation). Points with the same color share
the same explanations.


Answer correctness measures Cohen’s Kappa and Pearson behave
almost identically, and their correlations across item sets
are 0.96 or higher. The Kappa Learning measure also behaves
similarly and has high correlations with both measures,
dropping below 0.75 for only one item set. When compared to
explanation-based measures, all three measures achieve the
same result. In most cases, it is not important which of the
three we choose, and the amount of available data is a much
more important factor (more details in Section 3.2). This
result is in contrast to previous research [7], which argued
that the Kappa Learning measure brings an important
improvement.


Figure 2 rows and columns: Kappa Learning, Cohen’s Kappa,
Pearson corr. coef., Pearson-Pearson, Response time score,
Response time percentile, Levenshtein edit distance, Jaccard
index.

Figure 2: Heatmap of correlation among measures averaged
across 68 item sets.


The second step similarity Pearson-Pearson has mostly the 
same or worse correlation with explanation-based measures 
compared to the previous three measures. It is related to 
Pearson and Cohen’s Kappa, with correlation ranging from 
0.3 to 0.8 for most item sets. The correlation with explanation- 
based measures is weaker compared to other measures using 
correctness. Thus, for our dataset, the second step does
not seem useful. This observation is in contrast to previous
research in another context [17].


The measures with response time do not provide any tan- 
gible benefits. When compared to explanation-based mea- 
sures, they achieve either similar correlations in case of Re- 
sponse time score or very poor and mostly zero correlation 
in case of Response time percentile. A combination of an- 
swer correctness and response time in Response time score 
results in the best correlation for some item sets, but it is 
not significantly different on average. These results suggest 
that answer correctness might be a better indication of item 
similarity for our dataset. 


3.2 Size of Data 


Item similarity measures based on student performance are 
based on statistics of student performance data. All statis- 
tics need at least some amount of data to become stable and 
to start approximating the true statistical feature of the un- 
derlying data generating process. The question is then, how 
much data, i.e., answers per item, is required to obtain a 
good stable approximation? 
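
One way to approach this question empirically is to subsample the answers of each item and recompute the measures at different sample sizes. The sketch below shows only the subsampling step. The data layout (a mapping from item to a list of student and correctness pairs), the function name, and the sample sizes are assumptions made for illustration.

```python
import random

def subsample_answers(answers_by_item, n_per_item, seed=0):
    """Keep at most n_per_item randomly chosen answers for every item."""
    rng = random.Random(seed)
    return {
        item: rng.sample(answers, min(n_per_item, len(answers)))
        for item, answers in answers_by_item.items()
    }

# Toy log: item id -> list of (student id, correct) pairs (hypothetical data).
log = {
    "item_1": [(s, s % 2) for s in range(3000)],
    "item_2": [(s, int(s % 3 == 0)) for s in range(800)],
}

for n in (250, 500, 1000):
    sub = subsample_answers(log, n)
    print(n, {item: len(answers) for item, answers in sub.items()})
```

After subsampling, a similarity measure would be recomputed on the reduced data and compared with the reference measure. Repeating this with several seeds gives an impression of the noise at each sample size.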


In Figure 3, we have visualized the stability of
performance-based measures in terms of correlation with the
explanation-based measure. To simulate different numbers of
answers, we started with knowledge components with a
sufficient amount of data and randomly subsampled each item’s
answers. We report correlation with only one explanation-based
measure, the Jaccard index, as it is highly correlated with
Levenshtein edit distance and has higher mean correlations
with performance-based measures.


Figure 3 panels: Past continuous tense; Present perfect tense;
First conditional; Past simple tense (regular verbs); Past
simple vs. past continuous; Past simple vs. present perfect.
x-axis: Number of answers per item. y-axis: Measures
correlation. Legend (performance-based measures): Cohen’s
Kappa, Kappa Learning, Pearson corr. coef., Pearson-Pearson,
Response time percentile, Response time score.

Figure 3: Correlation between performance-based measures and
Jaccard index with an increasing number of answers per item
across multiple knowledge components. Note that y-axis ranges
differ between plots.


Figure 3 shows that performance-based measures are
data-hungry. There are nontrivial differences in correlations
until 2000 answers per item, and some improvement can be
observed even for more data. The general shape of the curves
is mostly similar across multiple knowledge components and
final achieved correlations. There are a few changes in the
relative ordering of measures, but these could be partly
attributed to random noise for low data quantities. Different
answer correctness measures have similar correlations
regardless of the data available. Response time score measures
utilize more information from the data, and thus we expected
them to converge faster. This, however, does not happen.


3.3 Differences among Knowledge Components


There are significant differences in the best achieved
correlations among knowledge components. The best correlation
achieved between any performance-based measure and
explanation-based measure for a given knowledge component
ranges from 0.06 to 0.67. Even if we filter out item sets with
fewer than 2000 answers per item, the best correlations
achieved are still between 0.25 and 0.67. Moreover, the
ordering of performance-based measures in terms of achieved
correlation with explanation measures differs between
knowledge components. For example, Response time score with
Levenshtein edit distance has the best correlation 0.61 for
Present simple tense, but the same pair has the worst
correlation 0.06 for Passive voice. Therefore, the choice of
knowledge component is more significant than the choice of
similarity measures.


There is a multitude of factors causing these differences.
We have identified some of these factors and give examples
of their effect on correlations. The identified factors are
features of the knowledge component, differences in student
populations, and biases in data caused by the addition of
content to the system.


Features of knowledge components describe how students use
the knowledge component to answer an item. One such feature
is how much the component is rule-based. There are more
factual components, e.g., Past simple tense of irregular
verbs, and more rule-based components, e.g., Past simple
tense of regular verbs. In our data, more rule-based
components achieve higher correlations on average. For
example, Past simple tense of regular verbs achieved a
correlation of 0.63 while Past simple tense of irregular
verbs achieved only a correlation of 0.32.


The difference in student populations is especially important
in systems that target a wider audience. The audience of item
sets in our dataset ranges from grades 4 to 10, and thus the
student population solving each item set differs. Simpler
item sets for grades 4 to 7 achieve a better correlation of
performance and explanation-based measures, while more
advanced item sets for grades 8 to 10 achieve lower
correlations.


Our dataset comes from a system that continuously evolves
and has its content modified. These modifications also
include the addition of new items among existing items. This
poses a challenge for measuring similarity from performance
data. Groups of items with varying amounts of collected data
can make recently added items artificially different from the
rest. For example, item set Past tense: questions and negative
has 63 items with around 1700 answers per item and 20 newly
added items with only around 800 answers per item. The best
correlation between performance- and explanation-based
measures rises from 0.3 to 0.36 when we filter out newly
added items.


4. DISCUSSION


In this work, we have evaluated previously proposed measures
for quantifying educational items’ similarity based on
students’ performance. We have used a large dataset from a
widely used learning system. The results provide important
warnings for both practitioners and researchers.


Many educational data mining techniques require a large
amount of data for good performance. However, research papers
often do not indicate how much data is enough. Our results
show that performance-based
measures are data-hungry and may require upwards of 2000 
answers per item before converging. Results reported on 
smaller datasets thus may be misleading in some aspects. 
Note that even a large university class would mean only 
around 200 answers per item which is still an order of mag- 
nitude smaller than the required 2000. 


Another understudied issue is the generalizability of results 
across knowledge components. Our dataset is in many as- 
pects very homogeneous: we consider only alternate-choice 
questions for English grammar. Nevertheless, there are non- 
trivial differences between the knowledge components (rule- 
based vs. fact-based, simple vs. advanced), and we have 
observed significant differences in results depending on the 
choice of a knowledge component. This observation raises a 
question of the generalizability of results reported on just a 
few knowledge components.


ABSTRACT


Many modern anatomy curricula teach histology using vir- 
tual microscopes, where students inspect tissue slices in a 
computer program (e.g. a web browser). However, the edu- 
cational data mining (EDM) potential of these virtual micro- 
scopes remains under-utilized. In this paper, we use EDM 
techniques to investigate three research questions on a vir- 
tual microscope dataset of N = 1,460 students. First, which 
factors predict the success of students locating structures in 
a virtual microscope? We answer this question with a gener- 
alized item response theory model (with 77% test accuracy 
and 0.82 test AUC in 10-fold cross-validation) and find that 
task difficulty is the most predictive parameter, whereas stu- 
dent ability is less predictive, prior success on the same task 
and exposure to an explanatory slide are moderately pre- 
dictive, and task duration as well as prior mistakes are not 
predictive. Second, what are typical locations of student 
mistakes? And third, what are possible misconceptions ex- 
plaining these locations? A clustering analysis revealed that 
student mistakes for a difficult task are mostly located in 
plausible positions (‘near misses’), whereas mistakes in an
easy task are more indicative of deeper misconceptions.


1. INTRODUCTION


Histology is a core subject that all medicine students have to 
pass in their studies. An important part of classic histology 
training is the microscopy course where students examine a 
large number of slides of human or animal tissue with an op- 
tical microscope in order to identify cellular structures with 
the aim of establishing structure-function relationships [21]. 
In recent years, more and more virtual microscopes (VMs) 
have been developed and integrated into teaching [21]. Such 
VMs reduce the need for resources (students only require a
computer and software), offer the opportunity to annotate
slides with teacher notes, and enhance the student learn- 
ing experience [21]. Prior work has provided numerous case 
studies of VMs being successfully integrated into anatomy 
education around the globe, e.g. [5, 6, 10, 13, 21, 22]. More- 
over, several evaluation studies have shown that students 
using VMs perform at least as well as students using optical 
microscopes [11, 15]. 


To the best of our knowledge, no study to date has con- 
sidered the educational data mining potential of VMs. For 
example, VMs enable us to record which slides students have 
seen, which areas on the slides they have focused on, etc. In 
this work, we consider the MyMi.mobile VM that is used 
in anatomy courses at two German universities [10]. In this 
VM, students can view a slide with expert annotations (ex- 
ploration), and they can test their knowledge by either locat- 
ing a structure in a slide (structure search; refer to Figure 1), 
or identifying the tissue sample and staining (diagnosis).


We analyze the performance of N = 1,460 students in structure
search tasks with respect to three research questions:


RQ1: Which features predict student success?
RQ2: What are typical locations of student mistakes?
RQ3: What are possible misconceptions explaining these
locations?


To answer RQ1, we analyzed the collected learning data with 
a generalized item response theory [2, 12] model, which con- 
sists of a difficulty parameter for each task, an ability pa- 
rameter for each student, and four weights for additional 
features (see section 3.2). To answer RQ2 and RQ3, we 
employed a Gaussian mixture model [7] on the locations of 
mistakes and interpreted the resulting clusters with the help 
of domain experts. Our results can contribute to enhanced 
teaching quality in VM courses as well as establish inter- 
pretable models to analyze data from such courses. 


In the remainder of this paper, we cover related work, our 
experimental setup, the results, and a conclusion. 


2. RELATED WORK 


Prior work on machine learning on virtual microscope data 
has focused on applications outside education. For example, 
major prior work has been done in training convolutional 
neural networks to solve classification tasks on microscope 
images such as detecting fluorescence on images [4]. Suc- 
cessful applications can assist anatomy experts in predicting 
carcinogens in human cells [23]. Due to the high accuracy 
of these models [1], they are helpful in cancer diagnostics. 


Related to education, prior work on virtual microscopes can
be roughly divided into two categories. First, there are
case studies describing how virtual microscopes were inte- 
grated into anatomy curricula and the requirements for suc- 
cessful integration, e.g. [5, 6, 10, 13, 21, 22]. Second, several 
studies have investigated whether students with optical mi- 
croscopes have higher learning gain compared to students 
with a virtual microscope and found that this is not the 
case, e.g. [11, 15]. 


One of our research questions in this paper is to identify 
factors that are related to success in locating structures in 
a virtual microscope. Models that predict student success 
are a common topic of educational data mining research [3]. 
For example, Dietz-Uhler et al. [8] summarized which kinds
of data are often used to predict students’ success, classified
into data gathered from the Learning Management System 
(e.g. clicks on resources) and performance data (e.g. feed- 
back or grades, created by the instructor or respectively the 
system). Other papers use demographic data and prior suc- 
cess to predict success rates, e.g. [16]. Prior work has shown 
that, depending on the knowledge domain, different features
have high importance for predicting students’ success. For
example, Ramos et al. [20] found that hits in a discussion
forum have high importance for predicting students’ success.
Yukselturk et al. [24] used a correlational research design
and concluded that self-regulation variables have a highly
statistically significant relation to learning success using
interpretable methods. To the best of our knowledge, there is
no prior work on success prediction in virtual microscopes.
We want to close this research gap.


Figure 1: Screenshot of the MyMi.mobile structure search
mode.


To do so, we turn to item response theory. Item response
theory is concerned with modeling the probability of success
of a student i at a task j via a logistic distribution over
the difference between a student’s ability parameter
$\theta_i$ and a task’s difficulty parameter $b_j$ [2, 12].
Generalizations of this model include more parameters and
other distributions [2, 12]. In this paper, we use the
standard logistic distribution but include auxiliary
parameters for features that capture student behavior.


To analyze the locations of typical mistakes, we perform a 
clustering analysis using Gaussian mixture models [7]. Clus- 
tering is a well-established technique in educational data 
mining [3], e.g. to identify groups of student solutions that 
may warrant similar feedback [9]. Our reasoning is similar: 
We wish to identify typical locations of mistakes in structure 
searches such that we have a reasonably sized set of repre- 
sentative locations that a teacher can inspect and for which 
feedback may be developed. 


3. METHOD 


3.1 MyMi.mobile VM and Dataset 


The MyMi.mobile VM provides three modes: exploration, 
which shows expert annotations, structure search, where stu- 
dents need to locate a structure in a slide, and diagnosis, 
where students need to identify the slide and the stain. The 
structure search mode is shown in Figure 1. Students see 
a tissue slice and are supposed to move the field of view 
(by panning and zooming) until the crosshair is located over 
the correct structure. Then, they confirm their choice by 
clicking the arrow on the bottom right. As additional in- 
terface elements, students see an explanatory text at the 
bottom of the screen (“Position the area to be searched in 
the center of the screen and confirm your decision by press- 
ing the ’continue-button’. Start now!”), a ’minimap’ of the 
slide on the bottom left, and a timer on the top right.
Students can select structure searches in any order from a
list sorted alphabetically according to the slides (e.g.
armpit, eye, colon)¹. Students can attempt the same search as
many times as they want.


¹ The alphabetical ordering probably introduces an ordering
bias. In particular, we observe that the two most attempted
structure searches are both on the alphabetically first slide.


Figure 2: Illustration of the feature vector $\vec{x}$ (top)
for an attempt of student i on task j, and the parameter
vector $\vec{w}$ (bottom) of the item response theory model.


We consider a dataset of 19,525 structure search attempts
by 1,460 students recorded in the summer term 2020 at two
German universities. Most students were second semester
undergraduate students of medicine, with some students from
fourth semester dentistry (45) and molecular medicine in the
second or fourth semester (39). Most students (817) did not
attempt any structure search. Of the 643 who did, most
attempted few structure searches (median 7) with some ‘heavy
users’ making hundreds of attempts (mean 30.37, maximum 649).
68.19% of attempts were successful.


For the purpose of validating our model, we also asked four
anatomy teachers using the VM to rate the difficulty of the
30 most attempted structure search tasks on the platform. The
teachers received the following instruction: any structure
search that at least 65% of students are expected to solve on
their first try should be rated as ‘easy’; any structure
search between 40% and 65% should be rated ‘moderate’; and
any structure search below 40% should be rated as ‘difficult’.
These boundaries were chosen based on the actual success rates
of students: 10 of the tasks had an actual success rate over
65%, 10 had an actual success rate between 40% and 65%, and
10 had an actual success rate below 40%.


3.2 Item Response Theory

In order to investigate RQ1, we trained a generalized item
response theory model implemented via logistic regression. In
particular, we pre-processed each structure search attempt to
be represented as a 1,859 dimensional, highly sparse feature
vector (see Figure 2). The first four dimensions (gray)
contain auxiliary features, namely: 1) How often has the
student failed on the same structure search? (failures) 2)
How often has the student succeeded on the same structure
search? (successes) 3) Has the student already seen the same
slide in the exploratory mode? (explored), and 4) How many
minutes has the student spent on the current structure
search? (duration). The next 1,460 dimensions (blue) indicate
which student made the attempt, i.e. feature $x_{4+i} = 1$ if
the current attempt was made by student
$i \in \{1, \ldots, 1460\}$ and 0 otherwise. The remaining 395
features (orange) indicate which task the attempt was made
on, i.e. feature $x_{1464+j} = 1$ if the current attempt was
made on task $j \in \{1, \ldots, 395\}$ and 0 otherwise.


Our model, then, has the form

$$p_{i,j} = \frac{1}{1 + \exp(-\vec{w}^T \cdot \vec{x})} = \frac{1}{1 + \exp(b_j - \theta_i - w_1 \cdot x_1 - \ldots - w_4 \cdot x_4)}, \quad (1)$$

where $\vec{x}$ is the sparse feature vector of an attempt and
$\vec{w}$ is the parameter vector (see Figure 2). Note that we
obtain a classic IRT model if the first four features $x_1$,
$x_2$, $x_3$, and $x_4$ are 0. We used the implementation of
logistic regression from the scikit-learn library [19].
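
To make the correspondence between the logistic regression view and the item response theory view concrete, the following sketch builds the sparse feature vector of one attempt and evaluates the model by hand. It does not train anything. All parameter values, the student and task numbers, and the function names are hypothetical and chosen only for illustration.

```python
import math

# Dimensions given in the text: 4 auxiliary + 1,460 student + 395 task features.
N_AUX, N_STUDENTS, N_TASKS = 4, 1460, 395

def feature_vector(failures, successes, explored, duration, student, task):
    """Sparse feature vector as {0-based index: value}; student/task are 1-based."""
    x = {0: float(failures), 1: float(successes),
         2: float(explored), 3: float(duration)}
    x[N_AUX + student - 1] = 1.0               # one-hot student indicator
    x[N_AUX + N_STUDENTS + task - 1] = 1.0     # one-hot task indicator
    return x

def success_probability(w, x):
    """Logistic model: 1 / (1 + exp(-w^T x)) on a sparse x."""
    z = sum(w[k] * v for k, v in x.items())
    return 1.0 / (1.0 + math.exp(-z))

# Hypothetical parameters: ability of student 17 and difficulty of task 3.
theta_i, b_j = 0.5, 2.0
student, task = 17, 3
w = [0.0] * (N_AUX + N_STUDENTS + N_TASKS)
w[N_AUX + student - 1] = theta_i               # student weight is theta_i
w[N_AUX + N_STUDENTS + task - 1] = -b_j        # task weight is -b_j
w[1] = 0.4                                     # hypothetical weight for successes

# With all auxiliary features at 0, the model reduces to classic IRT.
p_plain = success_probability(w, feature_vector(0, 0, 0, 0, student, task))
# One prior success on the same task raises the predicted probability.
p_prior = success_probability(w, feature_vector(0, 1, 0, 0, student, task))
print(p_plain, p_prior)
```

The sign convention (task weight equal to the negative difficulty) follows from comparing the two forms of equation (1).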


3.3 Clustering 

To investigate RQ2 and RQ3, we applied clustering on the
locations of mistakes. More specifically, we used a Gaussian
mixture model with K components, which approximates the
probability density over locations (x, y) of mistakes in an
image as

$$p(x, y) = \sum_{k=1}^{K} \mathcal{N}\big((x, y) \,|\, \vec{\mu}_k, \Sigma_k\big) \cdot \pi_k, \quad (2)$$

where $\mathcal{N}((x, y) | \vec{\mu}_k, \Sigma_k)$ denotes the
2D Gaussian density with mean $\vec{\mu}_k \in \mathbb{R}^2$
and covariance matrix $\Sigma_k \in \mathbb{R}^{2 \times 2}$,
and where $\pi_k \in [0, 1]$ is the prior for the kth Gaussian
component. Compared to other clustering algorithms, Gaussian
mixtures have at least two advantages. First, they can deal 
with non-spherical clusters by adjusting the covariance ma- 
trix accordingly. Second, they provide a probability density 
of the data. Moreover, they remain fast to train with an 
expectation maximization scheme [7]. We use the scikit- 
learn implementation of Gaussian mixtures [19]. To select 
the optimal number of components K, we use the Bayesian 
information criterion [18]. 
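
The procedure described above can be sketched with scikit-learn as follows. The synthetic mistake locations, the candidate range for K, the full covariance type, and the function name are assumptions of this example and are not taken from the study.

```python
import numpy as np
from sklearn.mixture import GaussianMixture

def fit_gmm_with_bic(points, max_k=5, seed=0):
    """Fit mixtures with 1..max_k components and keep the one with lowest BIC."""
    models, bics = [], []
    for k in range(1, max_k + 1):
        gmm = GaussianMixture(n_components=k, covariance_type="full",
                              random_state=seed).fit(points)
        models.append(gmm)
        bics.append(gmm.bic(points))
    return models[int(np.argmin(bics))], bics

# Synthetic mistake locations in relative image coordinates (hypothetical):
# a tight cluster and a more spread-out cluster.
rng = np.random.default_rng(0)
points = np.vstack([
    rng.normal([0.2, 0.3], 0.02, size=(200, 2)),
    rng.normal([0.7, 0.6], 0.05, size=(200, 2)),
])

best, bics = fit_gmm_with_bic(points)
print(best.n_components)
print(best.means_)     # cluster centers a teacher could inspect
print(best.weights_)   # mixture priors pi_k
```

The component means give a small set of representative locations, and the covariance matrices indicate how far each cluster spreads in each direction.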


4. RESULTS AND DISCUSSION 


In this section, we present the results of our experiments. 
We begin with the teacher difficulty ratings, then continue 
with the item response theory model (regarding RQ1), and 
conclude with the clustering analysis (regarding RQ2 and RQ3).


4.1 Teacher difficulty ratings 

As the result of our teacher survey, we obtained difficulty
ratings (‘easy’, ‘moderate’, or ‘difficult’) for the 30 most
attempted structure search tasks. We observe that the teachers
agreed moderately. On average, the Kendall $\tau$ for pairwise
agreement is 0.4 and the overall Krippendorff’s $\alpha$ is
0.44. To enhance reliability, we consider the average rating
of each task in the subsequent analysis. On average, teachers
ranked most tasks as ‘easy’ (about 55%), fewer as ‘moderate’
(just below 35%), and very few as ‘difficult’ (about 10%;
refer to blue bars in Figure 3). Recall that, according to the
actual success rates, all blue bars would have height 1/3.
This indicates that teachers tended to underestimate the
actual difficulty, which may be an instance of the ‘expert
blind spot’, i.e. the phenomenon that experts may fail to
imagine the difficulties of novices [17]. We will use the
teacher ratings as reference to further validate our item
response theory model below.


Figure 3: Left: The frequency (blue) and the mean actual
success rate (orange) of tasks rated as easy, moderate, or
difficult by teachers. Right: The average difficulty parameter
assigned by the model to tasks rated as easy, moderate, or
difficult by teachers.


Figure 4: Calibration plot of the IRT model. Dashed lines
indicate bin width. The color indicates how full each bin is.


4.2 Factors of student success 

In order to investigate which factors contribute to student 
success (RQ1), we trained an item response theory model 
(refer to Section 3.2) on our data. 


Model validation To validate the model, we performed three
analyses. First, we performed a 10-fold cross-validation over
attempts, yielding 80.19% ± 0.13% training accuracy and
77.73% ± 1.83% test accuracy (mean ± standard deviation).
Because our data is imbalanced (with fewer failures than
successes), we also considered AUC (0.86 ± 0.001 in training
and 0.82 ± 0.02 in test) and F1 score (0.69 ± 0.003 in
training and 0.66 ± 0.024 in test, with a test precision of
0.73 ± 0.06 and a test recall of 0.60 ± 0.03). All measures
indicate good generalization from training to test set. For
the remainder of this section, we consider a model trained on
all data.
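
As a quick plausibility check of the reported numbers, the F1 score is the harmonic mean of precision and recall. The sketch below applies this definition to the mean test precision and recall. Because the reported values are averages over folds, the result can only be expected to match the reported F1 approximately.

```python
def f1_score(precision, recall):
    """Harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)

# Mean test precision and recall as reported in the text.
f1_from_means = f1_score(0.73, 0.60)
print(round(f1_from_means, 2))
```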


Second, we assessed model calibration. Calibration means 
that the predicted success probability of a student corre- 
sponds to the actual success rate [14]. To analyze this, we 
aggregated data into bins according to the predicted suc- 
cess probability (each bin had a width of 10%) and then 
computed the actual success rate within each bin. Figure 4 
shows the corresponding calibration curve, where the dashed 
lines indicate the width of each bin in the analysis. Given 
that the curve remains within the dashed zone, we conclude
that our model was well-calibrated. The largest share of
predictions (27.5%) was in the 90% to 100% bin (orange dot),
i.e. our model predicted successful attempts with high
confidence.


Figure 5: The scaled weights of auxiliary features.
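
The binning step of such a calibration analysis can be sketched as follows. The example uses ten equal-width bins as described in the text. The function name, the handling of empty bins, and the toy predictions are assumptions.

```python
def calibration_curve(probs, outcomes, n_bins=10):
    """Per bin: (mean predicted probability, actual success rate, count).

    Bins have equal width 1 / n_bins; empty bins are reported as None.
    """
    bins = [[] for _ in range(n_bins)]
    for p, y in zip(probs, outcomes):
        idx = min(int(p * n_bins), n_bins - 1)   # p == 1.0 goes to the last bin
        bins[idx].append((p, y))
    curve = []
    for members in bins:
        if not members:
            curve.append(None)
            continue
        mean_p = sum(p for p, _ in members) / len(members)
        rate = sum(y for _, y in members) / len(members)
        curve.append((mean_p, rate, len(members)))
    return curve

# Predicted success probabilities and actual outcomes (hypothetical data).
probs = [0.95, 0.92, 0.97, 0.15, 0.12]
outcomes = [1, 1, 0, 0, 0]
curve = calibration_curve(probs, outcomes)
for i, entry in enumerate(curve):
    print(f"{i * 10}%-{(i + 1) * 10}%:", entry)
```

A model is well calibrated in this sense when the actual success rate of each non-empty bin lies within the bin's probability range.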


Third, we compared the difficulty parameters of our model 
with the human ratings from Section 4.1. Figure 3 (right) 
displays the average difficulty parameter assigned by the IRT 
model for each difficulty class. We observe that tasks rated 
as more difficult by teachers were also rated as more difficult 
by the model. Tasks rated as ‘easy’ by the teachers have a
mean difficulty parameter of 0.5, tasks rated as ‘moderate’
have a mean difficulty parameter of 2, and tasks rated as
‘difficult’ have a mean difficulty parameter of 3.


Overall, we note that the model is reasonably accurate, well- 
calibrated, and agrees with teacher ratings of difficulty. 


Factors to Success Next, we inspect the weights of our 
model to infer which features are predictive of student suc- 
cess. To make the weights comparable, we normalized the 
auxiliary features to the same scaling as the binary features. 


Regarding auxiliary features (Figure 5), we observe that the 
number of prior failures had a low negative weight, i.e. it is 
not predictive of student success. This is likely explained by 
the design of the MyMi.mobile VM. On a failure, students 
only learned that they were wrong but not where the right 
answer might be. This ensures that students cannot get
the right answer by trial and error. Attempt duration also 
had a low negative weight. This may be because duration 
is an ambiguous feature. Students may take longer both for 
productive reasons — e.g. inspecting the slide in more detail 
to validate the image against the definition of the structure 
— and unproductive reasons — e.g. being distracted. Ac- 
cordingly, duration may not provide predictive information 
either way. 


By contrast, we obtained positive scaled weights for the
successes (0.39) and explored (0.47) features. The explanation
for the former is obvious: if students have found the correct
solution for a task once, chances are that they memorized the
location and can find it again. An explanation for the latter
is that having seen an annotated example of the structure
helps to find another instance of it in a structure search.
That being said, we cannot make causal inferences from this
model. It is also possible that students who are more likely
to succeed for other reasons are also more likely to consult
the exploratory slides. On the other hand, we account for
a general underlying student ability via the student ability
parameter (Figure 6).


We observe that student ability parameters vary in the range
from −1.97 to 1.55 and most parameters are clumped around
0 ± 0.54 (Figure 6). We also observe that the correlation
between the ability parameter and actual success rate is
relatively weak (Kendall τ = 0.52). To further investigate
the role of student ability, we performed another 10-fold
cross validation over students instead of attempts, i.e. we
tried to generalize to students that the model had never
seen before and who thus had an ability parameter of 0.
In this setting, we still obtained an average training
accuracy of 80.19% ± 0.13% and an average test accuracy of
77.93% ± 1.83%, indicating that the student ability
parameters contributed little to an accurate prediction. We have
two possible explanations for this finding: First, it may just
be that the underlying ‘true’ student ability is relatively
uniform because almost all students were in the same semester
at the same two universities. Second, student ability may
change during usage of the microscope, such that a single
parameter may not be able to capture student ability
particularly well.


Figure 6: The success rate vs. the student ability of structure
searches. Each dot represents a student. Color indicates the
number of attempted structure searches by a student.


Figure 7: The success rate vs. the task difficulty of structure
searches. Each dot represents a task. The heatmap colors
indicate the number of attempts of a given task.
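Cross-validation over students instead of attempts requires that all attempts of one student land in the same fold, so that test students are unseen during training. A minimal sketch of such a split, assuming a hypothetical list with one student id per attempt (the function name, the ids, and the round-robin fold assignment are illustrative assumptions, not the authors' procedure):

```python
import random


def student_folds(student_ids, n_folds=10, seed=0):
    """Yield (train_indices, test_indices) so that no student is on both sides."""
    students = sorted(set(student_ids))
    random.Random(seed).shuffle(students)
    # Assign shuffled students to folds in round-robin fashion.
    fold_of = {student: i % n_folds for i, student in enumerate(students)}
    for fold in range(n_folds):
        test = [i for i, s in enumerate(student_ids) if fold_of[s] == fold]
        train = [i for i, s in enumerate(student_ids) if fold_of[s] != fold]
        yield train, test


# Hypothetical attempt log: the student who made each attempt.
attempts = ["s1", "s1", "s2", "s3", "s3", "s3", "s4", "s5", "s5", "s6"]
folds = list(student_folds(attempts, n_folds=3))
for train, test in folds:
    held_out = sorted({attempts[i] for i in test})
    print(f"test students: {held_out}, train attempts: {len(train)}")
```

At prediction time, students in the test fold have no fitted ability parameter and would be scored with an ability of 0, as described in the text.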


Finally, we find that the task difficulty had the clearest
relation to success compared to the other features. As shown
in Figure 7, parameters range from −2.22 to 3.19 (mean
0 ± 1.19) and anti-correlate very well with the actual success
rate (Kendall τ = −0.91). This indicates that tasks had a
roughly consistent difficulty across students. It also explains
how our IRT model generalized well to new students.
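The rank correlations reported in this section can be reproduced in principle with a simple pair count. The sketch below computes Kendall's tau-a, which ignores ties; the text does not state which variant was used, so this choice and the toy numbers are assumptions.

```python
def kendall_tau(x, y):
    """Kendall rank correlation (tau-a) from concordant and discordant pairs."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("x and y must have the same length of at least 2")
    concordant = discordant = 0
    n = len(x)
    for i in range(n):
        for j in range(i + 1, n):
            product = (x[i] - x[j]) * (y[i] - y[j])
            if product > 0:
                concordant += 1
            elif product < 0:
                discordant += 1
            # Tied pairs (product == 0) count for neither side in tau-a.
    return (concordant - discordant) / (n * (n - 1) / 2)


# Hypothetical task difficulties and success rates (not from the paper).
difficulty = [-2.0, 0.0, 1.0, 3.0]
success_rate = [0.9, 0.7, 0.75, 0.2]
tau = kendall_tau(difficulty, success_rate)
print(f"Kendall tau = {tau:.2f}")
```

A strongly negative value, as in the toy data, means that harder tasks are solved less often in almost every pairwise comparison.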


In summary, we observe that prior success on the same task, 
having seen the corresponding exploratory slide, and task 
difficulty were most predictive of student success, whereas 
student ability was only moderately predictive and prior fail- 
ures as well as duration were not predictive. 


4.3 Typical mistakes 

To investigate RQ2 and RQ3, we consider the two most at- 
tempted structure search tasks, namely searching for the 
nucleus of a myoepithelial cell and searching for an apoc- 
rine gland in human armpit tissue (refer to Figure 8 left and 
right, respectively). 


The myoepithelial cell search (Figure 8, left) was the hardest
task in the whole dataset, with only 16.87% correct guesses
(shown as green dots), a difficulty parameter of 3.19,
and unanimous agreement among all four experts that it is
difficult. Figure 8 (left) illustrates why the task is difficult: the
correct regions (in green) are small and hard to spot.


By contrast, the slide for the apocrine gland task (Figure 8, 
right) exhibits many and large correct regions. Accordingly, 
57.72% of guesses were correct (green dots), the model as- 
signed a lower difficulty rating (1.28), and all experts agreed 
that this task is easy. 


To identify typical mistakes, we trained a 10-component
Gaussian mixture model to cluster all the mistake locations
(shown as blue dots). The cluster means are plotted as orange
shapes in Figure 8. Interestingly, most clusters for the
myoepithelial cell search task, namely the orange squares in
Figure 8 (left), could plausibly be cell cores of myoepithelial
cells. The bottom-most orange diamond is also located
near a correct region. Only the remaining orange diamonds
are clearly wrong because they are not located at cell cores.
Generally, many students seemed to have a correct
understanding of the structure to be found but failed to spot
unambiguously correct locations.
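The number of mixture components was chosen with the Bayesian information criterion (BIC), as the accompanying footnote explains. A minimal sketch of that computation, assuming two-dimensional click coordinates and full covariance matrices (both assumptions, as are the function names and the log-likelihood values, which are made up for illustration):

```python
import math


def gmm_num_parameters(n_components, n_dims=2):
    """Free parameters of a Gaussian mixture with full covariance matrices."""
    weights = n_components - 1  # mixture weights sum to one
    means = n_components * n_dims
    covariances = n_components * n_dims * (n_dims + 1) // 2
    return weights + means + covariances


def bic(log_likelihood, n_params, n_samples):
    """Bayesian information criterion; lower values are better."""
    return n_params * math.log(n_samples) - 2.0 * log_likelihood


def select_components(log_likelihoods, n_samples, n_dims=2):
    """Pick the component count with the lowest BIC.

    log_likelihoods maps a component count to the total log-likelihood
    of a model fitted with that many components.
    """
    scores = {
        k: bic(ll, gmm_num_parameters(k, n_dims), n_samples)
        for k, ll in log_likelihoods.items()
    }
    return min(scores, key=scores.get), scores


# Made-up log-likelihoods for 1, 2 and 3 components on 500 mistake locations.
best, scores = select_components({1: -1000.0, 2: -900.0, 3: -899.0}, n_samples=500)
print(best, {k: round(v, 1) for k, v in scores.items()})
```

The authors describe picking the point beyond which BIC barely improves, which may differ from the strict minimum used here.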


By contrast, the cluster means for the apocrine gland search 
(Figure 8, right) indicate deeper misconceptions. All cluster 
centers are clearly wrong. More specifically, the diamond
in the bottom right corresponds to an eccrine instead of
an apocrine gland, and the center diamond corresponds to a
broken structure.


In both tasks, we can use cluster centers as a tool to find 
typical misconceptions that need to be discussed in class. 


5. CONCLUSION 


In this paper, we investigated three research questions re- 
garding structure search tasks in virtual microscopes, namely 
1) Which features predict student success? 2) What are typ- 
ical locations of student mistakes? 3) What are underlying 
misconceptions explaining these locations? 


Footnote on the 10-component Gaussian mixture model: We observed
that only a small improvement in the Bayesian information
criterion could be achieved with more than 10 components.
We also observed that 10 components were sufficient,
such that some components ended up unused in Figure 8.
For other slides, different numbers may be needed.


Figure 8: Students’ correct (green) and wrong (blue) guesses on structure searches for myoepithelial cell cores (left) and apocrine
glands (right). Correct structures are outlined in green. The centers of mistake clusters are orange shapes.


To answer the first question, we trained a generalized item 
response theory (IRT) model, obtaining 77% accuracy and 
0.82 AUC in 10-fold cross-validation as well as solid cal- 
ibration. Of the features considered, we found that task 
difficulty was particularly predictive of student success and 
the obtained difficulty parameters aligned well with actual 
student success rates and expert ratings. We observed less 
predictive value of student ability, illustrated by the fact 
that IRT models could generalize without loss of accuracy 
to new students. Moreover, prior success on the same task 
and having seen an annotated version of the same histolog- 
ical slide were predictive of success, whereas prior failures 
and duration spent on the task were not. This is interesting 
because it suggests that time stamps could be removed from 
the data, enhancing the privacy of the system. 


Regarding the second and third research questions, we
applied clustering on mistake locations and interpreted the
cluster centers in terms of misconceptions that may have led 
students to wrongly click at these locations. Such miscon- 
ceptions can then be discussed in class to improve students’ 
learning, or can be used to provide adaptive feedback in the 
virtual microscope tool. 


Overall, this work represents the first step towards educa- 
tional data mining on virtual microscope data with results 
that can be used to improve virtual microscope education, 
e.g. by ordering structure searches according to difficulty, by 
discussing typical misconceptions in class, and by enhancing 
annotations. Further work remains to be done, though. In 
particular, more features should be included to both enhance 
accuracy and find educational interventions that support 
student performance (like the exploratory view). Further, 
one could include relations between tasks in the model, thus 
identifying tasks that share an underlying skill, and extend 
the analysis to more advanced knowledge tracing methods. 
Finally, convolutional neural networks could be utilized to 
generalize teacher annotations and to identify regions of im- 
ages that are easy to confuse with a structure to be searched. 
ABSTRACT


The ability to spell correctly is a fundamental skill for participating 
in society and engaging in professional work. In the German lan- 
guage, the capitalization of nouns and proper names presents major 
difficulties for both native and nonnative learners, since the defini- 
tion of what is a noun varies according to one’s linguistic perspec- 
tive. In this paper, we hypothesize that learners use different cogni- 
tive strategies to identify nouns. To this end, we examine capitali- 
zation exercises from more than 30,000 users of an online spelling 
training platform. The cognitive strategies identified are syntactic, 
semantic, pragmatic, and morphological approaches. The strategies 
used by learners overlap widely but differ by individual and evolve 
with grade level. The results show that even though the pragmatic 
strategy is not taught systematically in schools, it is the most wide- 
spread and most successful strategy used by learners. We therefore 
suggest that highly granular learning process data can not only pro- 
vide insights into learners’ capabilities and enable the creation of 
individualized learning content but also inform curriculum devel- 
opment. 


Keywords 


Student strategies, Learning type, Online learning, German lan- 
guage, Spelling, Learning analytics 


1. INTRODUCTION 


The German language is known to be difficult to learn not only for 
nonnative speakers but also for native speakers who struggle with 
orthography [26]. However, a high degree of orthographic compe- 
tence is crucial for successful communication with authorities and 
for professional success, as studies on employers and personnel se- 
lection show [21, 27]. 


One of the many peculiarities in the German language is capitali- 
zation. While nouns and proper names are generally capitalized, 
there are different linguistic perspectives on which words are
considered nouns. Consequently, learners can apply various redundant
strategies to identify nouns.


Previous research has further indicated that these cognitive
strategies for capitalization result in different patterns of errors that can
be distinguished from each other [18]. While some learners con- 
sider the entire phrase when deciding whether to capitalize a word, 
others focus on only the word itself, especially the word ending, as 
an indication of the correct capitalization. Other learners use the 
words’ meaning or take a pragmatic approach. 


This paper aims to contribute to a better understanding of learners’ 
cognitive strategies while processing capitalization tasks in Ger- 
man spelling courses. To this end, we use anonymized learning data 
on capitalization from the online platform orthografietrainer.net. 
The dataset consists of 9,647,385 single exercises completed by 
30,658 users. 


Identifying learners’ cognitive strategies for capitalization tasks 
can enable educators and learning platforms to offer individualized 
help. Moreover, it can improve learning success by informing the 
implementation of personalized adaptive learning environments. 
Furthermore, comparing the predominant cognitive strategies in 
our large dataset to widely taught strategies in school can help in- 
form future curriculum development. Previous studies of textbooks 
show that the set of rules taught in school contains semantic, mor- 
phological, and syntactic properties but almost completely lacks 
pragmatic strategy instruction [20]. Nonetheless, we found strong 
evidence that the pragmatic perspective is the major approach used 
by students of German. 


In summary, we study the following research questions: 


RQ 1: Which cognitive strategies for capitalization are used 
by learners in grades 5 to 9? 

RQ 2: How does the use of capitalization strategies differ by 
grade level and gender? 

RQ 3: How do the predominant capitalization strategies used 
by learners compare to the strategies taught in school? 


To answer the research questions, we proceeded as follows: The 
words used in the capitalization exercises on the online learning 
platform were manually one-hot encoded with 18 grammatical fea- 
tures associated with the four cognitive strategies for capitalization. 
In the next step, the four cognitive strategies for solving capitaliza- 
tion tasks were modeled as decision trees. Subsequently, the results 
of the four decision tree models were compared word by word with 
the solutions of more than 30,000 users. 


2. RELATED WORK


2.1 Grammatical and cognitive approaches to German noun capitalization

The German orthographic system is complex and difficult to mas- 
ter. In contrast to other European writing systems, the difficulties 
relate less to spelling and more to the indication of grammatical 
structures. This can be illustrated by the capitalization of nouns, a 
peculiarity of the German spelling system. Unlike in many other
languages, in German, all nouns are capitalized. The ostensibly 
simple spelling rule that “nouns have to be capitalized” forces the 
speller to define precisely what is considered a noun and what is 
not. On closer examination, this question has a variety of very dif- 
ferent possible answers. 


On the one hand, there are many obvious nouns, such as people, 
places, things, and proper nouns. However, beyond that, every part 
of speech in German can be formally or functionally transformed 
into a noun. This is sometimes recognizable by a change in suffixes 
(cf. Ex. 1). In other cases, it can only be inferred from the syntactic 
context, for example, when articles or prepositions are added (cf. 
Ex. 2). 


Ex. 1: fahren (V) → der Fahrer (N)
       to drive → the driver

Ex. 2: fahren (V) → das Fahren (N)
       to drive → the driving


The situation is further complicated by idiomatic expressions that 
formally contain a noun but that, from a pragmatic point of view, 
have lost their nominal characteristics (cf. Ex. 3). For instance, the 
supposed noun in Example 3 can still be formally complemented 
by an adjective, but this otherwise typical procedure for nouns is 
contrary to what a native German speaker would say. For this rea- 
son, the capitalization of such phrases is highly controversial in or- 
thographic theory [5] and is a common source of error among stu- 
dents. 


Ex. 3: im Allgemeinen, but not: im *häufigen Allgemeinen
       in general, but not: in *common general


Consequently, all of the nouns in the first sentence of Jane Austen's 
“Pride and Prejudice” can be identified as nouns across several lin- 
guistic levels and work in both English and German: 


Ex. 4: “It is a truth universally acknowledged that a single
man in possession of a good fortune must be in want
of a wife.”

Ex. 5: “Es ist eine allgemein anerkannte Wahrheit, dass
ein Junggeselle im Besitz eines schönen Vermögens sich
nichts mehr wünschen muss als eine Frau.”


Four of the nouns in the sentence occur with articles and in a typical 
syntactic environment for nouns. In addition, "man" and "wife" are 
identifiable as nouns by their concrete semantics. The words "truth" 
and “possession” are also marked morphologically since they were 
derived from the adjective "true" and the verb “to possess” with the 
help of a derivative ending. 


The difficulties of coherent noun definition thus lie in the fact that 
a term may have a different extension depending on the linguistic 
perspective, although the semantic, morphological, syntactic and 
pragmatic perspectives agree in regard to a broad core of words. On 
the periphery, however, different perspectives lead to different
conceptual boundaries and, consequently, to different orthographic
decisions. The ground truth for what constitutes correct writing is
therefore a mix of these different perspectives and is defined by
the Council of German Orthography [15].


The teaching of these different perspectives has been shown by an
analysis of different textbooks [18]. The author found that
capitalization is practically always introduced semantically. With the
beginning of grammatical education in later primary school classes,
morphological properties of nouns are added (especially the
property numerus and some typical derivational endings), and articles
as typical noun companions are introduced. At the secondary level,
this knowledge is supplemented more systematically by further
morphological and, especially, syntactic properties of the noun
group (e.g., gender, case, other determiners besides the article).
However, in all courses, noun identification is based exclusively on
formal-grammatical grounds. Only one of the textbooks examined
also refers to pragmatic properties of the nouns [20].


Müller [20] demonstrated that errors in capitalization correlate
strongly with different linguistic perspectives. Thus, some learners 
are apparently guided more by semantic aspects and others more by 
morphological, syntactic or pragmatic factors. These findings pro- 
vide a starting point for our study, in which we attempt to model 
the different perspectives on noun capitalization using learning an- 
alytics methods to test whether different learning types can be dis- 
tinguished. 


2.2 Cognitive strategies of capitalization 

Very little literature exists on the differentiation of orthographic 
strategies. Theoretical models [16, 24] distinguish between lexical 
and syntactic approaches, which roughly correspond to semantic 
and morphological strategies on the one hand and syntactic and 
pragmatic strategies on the other. Studies on the success of both 
approaches [28] have been limited to very small corpora and have 
produced partly contradictory results. The proposal of a division 
into four individual strategies was made by [19], who also found 
initial indications of different error profiles on the basis of an em- 
pirical study. 

According to our linguistic considerations, the investigation is 
based on four theoretically distinguishable capitalization strategies: 


The semantic strategy capitalizes words that have a concrete
meaning. This strategy is primarily taught in the first grade of
elementary school: “Things that can be touched have to be capitalized.”


Katze, Hand (cat, hand)
but not: *nacht, *meinung (night, opinion)


The morphological strategy is to capitalize words that are classified
as nouns because of the type of word and the word ending (word
derivation):


Läufer (runner)
but not: (das) *laufen ((the) running)


The syntactic strategy is to capitalize words that occur in a typical 
nominal syntactic environment, preferably in combination with at- 
tributes, articles or other determiners. 


Die (totale) Dunkelheit (the (total) darkness)
but not: *dunkelheit ängstigt mich. (Darkness frightens me.)


The pragmatic strategy is to capitalize words that are used in the 
current discourse like a nominal unit, which does not apply to all 
nouns. Pragmatically proper nouns can be supplemented with at- 
tributes or substituted with pronouns, which is often not possible 
with nouns in fixed phrases. 


Der Grund (the ground)
but not: im *grunde (in the ground; idiom for “basically”)


Typically, the use of several strategies leads to success. Further- 
more, there are many words with which capitalization errors are 
made only very rarely, for example, articles, prepositions, and pro- 
nouns. There are only a few words where using only one strategy 
leads to the correct result, and these words are not representative in 
the German language. Nevertheless, learners apply the strategies to 
different degrees and thus arrive at different results. 


Proceedings of The 14th International Conference on Educational Data Mining (EDM 2021) 567 


2.3 Spelling error analysis 

We identify the learners’ use of the previously introduced cognitive 
strategies through the analysis of error patterns. The analysis of 
spelling errors can help in understanding students’ cognitive ap- 
proaches to assignments [2]. Many studies use spelling error anal- 
yses to gain knowledge about second language learners; for exam- 
ple, studies [3, 4] analyzed spelling errors of native Arabic speakers 
in English courses or programs. Others investigate special subpop- 
ulations, as the authors of [2, 23] did with dyslexic learners. In ad- 
dition to differences between native language and foreign language 
learning and between subpopulations, there are different classifica- 
tion schemes for spelling errors. Some authors have used Cook’s 
classification from 1999 [9], which differentiates between omis- 
sion, substitution, addition, transposition and sound-based errors 
[3, 4]. There are major differences between writing systems, and
Abu-rabia [1] showed that these differences also affect spelling er- 
rors. For the German language, Landerl and Wimmer [14] used the 
“phoneme distance score” as a scoring method for spelling errors. 
Defior and Serrano [10] divided Spanish spelling errors into seven
different classes of errors, which include substitutive spelling,
partial spelling, random letters and nonorthographic spelling.
Czech spelling errors were divided into phonological errors on the 
one hand and orthographic, morphological, grammatical, and lexi- 
cal errors on the other hand [7]. The information gained about the 
learners can later be used in adaptive environments for different 
educational approaches to best address each student's abilities [11]. 


2.4 Learner-Level Adaptation 

Adaptive learning environments aim to improve learning success 
by building personalized models of each student's knowledge, pref- 
erences and difficulties [6, 12]. The goal of such an adaptation is to 
individually optimize the learning path for each student [17]. This 
can lead to higher motivation, less overload and frustration, and, 
thus, better results [17]. Personalized adaptation to the student's 
needs can appear in a variety of forms, including task sequencing, 
intelligent solution analyses and problem-solving support [6]. The 
adaptations and the subsequent assessment of adaptive learning
environments use a range of different data [8]. The parameters used
most often in learner-level adaptation refer to the users themselves
and their learner profiles, which are used to optimize
content [17, 22]. The learner profile consists, among other
components, of the learner’s behavioral pattern, learner preferences,
cognitive traits or learning style as well as performance data [8, 17].
The learner’s behavioral pattern can be analyzed by tracing his or 
her activities on an online platform. Learner preferences thus basi- 
cally describe learners’ preferred materials [22]. Another approach 
is to adapt a system based on learners’ cognitive traits. These traits 
are their cognitive abilities, for example, their working memory ca- 
pacity, abstraction ability or analysis ability [22]. There are various 
definitions of learning style. However, they all agree that there are 
different ways that learners experience learning [13]. Fang et al. 
[11] also differentiated between features of a learner’s interaction 
with a system and individual differences between learners in terms 
of, for example, skill and knowledge. 


All this information can be used by teachers to gain a better under- 
standing of their students, leading to opportunities to adapt their 
teaching, materials or tests [13]. In addition, learners can be pro- 
vided with appropriate materials and tasks that meet their needs. 
Finally, learning styles differ in terms of the sequencing of tasks 
[13]. The relationship between learning styles and the structure of
the learning material has been investigated by, for example, the
authors of [25], who found that students whose learning styles and
multimedia preferences match the material in their online course
have higher scores.


In the context of this article, we suggest using the information 
gained about cognitive strategies for capitalization to display 
matching tips on online spelling platforms and to evaluate the dif- 
ficulty of an exercise task in terms of which strategy is used. 


3. DATASET 


3.1 Orthografietrainer.net platform 

The learning platform orthografietrainer.net offers online exercises 
for improving German spelling skills, including exercises on capi- 
talization, punctuation, and spelling. The platform provides imme- 
diate and extensive individual feedback, which is impossible in a 
classroom setting. The training platform is built based on the peda- 
gogical assumption that spelling requires not only knowledge but 
also skills. Thus, the focus is not on the regular learning of rules but 
on repeated practice [18]. 


The platform offers material for three different user groups: teach- 
ers, students and guests. Teachers register themselves and their en- 
tire class. They assign appropriate tasks to their students, who work 
on the tasks. Teachers can view their students’ results via a dash- 
board. Additionally, any interested person can log in as a guest and 
complete tasks and tests. 


A special exercise form on the platform is the competence test, 
which determines competence levels in capitalization, punctuation 
and separated or combined spelling. Any identified knowledge 
gaps are visualized, and appropriate exercises are suggested. A pre- 
test, an intermediate test and a posttest are available and show im- 
provements made over time. For this study, we use only data from 
competence tests on capitalization, not regular training data, as the 
test’s standardized structure allows for better comparison. Moreo- 
ver, in competence tests, all sentences are new to users. 


3.2 Description of the dataset 

For this paper, anonymized, event-level competence test data from 
orthografietrainer.net from April 1, 2020 to November 17, 2020 are 
used. Each answer to a sentence corresponds to one record in our 
dataset. During the analyzed time period, schools in Germany, Aus- 
tria and Switzerland were closed for several weeks due to the 
COVID-19 pandemic. In this period, 46,356 users visited the online 
platform and completed a total of 65,645 capitalization task ses- 
sions. When processing the tasks, nearly 50% of the sentences were 
answered incorrectly, which means that the answers each contained 
at least one mistake. 


The platform was heavily used during the first wave of the COVID- 
19 pandemic. During the German school holidays in July and Au- 
gust, there was less practice; online training activity increased again 
in autumn. 


The dataset contains information about the class level and gender 
of users. The German school system includes grades 1 to 13, with
1 being the youngest children and 13 being the oldest (Figure 1). 


We decided to exclude all users in grades 1 to 4, as those learners
are not well represented in the data set and the difficulty of the cap- 
italization exercises is not adjusted for them. Students in grade 10 
and above are also excluded. Older students who are still assigned 
capitalization exercises are well behind the average learning path 
and thus represent a marginal group that would bias the data. Most 
of the users are in grade 7. Our dataset contains slightly more girls 
(51%) than boys (49%). 
Figure 1. German School System (simplified)


3.3 Using decision trees to replicate different cognitive strategies

The capitalization of German words depends on various grammat- 
ical categories, such as the beginning of sentences, word types and 
clauses. The 180 sentences in the competency test that deal with 
capitalization contain 2679 words that begin with either lowercase 
or uppercase letters. These 2679 words were manually categorized 
into 18 grammatical categories. After one-hot encoding of the la- 
bels, we obtained a data frame with dimensions of 2679 x 58. 
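One-hot encoding turns each categorical label into one 0/1 column per observed value, which is how 18 categories can expand into 58 columns. A minimal sketch with hypothetical category names and values (the paper does not list the values of each category, so the records below are purely illustrative):

```python
def one_hot(records):
    """Expand categorical labels into 0/1 columns named 'category=value'."""
    columns = sorted({f"{key}={value}" for record in records for key, value in record.items()})
    rows = []
    for record in records:
        active = {f"{key}={value}" for key, value in record.items()}
        rows.append([1 if column in active else 0 for column in columns])
    return columns, rows


# Hypothetical labels for three words; real data has 18 categories per word.
words = [
    {"word_type": "noun", "article": "yes", "concrete": "yes"},
    {"word_type": "verb", "article": "no", "concrete": "no"},
    {"word_type": "noun", "article": "no", "concrete": "no"},
]
columns, rows = one_hot(words)
print(columns)
for row in rows:
    print(row)
```

Each row has exactly one active column per category, so the number of columns equals the total number of distinct values across all categories.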


It cannot be assumed that people use only one strategy; instead, it 
is likely that each person uses different manifestations of a variety 
of strategies. To be able to analyze the students’ adoption rates of 
the strategies in regard to capitalization, we first needed to gain in- 
sight into how a student would process a word if he or she were 
only to use one strategy and had only one preferred learning type. 


For this purpose, decision trees were used to replicate the four cog- 
nitive strategies by attributing to them only the grammatical fea- 
tures corresponding to the given strategy. Afterwards, the sentences 
from the competence tests were predicted by the decision trees and 
then validated to determine whether the user classified the words 
correctly or incorrectly in terms of capitalization. This provided us 
with the error profiles that would result if only one of the four strat- 
egies were used to decide on proper capitalizations. Table 1 shows 
the strategies and their grammatical features. 


Table 1. Strategies with grammatical features

Strategy      | Grammatical features
Syntactic     | Clause, Article, 2nd person, Determinator,
              | Is prefix, Attribute, Complement of a
              | prepositional phrase, Beginning of sentence,
              | Core nominal pronoun
Semantic      | Concrete, Polite form, Semantic word type
Pragmatic     | Theme-Rheme, Attributable, Proper name,
              | as an attribute not separable from noun sequence
Morphological | Word type, Noun ending
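The idea of restricting each model to the features of one strategy can be illustrated as follows. Note the substitution: instead of a real decision tree, this sketch uses a majority-vote lookup table over feature combinations as a simple stand-in, and it is not the authors' implementation. The snake_case feature names, the reading of the pragmatic row as four features, and the toy words are assumptions.

```python
from collections import Counter, defaultdict

# Feature names follow Table 1; their spelling here is an assumption.
STRATEGY_FEATURES = {
    "syntactic": ["clause", "article", "second_person", "determinator", "is_prefix",
                  "attribute", "complement_of_prepositional_phrase",
                  "beginning_of_sentence", "core_nominal_pronoun"],
    "semantic": ["concrete", "polite_form", "semantic_word_type"],
    "pragmatic": ["theme_rheme", "attributable", "proper_name",
                  "attribute_not_separable_from_noun_sequence"],
    "morphological": ["word_type", "noun_ending"],
}


def fit_strategy(words, labels, strategy):
    """Fit a majority-vote lookup that only sees the features of one strategy."""
    features = STRATEGY_FEATURES[strategy]
    votes = defaultdict(Counter)
    for word, label in zip(words, labels):
        votes[tuple(word.get(f, 0) for f in features)][label] += 1
    fallback = Counter(labels).most_common(1)[0][0]
    table = {key: counts.most_common(1)[0][0] for key, counts in votes.items()}

    def predict(word):
        # Unseen feature combinations fall back to the overall majority label.
        return table.get(tuple(word.get(f, 0) for f in features), fallback)

    return predict


# Hypothetical toy words: Katze, Nacht, laufen, schnell.
words = [
    {"concrete": 1, "article": 1},
    {"concrete": 0, "article": 1},
    {"concrete": 0, "article": 0},
    {"concrete": 0, "article": 0},
]
labels = ["upper", "upper", "lower", "lower"]
semantic = fit_strategy(words, labels, "semantic")
syntactic = fit_strategy(words, labels, "syntactic")
print(semantic(words[1]), syntactic(words[1]))
```

In the toy data, the semantic model lowercases the abstract noun (the `*nacht` error pattern), while the syntactic model capitalizes it because of the article.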


In the decision trees, 77% of the words were processed correctly
by all four strategies. These are mostly words where users make
only a few mistakes, such as articles, pronouns, prepositions, and
conjunctions. They are not interesting for further analyses, as they
do not provide insights into differences between the strategies. The
beginnings of sentences are also filtered out because they are a
special case and cause bias in the data: in the structure of the exercises
on the online platform, the beginnings of sentences are in upper
case letters by default. Students rarely click on such words to
change the letter to lower case. However, as the beginning of a
sentence is a syntactical feature, only the syntactic strategy processes
these words correctly. Keeping the sentence beginnings in the
dataset would lead to bias, as most users would have a high
adoption rate of the syntactic strategy precisely because the sentence
beginnings are correct by default.


[Figure 1 content: columns Age, School, and Class level; school types Primary School (sometimes extended to Grade 6), Secondary School (First Phase), and Secondary School (Second Phase).]


Table 2. Percentage of correct words per strategy

Syntactic | Semantic | Morphological | Pragmatic
74.11%    | 32.14%   | 62.05%        | 79.69%


The distribution of the remaining 448 words shows that strategies 
have different success rates (Table 2). The semantic strategy only 
processes approximately 30% of the words correctly, while the syn- 
tactic and pragmatic approaches are much better. This is not sur- 
prising, as the meaning of a word is less informative for the deter- 
mination of capitalization than its grammatical use in a sentence. 
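The filtering step and the per-strategy success rates of Table 2 can be sketched as follows. The record layout (a `sentence_start` flag plus one 0/1 correctness flag per strategy) and the toy words are assumptions made for illustration.

```python
STRATEGIES = ("syntactic", "semantic", "morphological", "pragmatic")


def informative_words(words):
    """Drop sentence beginnings and words that every strategy gets right."""
    return [
        word for word in words
        if not word["sentence_start"] and not all(word[s] for s in STRATEGIES)
    ]


def percent_correct(words):
    """Percentage of words that each strategy capitalizes correctly."""
    if not words:
        raise ValueError("no words left after filtering")
    return {s: 100.0 * sum(word[s] for word in words) / len(words) for s in STRATEGIES}


# Hypothetical strategy outcomes for four words (1 = capitalized correctly).
words = [
    {"sentence_start": True, "syntactic": 1, "semantic": 0, "morphological": 0, "pragmatic": 0},
    {"sentence_start": False, "syntactic": 1, "semantic": 1, "morphological": 1, "pragmatic": 1},
    {"sentence_start": False, "syntactic": 1, "semantic": 0, "morphological": 1, "pragmatic": 1},
    {"sentence_start": False, "syntactic": 0, "semantic": 0, "morphological": 0, "pragmatic": 1},
]
kept = informative_words(words)
rates = percent_correct(kept)
print(len(kept), rates)
```

Only the last two toy words survive the filter, mirroring how the 2679 words in the paper shrink to 448 words on which the strategies actually disagree.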


Table 3. Sample of a merged data set

Word ID | User ID | Success | Syntactic | Semantic | Morphological | Pragmatic
255     | 452     | 1       | 1         | 1        | 1             | 0
256     | 128     | 0       | 1         | 0        | 0             | 1
257     | 427     | 1       | 0         | 0        | 0             | 1


In the next step, user data and the results from the decision trees are 
merged. The resulting data frame contains a word processed by a 
user in each row. For each word, there is information on whether 
the user capitalized the word correctly and how the decision tree 
models processed the item. Table 3 shows a sample of the resulting 
data set. In total, there are 1,355,641 records from more than 30,000 
users. 
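
The merge described above can be sketched as a join on the word ID. The example below is a hedged illustration only: the values mirror the sample rows of Table 3, while the field names and the `merge_records` helper are hypothetical.

```python
# User answers: one record per (word, user) pair; success = 1 if capitalized correctly.
user_records = [
    {"word_id": 255, "user_id": 452, "success": 1},
    {"word_id": 256, "user_id": 128, "success": 0},
    {"word_id": 257, "user_id": 427, "success": 1},
]

# Decision tree output per word: 1 if the strategy model processed the word correctly.
model_results = {
    255: {"syntactic": 1, "semantic": 1, "morphological": 1, "pragmatic": 0},
    256: {"syntactic": 1, "semantic": 0, "morphological": 0, "pragmatic": 1},
    257: {"syntactic": 0, "semantic": 0, "morphological": 0, "pragmatic": 1},
}


def merge_records(user_records, model_results):
    """Attach the four model flags to every user record that has a matching word."""
    merged = []
    for record in user_records:
        flags = model_results.get(record["word_id"])
        if flags is None:
            # Assumption: words without model output (e.g. filtered words) are skipped.
            continue
        merged.append({**record, **flags})
    return merged


merged = merge_records(user_records, model_results)
print(merged[0])
```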


4. RESULTS 


To answer the first research question “Which cognitive strategies
for capitalization are used by learners in grades 5 to 9”, we compare
users’ error profiles with the error profiles of the decision tree
classifiers. To this end, we calculated the percentage of answers that
matched. The adoption rate was calculated by dividing the sum of
matching responses by the sum of processed words for each
cognitive strategy. In this calculation, we did not consider whether the
word was capitalized correctly. Instead, the result expresses only
whether the words were processed in the same way by a user and
by one of the four models.
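
A minimal sketch of the adoption rate calculation follows. It assumes that a user response and a model response "match" when their correctness flags are equal: because capitalization is a binary choice, two answers that are both correct or both incorrect are the same answer. The rows reuse the sample values of Table 3; the function name is hypothetical.

```python
STRATEGIES = ("syntactic", "semantic", "morphological", "pragmatic")

# Sample rows in the layout of Table 3 (success = user flag, others = model flags).
rows = [
    {"success": 1, "syntactic": 1, "semantic": 1, "morphological": 1, "pragmatic": 0},
    {"success": 0, "syntactic": 1, "semantic": 0, "morphological": 0, "pragmatic": 1},
    {"success": 1, "syntactic": 0, "semantic": 0, "morphological": 0, "pragmatic": 1},
]


def adoption_rates(rows):
    """Share of processed words on which the user and each strategy model agree."""
    rates = {}
    for strategy in STRATEGIES:
        # Assumption: equal correctness flags mean the same upper/lower case answer.
        matches = sum(1 for r in rows if r["success"] == r[strategy])
        rates[strategy] = matches / len(rows)
    return rates


rates = adoption_rates(rows)
print(rates)
```

In this toy sample the second row counts as a match for the semantic and morphological models even though the user was wrong, which reflects that correctness is deliberately ignored in the calculation.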


Table 4 presents the average adoption rate per strategy in
percentages. The models implementing the syntactic, morphological and
pragmatic strategies were in alignment with the users’ answers for,
on average, 65% to 72% of the words. However, the result for the
semantic strategy matched only approximately 40% of users’
answers.


Table 4. Adoption rate by strategy

Semantic | Syntactic | Morphological | Pragmatic
39.92% | 65.33% | 66.32% | 72.27%

When interpreting the results, it must be considered that several
strategies can be used simultaneously when answering a task. This
is always the case if the word cannot be answered exclusively by
one strategy. Thus, the adoption rates of the four strategies sum to
more than 100%.


4.1 Success rates 

Thus far, we have primarily discussed the adoption of the four
capitalization strategies. Now, we will examine the successful
application of strategies for determining correct capitalization. Figure 2
shows the correlation between the adoption rates and the success
rates per strategy.


[Figure 2 panels: Strategy = Semantic; Strategy = Pragmatic]

Figure 2. Correlation of success rate and adoption rate by
strategy


The adoption of pragmatic, syntactic and morphological strategies 
led to increased success rates. The correlation is strongest for the 
pragmatic strategy. In contrast, the higher the share of words solved 
in agreement with the semantic strategy, the lower the success rate 
was. These correlations also exist when grade levels are considered 
in isolation. The success rates of the different strategies are also 
similar across grade levels. 
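
The relationship between adoption and success can be quantified per strategy with a correlation coefficient. The sketch below assumes Pearson's r computed over per-user values; the text does not state which coefficient was used, and the numbers are invented purely to illustrate a positive and a negative relationship.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient of two equally long sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)


# Hypothetical per-user values (one entry per user), not taken from the data set.
success_rate = [0.60, 0.70, 0.75, 0.90]
pragmatic_adoption = [0.55, 0.65, 0.70, 0.80]
semantic_adoption = [0.50, 0.45, 0.40, 0.30]

r_pragmatic = pearson(pragmatic_adoption, success_rate)
r_semantic = pearson(semantic_adoption, success_rate)
print(round(r_pragmatic, 3), round(r_semantic, 3))
```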


Success rates by class level and gender show that students in
higher grades tended to have lower success rates (Figure 3). While
grades 5 to 7 had similar success rates, these declined from grade 8
onwards. The lowest success rates were found in grade 9. There is
a very small difference between male and female success rates;
however, in grades 7 to 9, male students correctly capitalized fewer
words than female students did. It is possible that the data in these
years reflect cognitive strategy shifts and corresponding temporary
uncertainties.

Figure 3. Success rates by class level and gender


4.2 Distribution by class level and gender 

The second research question “How does the use of capitalization 
strategies differ by grade level or gender” is addressed by Figure 4. 
Looking at the distribution of the average adoption rate by strategy 
and grade level, we see that preferred strategies evolve over time 
and shift according to gender. 


[Figure 4: adoption rate by class level (5 to 9), with lines per strategy (Syntactic, Semantic, Morphological, Pragmatic) and gender]

Figure 4. Distribution of adoption rates by class level and gender


The rate of adoption of the pragmatic strategy is very high from the 
beginning until it decreases sharply after grade 7 for girls and after 
grade 6 for boys. This is interesting, as the pragmatic strategy is the 
only strategy that is not explicitly taught in school even though it is 
very useful for determining correct capitalization (Figure 2). The 
pragmatic strategy is only surpassed in frequency by the syntactic 
strategy in grade 9, and the latter increases in use with every grade. 
Although the use of the syntactic strategy increases more for girls
than for boys, in both cases, it ends up being on par with the
pragmatic strategy.


Apparently, this reflects stronger grammatical skills among older
students. Learners often start a second foreign language in grade 7
(usually Spanish or French), which increases the need for
understanding grammatical concepts that are less explicit in their first
foreign language, English. At the same time, usage of the
morphological strategy also decreases from grade 7 onwards (as early as
grade 6 for boys). These findings fit the students’ learning
biography, as grammatical instruction progresses from morphological
to syntactic issues, and therefore orthographic instruction focuses
on morphological strategies first. The adoption rate of the semantic
strategy decreases until grade 7 but then increases again. This fits
with the results regarding the success rate of the semantic strategy,
which shows a weakening of knowledge from grade 8 onwards. The
increase in semantic strategy use thus goes hand in hand with the
students’ lower success rates.


Looking at the differences in gender, we have already seen in
Figure 3 that boys in grades 7 to 9 answer fewer words correctly than
girls. If we now look at the use of the strategies by boys and girls
in Figure 4, we see that boys, especially in grades 7 to 9, use the
semantic strategy more frequently, which is the least successful
strategy and whose use correlates negatively with the success rate.
Girls, on the other hand, use the other three strategies more
frequently during this period, which correlate positively with the
success rate.


In summary, we can again identify a difference between the
semantic strategy and the other strategies. Even though the semantic
approach is taught first, most learners do not adopt it for subsequent
learning. The adoption rates of the pragmatic and morphological
strategies decrease, while the syntactic strategy adoption rate
increases. However, the pragmatic approach, which is rarely taught,
is applied most frequently.


5. DISCUSSION 


We used event-level learning data from an online spelling trainer
to analyze cognitive strategies used by students for processing
German language capitalization tasks. We built four decision tree
classifiers to model capitalization strategies that use only syntactic,
semantic, morphological or pragmatic features. As expected, as
grammatical information in language is redundant, models often
produce overlapping results. We compared the models’ output to
user error profiles. We found that the strategies are adopted to
different degrees and that strong correlations (both positive and
negative) between the adoption rates of strategies exist.


Furthermore, the distribution of adoption rates by grade level shows
that strategies are represented among older and younger teenagers
to different degrees. This variation by grade level is particularly
interesting when compared to the rules taught at school, which
answers the third research question “How do the predominant
capitalization strategies used by learners compare to the strategies
taught in school?”. The first capitalization strategy taught at school
is the semantic strategy: things that can be touched have to be
capitalized. Even though this is taught first, students follow it only
partly, and rightly so, as the semantic strategy is the least
successful in determining correct capitalization. The pragmatic strategy
(capitalizing a word if it occurs in a typical textual context for
nouns), however, is the only one that is not taught explicitly in
school. Nevertheless, this is the strategy with the highest adoption
rate and with the highest success rate in our research. The syntactic
strategy presupposes a deeper understanding of grammar than the
semantic and pragmatic strategies and thus increases with grade
level. Although the syntactic strategy and the grammatical
knowledge required for employing it begin to be introduced in
grade 5, it is only later that students apply it. This may be because
students’ actual understanding of German grammar increases when
they begin learning a second foreign language in grade 7. Since
many grammatical concepts are not present in English, a deeper
engagement with grammar might only begin when students begin
learning a second foreign language. This could lead to a different
way of looking at spelling, which is then reflected in the use of the
syntactic strategy. The use of the morphological strategy decreases
over time as the use of the syntactic strategy increases.


When considering the success rates in combination with the
adoption rates, it is particularly interesting that the semantic strategy
adoption rate correlates negatively with success rate. This again
shows that the teaching of the semantic strategy as the basic rule
does not lead to success. The pragmatic strategy adoption rate
shows the strongest positive correlation with the success rate.


6. CONCLUSION 


In this paper, we have contributed to three aspects of learning
analytics. We have identified cognitive strategies of learners using
error analyses, compared adoption rates and drawn conclusions for
curriculum development from the results.


First, we were able to model cognitive strategies for solving
German language capitalization tasks. The four strategies (syntactic,
semantic, morphological and pragmatic) do partially overlap. We
have shown that the different learning strategy adoption rates can
be observed in user error profiles (RQ1). This opens up
opportunities for individualized training and therefore for higher motivation
and learning success for students.


Second, we found that learners prefer different strategies depending
on their grade level and gender (RQ2). This information can be
used to adapt the online platform orthografietrainer.net to various
learner levels. For example, based on this information, the
difficulty of the words can be calculated more specifically for each user,
and task sequencing can be adjusted to be neither too difficult nor
too easy. This reduces the potential for frustration caused by tasks
that are too difficult and also increases motivation. Furthermore,
with tasks that represent typical sources of error for a user, the
platform could display appropriate tips and hints. If the error analysis
results are made available to the teacher on the dashboard of the
online platform, he or she can see which rules have not yet been
observed by the students and can adapt lessons accordingly. Further
research could include the implementation and subsequent
validation through A/B testing of such improvements.


Finally, our findings lead to a better understanding of how
capitalization is learned and taught (RQ3). Our research shows that there
is a great discrepancy between which strategies are taught in class
and which strategies are used by students. We therefore suggest that
highly granular learning process data can not only provide insights
into learners’ abilities and enable individualized learning content
but also inform curriculum development.


Other future analyses could investigate whether the learning
strategies can be applied to other grammatical areas, such as separated
and combined spelling.

ABSTRACT


Digitalization and automation of test administration, score 
reporting, and feedback provision have the potential to benefit 
large-scale and formative assessments. Many studies on automated 
essay scoring (AES) and feedback generation systems were 
published in the last decade, but few connected AES and feedback 
generation within a unified framework. Recent advancements in 
machine learning algorithms enable researchers to develop more 
models that explore the potential of automated assessments in 
education. This study makes the following contributions. First, it
implements, compares, and contrasts three AES algorithms with
word-embedding and deep learning models (CNN, LSTM, and
Bi-LSTM). Second, it proposes a novel automated feedback
generation algorithm based on the Constrained Metropolis-Hastings
Sampling (CGMH). Third, it builds a classifier to
integrate AES and feedback generation into a systematic
framework. Results show that (1) the scoring accuracy of the AES 
algorithm outperforms that of state-of-the-art models; and (2) the 
CGMH method generates semantically-related feedback sentences. 
The findings support the feasibility of an automated system that 
combines essay scoring with feedback generation. Implications 
may lead to the development of models that reveal linguistic 
features, while achieving high scoring accuracy, as well as to the 
creation of feedback corpora to generate more semantically-related 
and sentiment-appropriate feedback. 


Keywords 


Automated essay scoring, deep learning, feedback generation, 
assessment, machine learning, natural language processing 


1. INTRODUCTION 


Automatic essay scoring (AES), the task of machine-grading essays
or constructed-response items, has been gaining attention due to
technology-powered advances in educational assessment [16]. The
goal of AES is to produce reliable and valid scores using machine
scoring rather than human scoring [43]. Previous research has made
advances in automatically grading essays with handcrafted features
[16, 30, 40]. Currently, with the availability of large volumes of
trainable corpora extracted online and the development of models
for word representation in Natural Language Processing (NLP),
deep learning approaches have produced highly reliable scores
using text classification methods [10, 12, 22]. However, few studies
have combined automated essay scoring and automated feedback
generation to achieve a fully automated computer-based testing
(CBT) system.


Earlier attempts at implementing feedback have been made using 
real-time online tutoring by humans [18, 31]. Findings show that 
human tutoring is effective at improving students’ performance, but 
it is time consuming and labor intensive. Also, human tutoring is 
not applicable to large-scale practice and open-ended platforms 
with large numbers of students. Research on automated feedback
generation emerged in the last decade to fill this gap by developing
tools to scaffold students within computer-based testing
environments [24, 36]. Previous studies have focused on generating
formative feedback using rule-based approaches [3, 38]. Although 
rule-based feedback generation is relatively easy to achieve and the 
generated sentences can be considered to be appropriate feedback, 
this approach is usually restricted to pre-designed templates. 
Recent efforts have been made to engage students in more 
communicative and adaptive environments and to propose 
feedback-generation frameworks using sentence generation with 
constraints, where the constraints are often defined by domain- 
specific terms [8, 11]. Nevertheless, few studies have empirically 
examined automated language generation in CBTs. This study 
proposes a framework that introduces an algorithm based on deep 
learning models with an unsupervised sentence-generation 
approach to automatically grade essays and to generate feedback. 


2. RELATED WORK 
2.1 Automated Essay Scoring 


Automated essay scoring constitutes the task of automatically 
assigning scores to written essays based on features or 
characteristics in the text. Several systems for Automated Essay 
Scoring (AES) have already been developed and used in large-scale 
high-stakes assessments for several decades. Page [33] designed 
the first intelligent scoring system, Project Essay Grade (PEG), 
using simple linear regression with hand-crafted features such as 
essay length and proposition counts to perform text classification 
tasks based on these features. Since then, other systems for 
automated essay scoring emerged such as Intelligent Essay 
Assessor [25], e-rater [6], IntelliMetric [41], and My Access! [41]. 
Several AES methods have been later adopted to make predictions 
on student writing scores. Yannakoudakis et al. [44] approached 
AES as a preference-ranking problem and evaluated essays based 
on pairwise comparisons of features, such as POS n-grams features 
and complex grammatical features. Gierl et al. [16] demonstrated 
the application of AES in medical exams with Support Vector 
Machine (SVM). Phandi et al. [34] approached AES with Bayesian
Linear Ridge Regression. Taghipour and Ng [40] designed an
‘Enhanced AI Scoring Engine’ (EASE) based on four types of
features: length-based features, Parts-of-Speech (POS), word-prompt
overlap, and bag of n-grams. These features were fed into
several model architectures such as Convolutional Neural Network
(CNN) and the Recurrent Neural Network (RNN) variants, namely
Long Short-Term Memory (LSTM) and Bidirectional LSTM
(Bi-LSTM), to compare the prediction performance of the models.
Phandi et al.’s [34] attempt to employ deep learning models to 
predict essay scores was later used as a baseline for related studies. 
Previous studies on automated essay scoring rely heavily on hand- 
crafted feature engineering and knowledge on linguistic discourse. 
Inspired by recent advances of deep learning models and word 
embedding techniques, a substantial body of literature has emerged, 
which has contributed to applying deep learning methods in 
automated essay scoring tasks. Alikaniotis et al. [1] implemented a
two-layer bidirectional LSTM with score-specific word
embeddings to learn essay representations and conduct AES tasks.
The proposed model outperformed the baseline SVM model. Later, 
Dong and Zhang [9] established a three-layer model architecture 
combining CNN for character representation and LSTM for 
sentence representation with an extra attention-pooling layer, 
which performed better than Taghipour and Ng’s [40] model and 
their two-layer CNN model. Taghipour and Ng’s attempt of 
combining feature engineering and deep learning models inspired 
later trials of applying word embeddings and deep learning 
methods on AES. 


2.2 Automated Feedback Generation 


Providing feedback is a key ingredient in performance 
improvement. In education, feedback is defined as the information 
provided by an agent regarding aspects of one’s performance or 
understanding [17]. High-quality personalized and timely feedback 
can improve learners’ performance [17], but feedback provision is
often reported as a long-standing weakness of intelligent tutoring
systems (ITSs) and computer-based assessment systems [27]. On the one hand,
students complain that they receive too little quality feedback in the 
process of learning [5, 13]. On the other hand, students are reported 
to misuse and abuse the feedback or hints provided by the ITSs 
[37]. Thus, knowing how and when to provide real-time 
personalized feedback that guides and motivates students’ learning 
remains a challenge. 


Williams and Dreher [42] advocated the potential of fully- 
automated systems that perform both scoring and feedback 
provision with machines in tasks such as essay grading. Previous 
efforts have been made to produce feedback in intelligent tutoring 
or assessment systems [7, 19, 21] for various disciplines, such as 
computer science [14, 23], information and communication 
technology (ICT [7]), and English as a Second Language (ESL 
[26]). However, most automated assessment systems adopt a 
template-based method to generate feedback [4, 26, 42], which 
usually produces feedback that is limited to fixed expressions. 


Recent advances in constrained sentence generation shed some
light on flexible feedback generation. For example, Su et al. [39]
proposed a Gibbs sampling method to meet the constraints of
sentiment control. However, Gibbs sampling is not able to vary the
sentence length or handle keywords when generating the sentence.
Miao et al. [32] extended Gibbs sampling to a novel unsupervised
sampling approach, named Constrained Generation by
Metropolis-Hastings sampling (CGMH). The CGMH is a subtype of the
Markov Chain Monte Carlo (MCMC [15]) methods. The CGMH
allows for more flexible operations on word tokens in a sentence
space, which makes it easier to generate content with constraints and
varying sentence lengths. Miao et al. [32] tested the CGMH on
three tasks, including keywords-to-sentence generation with hard
constraints, paraphrase, and error correction with soft constraints.
The CGMH method outperformed state-of-the-art sentence-generation
algorithms. Yet, one of the reasons that research on automated
feedback generation with NLP is lagging behind may be that there
is no publicly available feedback corpus.


2.3 Present Study 


We propose an AES and an automated feedback generation 
framework to support students’ performance. Specifically, the 
current study implements three deep learning models for automated 
essay scoring: (1) CNN; (2) CNN and LSTM; and (3) CNN and Bi- 
LSTM. In addition, a novel unsupervised sentence-generation 
approach uses CGMH to automatically provide feedback for test 
takers based on their predicted essay scores. The remaining sections 
are guided by the following research questions: 


1. To what extent can the AES algorithms generate accurate 
performance on essay scoring? 


2. To what extent can the CGMH algorithms generate fluent and 
semantically-related feedback? 


The contributions of the present study are three-fold. First, the 
study advances computer-based testing by incorporating automated 
feedback generation into the assessment framework, especially for 
unstructured text (e.g., essays). Second, the flexible unsupervised 
learning approach creates a corpus of semantically-related and 
sentiment-appropriate feedback for scaffolding. Third, the scalable 
automatic assessment and feedback provision system is automated 
and performs accurately, which paves the way for future 
implementations of feedback generation for various domains 
within intelligent tutoring systems. 


3. METHOD 


3.1 Datasets and Corpus 

The dataset for the AES task was retrieved from a Kaggle challenge 
named Automated Student Assessment Prize (ASAP) sponsored by 
the Hewlett Foundation in 2012 and detailed in Table 1. 


Table 1. Summary of ASAP dataset

Prompt | Genre | Grade Level | Training set size | Score Range | Ave Length
1 | persuasive / narrative / expository | 8 | 1783 | 2-12 | 350
2 | persuasive / narrative / expository | 10 | 1800 | 1-6 | 350
3 | source dependent | 10 | 1726 | 0-3 | 350
4 | source dependent | 10 | 1772 | 0-3 | 150
5 | source dependent | 8 | 1805 | 0-4 | 150
6 | source dependent | 10 | 1800 | 0-4 | 150
7 | persuasive / narrative / expository | 7 | 1569 | 0-30 | 250
8 | persuasive / narrative / expository | 10 | 723 | 0-60 | 650


The ASAP is the benchmark dataset for piloting AES studies. It
consists of 8 prompts and 4 genres, including persuasive, narrative,
expository, and source-dependent responses. In total, 12,979 essays
were released. Since the ASAP has not made the official test sets
publicly available, we used 60% of the training set for training, 20%
for validation, and 20% for testing. We first performed text
cleaning, tokenization, and padding. Then, we used Stanford’s
publicly available GloVe 300-dimensional model to obtain word
embeddings [35]. The GloVe 300-dimensional embeddings were
trained on 6 billion words scraped from Wikipedia and other web
texts. The writing prompts have different score ranges, as shown in
Table 1. To address the issue of inconsistent score ranges, we
followed the method of Phandi et al. [34], Taghipour and Ng [40],
and Dong and Zhang [9] by approaching the AES as a regression
task, rescaling the essay scores to [0, 1] in the training, validation,
and test stages, and projecting the scores back to their original
scales in the evaluation stage.
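
The rescaling step can be sketched as a per-prompt min-max transformation. The score ranges below are those listed in Table 1; the function names, the clipping of out-of-range predictions, and the rounding to the nearest integer score are assumptions made for this illustration.

```python
# Score range (minimum, maximum) per prompt, as listed in Table 1.
SCORE_RANGES = {
    1: (2, 12), 2: (1, 6), 3: (0, 3), 4: (0, 3),
    5: (0, 4), 6: (0, 4), 7: (0, 30), 8: (0, 60),
}


def to_unit(score, prompt):
    """Rescale a raw essay score to [0, 1] for regression training."""
    low, high = SCORE_RANGES[prompt]
    return (score - low) / (high - low)


def to_original(unit_score, prompt):
    """Project a predicted value in [0, 1] back to the prompt's original scale."""
    low, high = SCORE_RANGES[prompt]
    # Assumption: predictions outside [0, 1] are clipped before projection.
    clipped = min(1.0, max(0.0, unit_score))
    # Assumption: the projected value is rounded to the nearest integer score.
    return int(round(low + clipped * (high - low)))


print(to_unit(7, prompt=1))        # a score of 7 on the 2-12 scale
print(to_original(0.5, prompt=1))  # projected back to the 2-12 scale
```

Rescaling per prompt lets a single regression output layer serve all eight prompts, while the inverse projection keeps the evaluation on each prompt's native scale.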


The corpus used to train language models for sentence generation 
consisted of the publicly-available IMDB dataset, which contains 
25,000 positive reviews and 25,000 negative reviews. The dataset 
was split into three parts: the training set consisted of 20,000 
negative reviews and 20,000 positive reviews, the validation set
consisted of 1,250 negative and 1,250 positive reviews, and the test
set consisted of 1,250 negative and 1,250 positive reviews. A third- 
party corpus, the Reuters corpus from NLTK, was used for 
evaluation of the quality of the generated sentences. 


3.1.1 AES Step 

The present study was conducted in two steps. Step 1 addressed 
automated essay scoring, whereas Step 2 addressed automated 
feedback generation. An essay performance classifier was added to 
synthesize the two steps into a unified framework. 


In the AES task, three deep-learning algorithms were implemented 
and compared regarding their performance and efficiency to select 
the optimal algorithm as the foundation of the feedback generation 
step: CNN, CNN + LSTM, and CNN + Bi-LSTM. The 
convolutional layer is seen as a function that could learn features 
from n-grams, and can be represented as: 


$$Z_i = f(W_z \cdot [x_i : x_{i+h_w-1}] + b_z),$$


where $x_i$ is the $i$th embedded word, $W_z$ is the weight matrix, $b_z$ is
the bias vector, $h_w$ is the window size of the convolutional layer, $f$
is a non-linear activation function (i.e., sigmoid, tanh, or ReLU),
and $Z_i$ is the output of feature representation.
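
As a minimal sketch of this operation, the function below slides a window over a sequence of word embeddings, concatenates the vectors inside each window, and applies a single filter followed by a non-linearity. It uses one filter, tanh, and tiny made-up embeddings; the actual models use 100 filters with kernel size 5 on 300-dimensional embeddings, and all names here are hypothetical.

```python
import math


def conv1d_features(embeddings, weights, bias, window, activation=math.tanh):
    """Apply one convolutional filter to every window of consecutive word vectors."""
    outputs = []
    for i in range(len(embeddings) - window + 1):
        # Concatenate the word vectors of the current window into one flat vector.
        flat = [value for vector in embeddings[i:i + window] for value in vector]
        pre_activation = sum(w * v for w, v in zip(weights, flat)) + bias
        outputs.append(activation(pre_activation))
    return outputs


# Four words with 2-dimensional toy embeddings; window size 2 gives 4 weights.
embeddings = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [0.0, 0.0]]
weights = [0.5, -0.5, 0.25, 0.25]
features = conv1d_features(embeddings, weights, bias=0.0, window=2)
print(features)
```

Each output value summarizes one n-gram of the essay, which is why the layer can be read as a learned n-gram feature extractor.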


LSTM is an RNN model for processing sequence data [20]. The 
unit or memory cell of LSTM consists of an input gate, a forget 
gate, and an output gate to control information flow. The gates 
decide preserving, forgetting, and passing information as a vector 
sequence at each time step. 


More specifically, assuming there are T sentences in an essay in
total, the composite functions at sentence t can be written as:

$$i_t = \sigma(W_i s_t + U_i h_{t-1} + b_i),$$
$$f_t = \sigma(W_f s_t + U_f h_{t-1} + b_f),$$
$$g_t = \tanh(W_g s_t + U_g h_{t-1} + b_g),$$
$$c_t = i_t \odot g_t + f_t \odot c_{t-1},$$
$$o_t = \sigma(W_o s_t + U_o h_{t-1} + b_o),$$
$$h_t = o_t \odot \tanh(c_t),$$

in which $s_t$ is the input vector, $h_t$ is the output vector, $W_i$, $W_f$,
$W_g$, $W_o$, $U_i$, $U_f$, $U_g$, $U_o$ are the estimated weight matrices, and
$b_i$, $b_f$, $b_g$, $b_o$ are the bias vectors.
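
A single LSTM time step following these gate equations can be sketched in plain Python. This is an illustrative toy, not the trained model: the parameter layout (`params` mapping each gate to a weight matrix for the input, a weight matrix for the previous output, and a bias vector) and the zero-valued weights are assumptions chosen so that the result is easy to verify by hand.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def matvec(matrix, vector):
    """Matrix-vector product for lists of lists."""
    return [sum(m * v for m, v in zip(row, vector)) for row in matrix]


def lstm_step(s_t, h_prev, c_prev, params):
    """One LSTM step; params maps gate name ('i', 'f', 'g', 'o') to (W, U, b)."""
    def pre_activation(gate):
        W, U, b = params[gate]
        return [x + h + bias for x, h, bias in zip(matvec(W, s_t), matvec(U, h_prev), b)]

    i_t = [sigmoid(z) for z in pre_activation("i")]    # input gate
    f_t = [sigmoid(z) for z in pre_activation("f")]    # forget gate
    g_t = [math.tanh(z) for z in pre_activation("g")]  # candidate cell values
    o_t = [sigmoid(z) for z in pre_activation("o")]    # output gate
    c_t = [i * g + f * c for i, g, f, c in zip(i_t, g_t, f_t, c_prev)]
    h_t = [o * math.tanh(c) for o, c in zip(o_t, c_t)]
    return h_t, c_t


# Toy setting: input dimension 2, hidden dimension 1, all weights and biases zero.
zero_gate = ([[0.0, 0.0]], [[0.0]], [0.0])
params = {gate: zero_gate for gate in "ifgo"}
h, c = lstm_step([1.0, -1.0], h_prev=[0.0], c_prev=[2.0], params=params)
print(h, c)
```

With zero weights every sigmoid gate equals 0.5 and the candidate equals 0, so the new cell state is half of the previous one, which makes the role of the forget gate visible.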


Bi-LSTM is an extension of the unidirectional LSTM for deeper
representations. Compared with the unidirectional LSTM, which can
only preserve and pass information from history, Bi-LSTM can also
make use of information from the future. In AES tasks, Bi-LSTM can
process the words in the input vector in both a forward and a
backward manner. The composite function for Bi-LSTM is similar
to that of LSTM:

$$h_t = W_h [\overrightarrow{h}_t ; \overleftarrow{h}_t] + b_h.$$


The summary of the model architectures for the three models is 
shown in Table 2. 


Table 2. Model architecture summary

Model | Layer | Hyperparameter | Value
CNN | Embeddings | dimension | 300
CNN | Convolutional | filters, kernel size | 100, 5
CNN + LSTM | Embeddings | dimension | 300
CNN + LSTM | Convolutional | filters, kernel size | 100, 5
CNN + LSTM | LSTM | units | 32
CNN + Bi-LSTM | Embeddings | dimension | 300
CNN + Bi-LSTM | Convolutional | filters, kernel size | 100, 5
CNN + Bi-LSTM | Bi-LSTM | layers | 16


3.1.2 Feedback Generation Step

The feedback generation phase included two steps. In Step 1 
(Corpus Development), we will develop a corpus of feedback using 
CGMH based on the expert-derived essay descriptors. In Step 2 
(Feedback Generation & Provision), we will develop feedback 
based on the essay scores provided by the AES algorithms. 


Table 3 shows part of the essay-scoring rubrics (scores ranged
from 1 to 6) and descriptors developed by experts.


Table 3. Sample descriptor for Essay Prompt 1

Score | Descriptors
1 | An undeveloped response that may take a position but offers no more than very minimal support.
Element | Contains few or vague details.
        | Is awkward and fragmented.
        | May be difficult to read and understand.
        | May show no awareness of audience.
2 | An under-developed response that may or may not take a position.
Element | Contains only general reasons with unelaborated and/or list-like details.
        | Shows little or no evidence of organization.
        | May be awkward and confused or simplistic.
        | May show little awareness of audience.


To expand the corpus, we will adopt the Constrained Sentence 
Generation by Metropolis-Hastings Sampling method (CGMH 
[32]) to perform unsupervised paraphrase generation. The CGMH 
facilitates the generation of content with constraints and varying 
sentence lengths. Miao et al. [32] tested the CGMH on three tasks, 
including keywords-to-sentence generation with hard constraints, 
paraphrase, and error correction with soft constraints. In the present 
research, we will implement unsupervised paraphrasing to augment 
the feedback corpus. Specifically, we will first train a language 
model based on the IMDB review corpus [29]. The IMDB dataset 
consists of 25,000 positive and 25,000 negative movie reviews. It 
was selected for the feedback-generation task for the following 
reasons. First, to date, there is no database of academic feedback,
and the IMDB was the closest commentary corpus available. More
importantly, this corpus is split into positive and negative phrases,
which makes it domain-independent. Thus, it can transfer more 
easily to other domains. Then, we will perform the paraphrase 
generation. 


A Markov model is used to train the language model on the selected
corpus. The Markov chain is commonly used to model natural
language under the assumption that the probability of a word
appearing in position n depends only on the previous $z \in [1, n-1]$
words, such that:

$$p(w_1, w_2, \ldots, w_n) = p(w_1)\, p(w_2 \mid w_1) \cdots p(w_n \mid w_{n-z}, \ldots, w_{n-1}),$$

where $p(w_1, w_2, \ldots, w_n)$ refers to the probability of a specific
sentence based on the trained corpus, that is, the joint probability
of all words within the sentence. In the present research, we used
forward-backward dynamic programming to train the language
model.
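
To illustrate the Markov factorization of a sentence probability, the sketch below uses a count-based bigram model, that is, a context of one previous word. This is a deliberately simplified stand-in rather than the language model trained in the study; the three-sentence corpus, the `<s>` start token, and the function names are hypothetical.

```python
from collections import Counter


def train_bigram(sentences):
    """Count contexts and bigrams; '<s>' marks the start of a sentence."""
    contexts, bigrams = Counter(), Counter()
    for sentence in sentences:
        tokens = ["<s>"] + sentence.split()
        contexts.update(tokens[:-1])
        bigrams.update(zip(tokens[:-1], tokens[1:]))
    return contexts, bigrams


def sentence_prob(sentence, contexts, bigrams):
    """Joint probability of a sentence as a product of conditional word probabilities."""
    tokens = ["<s>"] + sentence.split()
    prob = 1.0
    for prev, word in zip(tokens[:-1], tokens[1:]):
        if contexts[prev] == 0:
            return 0.0  # unseen context: no smoothing in this toy example
        prob *= bigrams[(prev, word)] / contexts[prev]
    return prob


corpus = ["the essay is clear", "the essay is vague", "the plot is clear"]
contexts, bigrams = train_bigram(corpus)
p_clear = sentence_prob("the essay is clear", contexts, bigrams)
p_vague = sentence_prob("the plot is vague", contexts, bigrams)
print(p_clear, p_vague)
```

The second sentence never occurs in the toy corpus but still receives a non-zero probability, because each of its word transitions was observed; this is the property that lets a language model score newly generated paraphrases.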


In Step 2 (feedback paraphrase), we performed the CGMH task of
unsupervised sentence paraphrasing. CGMH is defined by a target
stationary distribution over sentences and three actions, namely,
replacement, insertion, and deletion. Specifically, $\pi(x)$ was set as
the distribution from which we plan to sample sentences, where x
denotes a particular sentence and $x_0$ refers to the feedback template
that is fed to the algorithm at time step 0. The MH sampler either
accepts or rejects a proposed word so that the resulting samples
follow the predefined stationary distribution $\pi(x)$. The process is
intuitive, as it mainly involves two outcomes, accepting or rejecting
a word, governed by the acceptance rate $\alpha$:

$$\alpha = \min\left\{1, \frac{\pi(x')\, g(x_{t-1} \mid x')}{\pi(x_{t-1})\, g(x' \mid x_{t-1})}\right\}$$

At time step t, word sampling is conducted to update the
previous state $x_{t-1}$ to a candidate sentence $x'$ drawn from a proposal
distribution $g(x' \mid x_{t-1})$, where $x_{t-1}$ refers to the sentence from
the previous step (t-1); if the candidate is accepted, $x_t = x'$, and
otherwise $x_t = x_{t-1}$. Therefore, $\alpha$ determines the
acceptance or rejection of a sample. In our paraphrase generation,
the desired distribution favors the most likely and logical sentence
related to the original sentence.
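
The accept/reject rule can be sketched in a few lines. This is a generic Metropolis-Hastings step under stated assumptions, not the CGMH implementation used in the study: `pi` is any function returning an unnormalized sentence score, `g(a, b)` is a hypothetical function returning the probability of proposing state `a` from state `b`, and the zero-denominator convention is our own choice.

```python
import random


def acceptance_rate(pi_new, pi_old, g_backward, g_forward):
    """alpha = min(1, pi(x') g(x_{t-1}|x') / (pi(x_{t-1}) g(x'|x_{t-1})))."""
    denom = pi_old * g_forward
    if denom == 0.0:
        return 1.0  # assumption: always leave a state that has zero mass
    return min(1.0, (pi_new * g_backward) / denom)


def mh_step(current, candidate, pi, g, rng=random):
    """Return the next state: the candidate if accepted, else the current state."""
    alpha = acceptance_rate(
        pi(candidate),          # pi(x')
        pi(current),            # pi(x_{t-1})
        g(current, candidate),  # g(x_{t-1} | x'), the reverse proposal
        g(candidate, current),  # g(x' | x_{t-1}), the forward proposal
    )
    return candidate if rng.random() < alpha else current
```

With a symmetric proposal the two `g` terms cancel, so a candidate that is more probable under `pi` than the current sentence is always accepted.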


At each step, a selected word in the sentence is updated by one of
the actions (insertion, deletion, or replacement), chosen at random
with respective probabilities $[p_{insert}, p_{delete}, p_{replace}]$. At
the first time step, these probabilities are set as being equal. At the
following step, if Replacement is applied on a selected word $w_m$ in
a sentence $x = [w_1, \ldots, w_{m-1}, w_m, w_{m+1}, \ldots, w_n]$, then the
conditional probability of choosing $w_m^{new}$ to replace $w_m$ to form
candidate sentence $x'$ from x can be computed as:

$$g_{replace}(x' \mid x) = \pi(w_m^{new} \mid x_{-m}) = \frac{\pi(w_1, \ldots, w_{m-1}, w_m^{new}, w_{m+1}, \ldots, w_n)}{\sum_{w \in V} \pi(w_1, \ldots, w_{m-1}, w, w_{m+1}, \ldots, w_n)},$$

where V refers to the vocabulary, and $w_m$ is the selected word. If,
on the other hand, Insertion is applied, a placeholder is first
inserted at the selected position, and a real word is then sampled
to replace the placeholder token using the Replacement action.
Finally, if Deletion is applied, the selected word $w_m$ is deleted, and
$g_{delete}(x' \mid x) = 1$ if $x' = [w_1, \ldots, w_{m-1}, w_{m+1}, \ldots, w_n]$, and 0
otherwise. The detailed settings of the sentence-generation phase,
including the hyperparameter values determined in the tuning
process, are included in Table 4.


Table 4. MCMC hyperparameters

Hyperparameter                  Value
Dictionary size                 50,000
Hidden nodes per LSTM layer     300
Number of steps                 50
Maximum sentence length         50
Max epoch                       30
Minimum sentence length         7
Initial action probability      [0.3, 0.3, 0.3, 0.1]
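
The replacement and deletion proposals described above can be illustrated with a small sketch. It assumes a hypothetical scoring function `pi` that maps a token list to a non-negative score and a tiny vocabulary; a real system would score candidates with the trained language model rather than a lookup.

```python
def replacement_distribution(sentence, m, vocabulary, pi):
    """g_replace for position m: pi of each candidate sentence, normalized over V."""
    scores = {}
    for w in vocabulary:
        candidate = sentence[:m] + [w] + sentence[m + 1:]
        scores[w] = pi(candidate)
    total = sum(scores.values())
    if total == 0.0:
        raise ValueError("every candidate has zero probability")
    return {w: s / total for w, s in scores.items()}


def delete_word(sentence, m):
    """Deletion is deterministic: the only possible x' drops the word at position m."""
    return sentence[:m] + sentence[m + 1:]


# Toy scorer (assumption): prefers 'good' over 'fine' in position 1.
def toy_pi(tokens):
    return {"good": 3.0, "fine": 1.0}.get(tokens[1], 0.0)


dist = replacement_distribution(["a", "bad", "essay"], 1, ["good", "fine", "bad"], toy_pi)
shorter = delete_word(["a", "bad", "essay"], 1)
```

Because the denominator sums over the whole vocabulary, the cost of one replacement step grows with the dictionary size, which is one reason the dictionary is capped.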


3.1.3 Synthesis 


One important purpose of the present study is to develop a 
framework linking automated essay scoring and automated 
feedback generation. Thus, the study can be decomposed into two
parts: a supervised text classification task using CNN and RNN
models and an unsupervised paraphrase generation task
using the MCMC sampling method with constraints. In the synthesis,
a performance classifier was applied to extract feedback that 
corresponds to the score that is assigned by the AES algorithms. 


3.2 Evaluation Metrics 

The objective of the AES training stage is to minimize the mean 
squared error (MSE) between the scores provided by human raters 
and the prediction scores generated by the models. 


In the automated essay scoring tasks, several measures including 
the quadratic weighted kappa (QWK [9, 10]), exact agreement, and 
alternate-form reliabilities [2] have been used to evaluate the 
performance of AES models in previous studies. In the current 
study, we present the results of QWK, which measures the degree 
of agreement between human raters and the machine on one essay 
and can be calculated by: 


$$\mathrm{QWK} = 1 - \frac{\sum_{i,j} W_{i,j}\, O_{i,j}}{\sum_{i,j} W_{i,j}\, E_{i,j}},$$

where $W_{i,j}$ is calculated by $W_{i,j} = \frac{(i-j)^2}{(N-1)^2}$ (i represents the
human-rated score; j represents the machine-rated score; N represents
the score range), $O_{i,j}$ represents the number of essays that receive a
rating i by the human and a rating j by the machine, and E is the
outer product of the histogram vectors of the two scores. According
to Williamson, Xi, and Breyer [43], QWK scores higher than 0.7
indicate high accuracy.
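
For readers who want to check a QWK value by hand, the definition can be sketched in plain Python. Two assumptions are made here: N is taken as the number of possible ratings, and the expected matrix E is rescaled so that it has the same total count as O (the usual convention). The function name and the handling of degenerate input are ours.

```python
def quadratic_weighted_kappa(human, machine, min_rating, max_rating):
    """QWK = 1 - sum(W * O) / sum(W * E) over all rating pairs (i, j)."""
    n = max_rating - min_rating + 1            # number of possible ratings, N
    if n < 2:
        raise ValueError("need at least two possible ratings")
    num_items = len(human)
    observed = [[0] * n for _ in range(n)]     # O[i][j]: human i, machine j
    for h, m in zip(human, machine):
        observed[h - min_rating][m - min_rating] += 1
    hist_h = [sum(row) for row in observed]
    hist_m = [sum(observed[i][j] for i in range(n)) for j in range(n)]
    numerator = denominator = 0.0
    for i in range(n):
        for j in range(n):
            weight = (i - j) ** 2 / (n - 1) ** 2
            expected = hist_h[i] * hist_m[j] / num_items   # E[i][j]
            numerator += weight * observed[i][j]
            denominator += weight * expected
    if denominator == 0.0:
        return 1.0  # both raters gave the same single rating to every essay
    return 1.0 - numerator / denominator
```

Identical score vectors give a QWK of 1, and disagreements are penalized by the squared distance between the two ratings, so a miss by two score points costs four times as much as a miss by one.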
For the feedback generation task, we used several measures to
evaluate the generated sentences. The first step is concerned with 
language model training, whereas the second step is concerned with 
generating sentences with the MCMC sampling method. More 
specifically, we first reported the training process of the language 
model over epochs. The objective of the first training process is to 
minimize the perplexity of the language model, which can be 
calculated by: 


$$\mathrm{PPL} = 2^{-\frac{1}{N} \sum_{i=1}^{N} \log_2 p(w_i)},$$


where N equals the number of words in the corpus and $p(w_i)$
indicates the probability of the word appearing in position i. The
lower the PPL is, the more precisely the corpus is modeled.
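
As a quick numerical check of the perplexity definition, the sketch below computes PPL from a list of per-word probabilities. The logarithm is taken in base 2 to match the base-2 exponent, which is an assumption where the base is not stated; the function name is ours.

```python
import math


def perplexity(word_probs):
    """PPL = 2 ** (-(1/N) * sum_i log2 p(w_i)) for per-word probabilities p(w_i)."""
    if not word_probs:
        raise ValueError("need at least one word probability")
    n = len(word_probs)
    return 2.0 ** (-sum(math.log2(p) for p in word_probs) / n)


# A model that assigns probability 1/4 to every word is as uncertain
# as a uniform choice among four words.
ppl_uniform = perplexity([0.25, 0.25, 0.25, 0.25])
```

Perplexity can therefore be read as the effective number of equally likely words the model is choosing among at each position.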


For the generated sentences, we evaluated the model performance 
using two measures. First, we computed the negative log-likelihood
(NLL) of the sentences to evaluate their fluency using the Reuters
corpus distributed with NLTK. The lower the NLL is, the more
fluent the sentences are. Second, we invited two volunteers to rate
the quality of 50 pieces of feedback in terms of sentence fluency
and relatedness on a scale of 0 to 1; the higher the scores are, the
more fluent and related the generated feedback sentences are.


4. RESULTS 
4.1 Rating Accuracy of AES Algorithms 


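The results show that, for Prompt 1, the most accurate algorithm is
CNN + Bi-LSTM, whereas for most of the remaining prompts the most
accurate algorithm is CNN + LSTM; the exceptions in Table 5 are
Prompt 3, where CNN scores highest, and Prompt 8, where CNN +
Bi-LSTM is marginally ahead. The average QWK of CNN + LSTM
reaches 0.734, as shown in Table 5. In general, the models that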
integrate LSTM/Bi-LSTM perform better than CNN. Compared 
with the baseline [34], the CNN+LSTM model in the present study 
performs better on Writing Prompt 1-7, but poorer on Prompt 8. In 
addition, the average QWK of CNN+LSTM also outperforms the 
baseline model [34]. 


Table 5. Comparisons of QWK of the implemented models

Prompt   CNN    CNN + LSTM   CNN + Bi-LSTM   Phandi et al., 2015
1        0.81   0.87         0.88            0.76
2.1      0.62   0.64         0.52            0.61
2.2      0.51   0.61         0.51            -
3        0.73   0.63         0.62            0.62
4        0.83   0.83         0.72            0.74
5        0.77   0.86         0.76            0.78
6        0.77   0.85         0.8             0.78
7        0.72   0.79         0.75            0.73
8        0.35   0.53         0.54            0.62
Ave      0.68   0.73         0.68            0.71

Note: "-" indicates that prompts 2.1 and 2.2 were combined into a single score.


Results of the average QWK across genres (e.g., persuasive, 
narrative, and expository) and source-independent writing can be 
found in Table 6, which shows that CNN+LSTM outperformed 
CNN and CNN+Bi-LSTM on both genres. However, the three 
models all performed poorly on the persuasive, narrative, and 
expository criteria. The results are consistent with previous studies 
[34], as the models generally have better predictions on the prompts 
with smaller score ranges. A wide score range may add complexity
to the training process of deep learning models. In
addition, previous studies on applying deep neural networks in AES
yielded similar results showing that models generally performed
poorly on Prompts 2 and 8 [9, 10, 34, 40]. The present study also
found that the three deep learning algorithms performed better
on certain genres of writing, but were less
accurate on Prompts 2, 3, and 8. One possible explanation is that,
for Prompt 2, two domain scores instead of one single global score
are provided. The inherent inconsistency or low reliability of a
single human rater’s scoring makes it difficult for machines to learn
the scoring pattern. For Prompt 8, the score range is 0 to 60,
as shown in Table 1. Compared with other prompts whose score
ranges are narrow (0 to 3 or 0 to 4), this extremely wide range (i.e.,
the number of categories of the outcome variable) may hinder the
learning process of deep learning models.


Table 6. Average QWK across genres

Model           Persuasive/narrative/expository   Source independent
                (Prompts 1, 2, 7, 8)              (Prompts 3, 4, 5, 6)
CNN             0.601                             0.775
CNN+LSTM        0.688                             0.793
CNN+Bi-LSTM     0.640                             0.726


4.2 Runtime of AES 


Prediction accuracy is of utmost priority in machine learning. 
However, in a fully-automated scoring and reporting system, 
scoring efficiency represented as the time it took to run one epoch 
(i.e., the runtime) also plays an important role. Table 7 shows the 
average runtime for one epoch of the three models: CNN was the
fastest of the three on average, followed by CNN+LSTM. Thus,
CNN+LSTM combines the highest accuracy with relatively
high efficiency (i.e., it is the second fastest algorithm of the three),
and it was chosen as the AES algorithm for the
feedback-generation step.


Table 7. Average runtime and memory

Model            Runtime for one epoch   Number of parameters
CNN              51s                     5117233
CNN + LSTM       53s                     4988977
CNN + Bi-LSTM    55s                     4986929


4.3 NLL of Generated Feedback and Human 
Ratings 

For the sentence generation process, the generated sentences were 
the ones with the lowest NLL after 50 steps of running. The 
feedback phrases were generated using a sentence paraphrasing 
CGMH approach before being passed on to the performance 
classifier. The feedback templates were sampled from the ASAP 
rating descriptors and feedback phrases were generated based on 
the language model trained on the IMDB dataset. 


Figure 1 presents the training process of the language model, and
Table 8 shows the NLL and human-rater evaluations of the 
generated sentences regarding Fluency and Relatedness on a scale 
of 0 to 1. The higher the scores, the more fluent and related the 
generated feedback sentences. The results revealed that the MCMC 
method is able to generate fluent and semantically-relevant 
sentences. 

Table 8. NLL and rater evaluation on sentence generation

Evaluation method            Measure
NLL                          10.01
Human rating: Fluency        0.62
Human rating: Relatedness    0.52


[Figure: line plot titled "Training Process"; x-axis: Epoch (1 to 25); y-axis: NLL loss; series: Backward and Forward language models.]


Figure 1. Learning curves of the training process of 
the language model (NLL convergence w.r.t. epochs). 


5. DISCUSSION AND CONCLUSION 


This study proposed and implemented a novel framework for
automated assessment and reporting that combines supervised
deep learning models with an unsupervised MCMC sampling
method. Specifically, this study compared the
performances of three models, namely CNN, CNN+LSTM, and 
CNN+Bi-LSTM, on AES tasks in the same context. Results 
revealed that CNN+LSTM demonstrated the highest performance 
on the AES tasks among the three algorithms. Moreover, the 
CNN+LSTM outperformed the baseline model on seven out of 
eight writing prompts, which demonstrates the potential of word- 
embeddings and deep learning models on automated essay scoring. 


A recent literature review revealed that text-based feedback was 
more effective in improving performance [28]. Providing feedback 
within digital learning and assessment systems is essential for 
students’ self-directed learning. However, it is laborious to 
manually devise a large amount of expert-derived quality feedback. 
Compared with sentence-generation supervised-learning methods, 
the CGMH sentence-paraphrasing unsupervised-learning method 
can augment the expert-driven feedback template corpus by 
generating feedback phrases with higher efficiency and flexibility. 
Thus, the proposed method is promising in promoting text-based 
feedback generation within automated assessment systems. Results 
of the current study could facilitate future implementations and 
validations of personalized automated feedback provision for ITSs 
and other virtual learning systems. 


6. LIMITATIONS AND FUTURE WORK 


We identified several limitations in the present study. First, this 
study does not empirically validate the AES and the automated 
feedback generation system in educational settings. Future research 
will be conducted to provide empirical evidence on the validity and 
efficiency of the framework. Second, the present framework 
generates feedback using a holistic score for essays. Future research
will incorporate linguistic components into AES to enhance the
interpretability of the scoring results and to generate more
fine-grained feedback.

ABSTRACT


Self-regulated learning (SRL) is a critical 21st-century skill. In this
paper, we examine SRL through the lens of the searching,
monitoring, assembling, rehearsing, and translating (SMART)
schema for learning operations. We use microanalysis to measure 
SRL behaviors as students interact with a computer-based 
learning environment, Betty's Brain. We leverage interaction data, 
survey data, in situ student interviews, and supervised machine 
learning techniques to predict the proportion of time spent on each 
of the SMART schema facets, developing models with prediction 
accuracy ranging from rho = .19 for translating to rho = .66 for 
assembling. We examine key interactions between variables in 
our models and discuss the implications for future SRL research. 
Finally, we show that both ground truth and predicted values can 
be used to predict future learning in the system. In fact, the 
inferred models of SRL outperform the ground truth versions,
demonstrating both their generalizability and their potential to
improve adaptive scaffolding for students
who are still developing SRL skills.


Keywords 


Self Regulation, SMART, Self Regulated Learning, Machine 
Learning, Student Interviews 


1. INTRODUCTION 


In traditional classrooms, most support for acquiring self- 
regulated learning (SRL) strategies comes from teachers, who 
might check in on projects and/or provide advice about next steps 
[33] in order to keep students focused on their end goals. 
However, teachers’ external regulation alone is insufficient to 
encourage educational success [24]; the learner must also develop 
internal regulation schemas. SRL demands may increase when the 
student is completing a project in a computer-based learning 
environment that is no longer teacher-led. The software might 
scaffold learning activities, but identifying the complex behaviors 
involved with SRL is still not a typical function of most 
computer-based learning systems. 


In most computer-based learning environments, learners must 
control, manage, plan, and monitor their learning [12], i.e.,
implement the definitional components of SRL. SRL has 
consistently been shown to facilitate knowledge acquisition and 
retention among learners in a structured and systematic way [12]. 
As such, work has called for a deeper understanding of SRL 
impacts in online learning [1, 8, 37]. 


A range of techniques have been used to better understand SRL 
both in computer-based learning environments (e.g., [1, 5, 12, 
34]) and in other contexts (see [17, 27] for meta-analyses). 
Research in computer-based learning can be split into two groups: 
supporting SRL and detecting SRL behaviors [46]. Supporting 
SRL has taken a number of forms, but in general, these 
approaches typically scaffold students in either their goal-setting, 
self-evaluation, help-seeking, self-efficacy, or some combination 
of these [29]. This might be through verbal prompts (e.g. "Take 
time to read everything,") [7, 22] or more intricate support 
systems [25], such as progress bars [14], or tools such as 
notebooks, that better facilitate student reflection [2, 35]. 


In terms of detecting SRL in computer-based learning 
environments, Azevedo and colleagues have (using MetaTutor) 
considered the role that emotion plays in regulation, posing that 
affect should be considered as we scaffold SRL behaviors [4]. 
Segedy et al. [36] used interaction data and coherence analysis to 
measure self-regulation. Learner behaviors were tracked using log 
files to assess action coherence (i.e., did a student’s actions 
present a coherent strategy relevant to the current tasks), which 
was shown to predict learning. Winne et al. [45] also leveraged 
log data in a scalable system that traces student actions, 
classifying each learning event into SRL categories in order to 
better understand student cognition, motivation, and 
metacognition. We build upon this approach in this work. 


While interaction data has been successfully used to detect SRL, a 
number of researchers argue that this data should not be 
considered in isolation [3, 37, 40]. Instead, we must also consider 
contextual factors and individual differences not easily inferred 
from logs. This work combines interaction data with data from 
targeted in-situ student interviews and student survey data to 
predict SRL as characterized by the COPES and subsequent 
SMART models of SRL [42] (discussed in detail below). We 
examine the impact of SRL on learning, analyzing contextual and 
student-level factors that may influence SRL behavior and 
demonstrating the potential of the latent encoding of SRL for 
identifying students who need further support. 


1.1 Related Works 


At a high level, SRL is a process in which learners take initiative 
to identify their learning goals and then adjust their learning 
strategies, cognitive resources, motivation, and behavior to 
optimize their learning outcomes [11, 42]. First characterized in 
1989 [47], SRL is now widely acknowledged as an essential skill
for learning in the modern knowledge-driven society [23]. In 
learning technologies specifically, recent work has called for a 
deeper understanding of SRL and for learning technology that 
supports the development of SRL strategies [1, 3, 8, 20, 37]. 


In order to provide insight into how SRL works, researchers have 
proposed a number of theoretical models (e.g., [30, 47]). Winne & 
Hadwin's model [43], grounded in information processing theory, 
characterizes SRL as a series of events that happen over four 
recursive stages: (1) task definition, (2) goal setting and planning, 
(3) studying tactics, and (4) metacognitive adaptation of studying
techniques. Each stage is then characterized by Conditions, 
Operations, Products, Evaluations, and Standards (COPES). In 
later work, Winne subcategorized the COPES model further by 
detailing five kinds of operations—searching, monitoring, 
assembling, rehearsing, and translating—known as the SMART 
model [42]. 


In the context of educational data mining, we can study SRL by 
measuring these theoretical constructs and studying their 
relationships to each other and to external measures (such as 
achievement). SRL constructs can be measured either online 
(while an activity is happening) or offline (before or after an 
activity) [34]. Offline assessments typically rely on self-report 
questionnaires, but student interviews have also been used. These 
can be implemented either online or offline and can offer
advantages over questionnaires that may limit students to
pre-defined answers [16, 40].


Trace analysis is perhaps the main approach used (and endorsed
[37]) to measure SRL online. Traces (such as log data) capture
learning actions along with additional contextual and timing 
information, providing a detailed window into a learner's 
processes and behaviors [40]. This data can support microanalytic 
approaches, as sequences of actions can be aligned with different 
facets of a self-regulation model [21, 45]. Models that
conceptualize SRL in terms of events or student actions (such as 
the COPES model [43]) lend themselves more to a trace-based 
analysis [42] than to offline measurement. However, many 
researchers argue that trace data should be supplemented with 
additional measurements (e.g., self-reports or think-alouds) when 
measuring SRL [3, 37, 40]. 


1.2 Current Study 


The current study was conducted within the context of Betty’s 
Brain, a computer-based learning environment for middle school 
science. We combine multiple data sources (interaction, surveys, 
and interview data) to analyze SRL patterns through the lens of 
Winne’s COPES and SMART models [42]. 


We first demonstrate that combining features from different data 
sources yields the most successful models of the SMART facets. 
We present a feature analysis to investigate the key interactions in 
each model. We next examine how the different facets of SRL 
influence student learning. We consider not only the ground truth 
calculations of SMART facets but also our predicted models of 
these facets, showing that the latter better predicts future student 
outcomes than the original variables. 


To our knowledge, this work presents the first exploration of how 
student interviews, surveys, and interaction data may be used in 
concert to predict SRL and learning. This approach provides 
detailed insight into how we may best support students in an 
environment where external regulation may be harder to provide. 


2. DATA 


2.1 The Learning Environment 

In this project, we used the learning environment Betty’s Brain. 
This system implements a learning-by-teaching model [9], where 
students teach a virtual agent named “Betty” by creating a causal 
map of scientific processes (e.g., thermoregulation or climate 
change). Betty demonstrates her “learning” by taking quizzes, 
graded by a mentor agent, Mr. Davis. In this open-ended system, 
students choose how to navigate a variety of learning sources, 
how to build their maps, and how often to quiz Betty. They may 
also interact with Mr. Davis, who can support their learning and 
teaching endeavors [10]. 


Betty’s Brain is a suitable environment for examining SRL 
behaviors for two reasons. Firstly, students choose when and how 
to perform each step of the learning process (both their own and 
Betty’s) [20, 33]. Indeed, the pedagogical agents in Betty’s Brain 
are designed to facilitate the development of SRL behaviors by 
providing a framework for the gradual internalization of effective 
learning strategies. Secondly, students’ interactions with Betty’s 
Brain are logged to an online database with detailed timing 
information, enabling the microanalysis of student actions [37] for 
the measurement of SRL behaviors and strategies. 

Figure 1. Screenshot of Betty's Brain showing a partial causal
map constructed by a student. 


2.2 Data Collection 


This study examines data from 93 sixth graders who used Betty’s 
Brain during their 2016-2017 science classes in an urban public 
school in Tennessee. The first data collection occurred over seven 
school days. On day 1, students completed a 30-45-minute paper- 
based pre-test that measured knowledge of scientific concepts and 
causal relationships. On day 2, students participated in a 30- 
minute training session about the learning goals and user 
interface. Afterwards (days 2-6), students used the Betty’s Brain 
software for approximately 45-50 minutes each session, using 
concept maps to teach Betty about the causal relationships 
involved in the process of climate change. On day 7, students 
completed a post-test with the same questions as the pre-test. In 
addition to the data described, we also surveyed students on self- 
efficacy [31] and the task value [31]. 


A second data collection period occurred two months later, during 
which students were asked to model the causal relationships 
involved in thermoregulation. This was otherwise identical to the
first session, but we consider only the learning data (pre- and
post-test) from this second scenario (see section 4.2).

2.3 In-Situ Interviews

As students interacted with Betty’s Brain, automatic detectors of 
educationally relevant affective states [19] and behaviors [26], 
already embedded in the software, identified key moments in the 
students’ learning processes, either from specific affective 
patterns or theoretically aligned behavioral sequences. This 
detection was then used to prompt student interviews through 
Quick Red Fox (QRF), an app which integrates interview data 
with Betty’s Brain events. Interviewers sought to take a helpful 
but non-authoritative role when speaking with students. 
Interviews were open-ended and occurred without a set script; 
however, students were often asked what their strategies were (if 
any) for getting through the system. As new information emerged 
in these open-ended interviews, questions were designed to elicit 
information about intrinsic interest (e.g., “What kinds of books do 
you like to read and why?”). Overall, however, students were 
encouraged to provide feedback about their experience with the 
software and talk about their choices as they used the software.

2.4 Interview Transcription and Coding 

A total of 358 interviews were conducted during this study and 
stored on a secure file management system. Interviews were 
manually transcribed by three members of the research team, 
preserving all metadata but scrubbing any identifying information. 


The code development process followed [38]’s 7-stage recursive, 
iterative process: conceptualization, generation, refinement, 
codebook generation, revision and feedback, implementation, and 
continued revision. The conceptualization of codes involved a 
literature review to capture experiences relevant to affect and 
SRL. Using grounded theory [13], we worked with the lead 
interviewer (2nd author) to identify categories that were (1) 
theoretically valid and pertinent to the conditions in the COPES 
model and (2) likely to saliently emerge in the interviews. 


We iteratively refined the coding scheme until the entire research 
team reached a shared understanding. Following the coding 
manual's production, external coders reached acceptable inter-rater
reliability with the 3rd author before coding all of the
transcripts. All codes had Cohen’s kappa > .6, and the average 
Cohen’s kappa across codes was .83. See Table 1 for details. 
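
For reference, Cohen's kappa for one interview code can be computed from two coders' labels as sketched below. The label vectors are toy data (an assumption), not the study's codes, and the function name is ours.

```python
from collections import Counter


def cohens_kappa(rater_a, rater_b):
    """kappa = (p_o - p_e) / (1 - p_e) for two raters labeling the same items."""
    n = len(rater_a)
    if n == 0 or n != len(rater_b):
        raise ValueError("raters must label the same non-empty set of items")
    # Observed agreement: share of items given the same label by both raters.
    p_o = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    # Chance agreement: product of the raters' marginal label frequencies.
    counts_a, counts_b = Counter(rater_a), Counter(rater_b)
    p_e = sum(counts_a[label] * counts_b[label] for label in counts_a) / (n * n)
    if p_e == 1.0:
        return 1.0  # both raters used one identical label throughout
    return (p_o - p_e) / (1.0 - p_e)


# Toy example (assumption): 1 = code present, 0 = code absent.
kappa = cohens_kappa([1, 1, 0, 0], [1, 1, 0, 1])
```

Here the coders agree on three of four transcripts, but half of that agreement is expected by chance, so kappa is lower than the raw agreement rate.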


2.5 SMART Encoding 

We operationalized SRL behavior within the log data using the 
COPES and SMART SRL frameworks [42]. In this work, we 
categorize all student actions recorded in the log files as 
“operations” within the COPES model (defined as “cognitive and 
behavioral actions applied to perform the task”). We then evaluate 
these operations using the SMART model, which subcategorizes 
actions by the information taken as input and product generated 
[39]. Specifically, the SMART model presents five primitive 
cognitive operation subcategories: Searching, Monitoring, 
Assembling, Rehearsing, and Translating [39]. Each category is 
briefly described below; for more details, see [39, 41, 42, 45]. 
Examples specific to Betty’s Brain are shown in Table 2. 


Searching is the operation where a learner focuses their attention 
on a knowledge base or resource to update their working memory. 


Monitoring considers two types of information: (1) learner 
perceptions (current understanding, quiz answers, etc.), (2) 
standards for performance. In monitoring activities, the learner 
evaluates their perceptions compared to the standards. 


Assembling involves building a network of internal links between
acquired information to understand relationships (X precedes Y,
Y causes Z, etc.). Assembling activities help students to connect
individual items of knowledge in working memory.


Table 1. Interview codes

Code                             N     Description
Helpfulness                      51    Utility of system resources for learning, and positive evaluations of the resources. κ=.643
Interestingness                  11    Interestingness of system resources and continued desire to use the platform. κ=.726
Strategic Use                    205   Indicates plan for interacting with the platform, or changes in strategy or interaction based on experiences. κ=.911
Positive Mr. Davis Attribution   8     Explicitly mentions interactions with Mr. Davis as positive experiences. κ=.838
Positive Science Attribution     26    Explicitly (positively) mentions science in relation to books, future careers, school subjects, and overall evaluations. κ=.837
Positive Persistence             105   Expression of a desire for challenge and that the current task is a challenge; there is active pursuit of a goal, and repeated attempts to complete a step/problem. κ=.911
Procedural Strategy              225   Step by step approach to the learning activity, active use of within-platform tools, reference to previous or upcoming step. κ=.862
Motivational Strategy            151   Explicit indication of expected outcome from behaviors/actions, explicitly mentions a pursuit for mastery, contains positive attribution/emotion for completion, and/or mentions desire to meet task demands. κ=.870
Self-Confidence                  174   Positive description of own progress or ability, self-assessments of learning progress, willingness to encounter learning challenges, recognition of helpful resources. κ=.877


Rehearsing operations repeatedly direct attention to information 
that the learner is currently working on. These actions reinforce 
the same information and prevent decay in working memory. 


Translating operations reformat information into a new 
representation, providing the potential for alternate interpretations 
and understanding. Examples include converting a diagram to 
plain text or answering a question about a diagram. 


To enable a trace analysis of student SRL patterns [37] we first 
assigned each of the possible student operations within Betty’s 
Brain to one SMART category. We categorized operations that 
added new items to the concept map within Betty’s Brain as 
assembling, and operations that edited existing items as 
monitoring. In ambiguous cases, such as between translation and 
monitoring tasks, we considered student agency. Specifically, 
actions initiated by the system were classified as translating even 
if they had an evaluative component. In our operationalization of 
the SMART model, we found that Betty’s Brain logged no 
rehearsing actions; thus, this category was not analyzed. 
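
The assignment rules in this paragraph can be expressed as a small lookup. The sketch below is illustrative only: the action-type strings and the `initiated_by` field are hypothetical names rather than the actual Betty's Brain log schema, and treating every system-initiated action as translating is a simplification of the agency rule described above.

```python
def smart_category(action_type, initiated_by):
    """Map one logged operation to a SMART facet (hypothetical log fields)."""
    if initiated_by == "system":
        # Agency rule: system-initiated actions count as translating,
        # even when they have an evaluative component.
        return "translating"
    if action_type == "read_resource":
        return "searching"
    if action_type == "add_map_item":
        return "assembling"      # new items added to the concept map
    if action_type == "edit_map_item":
        return "monitoring"      # existing items revised
    return "uncategorized"


# Hypothetical log rows: (action_type, initiated_by)
log = [
    ("read_resource", "student"),
    ("add_map_item", "student"),
    ("edit_map_item", "student"),
    ("answer_question", "system"),
]
facets = [smart_category(action, who) for action, who in log]
```

Keeping the mapping in one function makes the coding decisions explicit and easy to revise if the operationalization changes.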


3. MODEL TRAINING METHODS 


We built supervised machine learning models to detect each facet 
of the SMART model. We leveraged a combination of activity, 
survey, and interview data (described further below). 

Table 2. Example Betty’s Brain actions by SMART facet.

SMART Facet    N     Example
Searching      8     Searching the virtual textbook (initiated by the student)
Monitoring     22    Reviewing and updating the label of a causal link (initiated by the student)
Assembling     2     Adding a causal link to the map (initiated by the student)
Rehearsing     0     -
Translating    3     Responding to system-initiated multiple-choice questions (vs. those initiated by the student)


3.1 Features 

We split features into three groups based on their origin. Each 
group is described in detail below. Due to differences in scale, we 
Z-scored each feature prior to model training. 


(Other) Student Activity Features (N = 4). These features 
provide a high-level description of student actions: the raw 
number of student actions, the proportion of links made that were 
ineffective, time spent off-task/idle (as characterized in [36]), and 
number of successful quizzes. These features were designed to be 
more coarse-grained than the log data used to derive the SMART 
variables. None of the fine-grained features used to calculate the 
SMART encoding are included in this feature set. 


Student Interview Codes (N = 9). These were derived from the 
transcribed student interviews (described in section 2.3). In cases 
where students had multiple interviews, codes were averaged to 
provide one feature per code per student. 


Survey Features (N = 2). Survey features come from the two 
survey measures described in section 2.2: self-efficacy and task 
value. While each measure consisted of multiple survey questions, 
both were summarized down to one variable, respectively. 


3.2 Dependent Variables 

We initially considered four dependent variables: the proportion
of the time a student spent on each of the SMART variables
discussed in section 2.5. We considered time spent rather than raw
action counts for a more standardized comparison and to avoid 
misinterpretation. For example, there are more monitoring actions 
than searching actions; however, it is common for students to 
spend considerably more time searching than monitoring. Due to 
time spent idle (at least 30 seconds of inactivity [36]), the sum of 
these four variables for any given student may not be 1. The most 
common category was searching (M=0.65, SD=0.07), followed by 
monitoring (M=0.16, SD=0.06), translating (M=0.10, SD=0.02), 
and assembling (M=0.09, SD=0.04). 


We also considered a second set of dependent variables related to 
student learning. We derived two variables, one for the current 
scenario from which the rest of the data was collected, and one for 
the future scenario. In both cases, learning was characterized by 
post-test minus pre-test score. We consider both scenarios to examine how
well our approach generalizes to future interactions and to
understand how immediate context may influence prediction.


3.3 Regression Models 

We used scikit-learn [28] to implement Bayesian ridge regression, 
linear regression, Huber regression, and random forest regression, 
and also implemented XGBoost with a separate library [15]. 
Hyperparameters were tuned on the training set using scikit- 
learn’s cross-validated grid search [28] where appropriate. 


All models were trained using 4-fold student-level cross- 
validation and repeated for ten iterations, each with a new random 
seed. For evaluation, predictions were pooled across folds, and 
averaged across iterations. These models then underwent a 
decision tree based secondary analysis, discussed below. 
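

One reading of this evaluation protocol is sketched below with scikit-learn and SciPy. The synthetic data, the number of trees, and the choice to average the per-iteration Spearman rho (rather than averaging predictions) are assumptions made for illustration. The sketch also assumes one row per student, so that splitting rows is equivalent to splitting students.

```python
import numpy as np
from scipy.stats import spearmanr
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import KFold
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

def repeated_cv_spearman(X, y, n_splits=4, n_repeats=10):
    """Mean Spearman rho of pooled out-of-fold predictions over repeated K-fold CV."""
    rhos = []
    for seed in range(n_repeats):  # a new random seed for every iteration
        folds = KFold(n_splits=n_splits, shuffle=True, random_state=seed)
        pooled = np.empty_like(y, dtype=float)
        for train_idx, test_idx in folds.split(X):
            # Scaling is fit inside each training fold to avoid leakage.
            model = make_pipeline(
                StandardScaler(),
                RandomForestRegressor(n_estimators=50, random_state=seed),
            )
            model.fit(X[train_idx], y[train_idx])
            pooled[test_idx] = model.predict(X[test_idx])
        rho, _ = spearmanr(pooled, y)  # predictions pooled across the four folds
        rhos.append(rho)
    return float(np.mean(rhos))

# Synthetic stand-in data: 80 hypothetical students, 15 features (4 + 9 + 2).
rng = np.random.default_rng(0)
X = rng.normal(size=(80, 15))
y = 0.6 * X[:, 0] + rng.normal(scale=0.5, size=80)
mean_rho = repeated_cv_spearman(X, y)
```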


4. RESULTS 


We compare model accuracy by computing the correlation 
between the model predictions and the ground truth values 
derived from student logs. We measured the Spearman rho 
correlation coefficient in the test folds to evaluate models. In the 
majority of cases, random forest regressors yielded the best 
results. As such, results from these models are reported below. 


4.1 Predicting SMART Operations 

We first consider results predicting the proportion of time a 
student spent on each of the four SMART operations. For each 
operation (i.e., searching, monitoring, assembling, and 
translating), we developed models drawn from various
combinations of our feature types (actions, surveys, and interview 
codes). Thus, we were able to test the modeling potential of seven 
different combinations of features for each SMART operation (see 
Table 3). To provide a point of comparison, we generated a 
chance baseline for each variable by shuffling the ground truth 
values. This allowed us to estimate a random baseline that still 
preserved the original distribution. 
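

A shuffled baseline of this kind can be sketched in plain Python. The values below are hypothetical, and correlating each shuffled copy against the unshuffled ground truth is an assumption about how the baseline was scored. Shuffling keeps the same set of values, so the distribution is preserved while any student-level alignment is destroyed.

```python
import random

def average_ranks(values):
    """Ranks starting at 1, with tied values sharing their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    start = 0
    while start < len(order):
        end = start
        while end + 1 < len(order) and values[order[end + 1]] == values[order[start]]:
            end += 1
        shared = (start + end) / 2 + 1  # mean of the 1-based ranks start+1 .. end+1
        for pos in range(start, end + 1):
            ranks[order[pos]] = shared
        start = end + 1
    return ranks

def spearman_rho(a, b):
    """Spearman correlation, computed as the Pearson correlation of the ranks."""
    ra, rb = average_ranks(a), average_ranks(b)
    n = len(ra)
    ma, mb = sum(ra) / n, sum(rb) / n
    cov = sum((x - ma) * (y - mb) for x, y in zip(ra, rb))
    var_a = sum((x - ma) ** 2 for x in ra)
    var_b = sum((y - mb) ** 2 for y in rb)
    return cov / (var_a * var_b) ** 0.5

def chance_baseline(truth, n_shuffles=1000, seed=0):
    """Average rho between shuffled copies of the ground truth and the ground truth."""
    rng = random.Random(seed)
    rhos = []
    for _ in range(n_shuffles):
        shuffled = truth[:]
        rng.shuffle(shuffled)
        rhos.append(spearman_rho(shuffled, truth))
    return sum(rhos) / len(rhos)

# Hypothetical proportions of time spent searching for ten students.
searching = [0.65, 0.58, 0.71, 0.62, 0.69, 0.55, 0.73, 0.60, 0.66, 0.64]
baseline = chance_baseline(searching)
```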


Table 3. Spearman correlations predicting ground truth labels
of self-regulated learning operations


Features                     Searching  Monitoring  Assembling  Translating
Chance Baseline              0.01       0           0.01        0
Individual Feature Sets
Student Surveys (Surveys)    0.28       0.29        0.28        0.08
Student Interviews (Int)     0.31       0.37        0.35        0.09
Student Actions (Act)        0.27       0.47        0.59        0.11
Combined Feature Sets
Int + Surveys                0.35       0.42        0.62        0.13
Act + Surveys                0.29       0.47        0.63        0.12
Act + Int                    0.34       0.51        0.64        0.1
Act + Int + Surveys          0.39       0.55        0.66        0.19


All models outperformed the chance baseline, and models
consistently performed worst at predicting Translating. This may
be due to the low variance between students noted above
(SD = 0.02). The best model performance was achieved by combining
the three feature sets (Actions + Interviews + Surveys). This
suggests that even though these operations are derived from
student log data, additional context from interviews and surveys
can improve SRL predictions.


4.1.1 Feature Interaction Analysis 

Our most successful models were tree-based, meaning that they 
may contain nonlinear relationships that would be unsuitable for 
linear feature analysis. Therefore, we trained one decision tree 
regressor per outcome and examined each tree’s top two levels to 
observe the most important interactions, each of which was 
classified as “High” or “Low.” 
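

The secondary analysis can be illustrated with a depth-limited scikit-learn tree. This is a sketch on synthetic data with hypothetical feature names; how the splits were labeled “High” or “Low” in the original analysis is not specified here, so reading them off the split direction is an assumption.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor, export_text

# Synthetic data: the outcome is high only when BOTH features are low,
# i.e., a pure interaction that a linear feature analysis would miss.
rng = np.random.default_rng(0)
n = 200
self_confidence = rng.normal(size=n)
successful_quizzes = rng.normal(size=n)
X = np.column_stack([self_confidence, successful_quizzes])
y = np.where((self_confidence < 0) & (successful_quizzes < 0), 0.75, 0.60)
y = y + rng.normal(scale=0.01, size=n)

# max_depth=2 restricts the tree to its top two levels of splits.
tree = DecisionTreeRegressor(max_depth=2, random_state=0).fit(X, y)
rules = export_text(tree, feature_names=["self_confidence", "successful_quizzes"])
print(rules)  # each root-to-leaf path reads as "feature 1 <= a AND feature 2 <= b -> value"
```

Each branch direction (at or below the threshold versus above it) gives the “Low” or “High” label for that feature along the path.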


As Table 4 shows, Self-Confidence and Self-Efficacy frequently
occur in these interactions, implying students’ self-regulation
hinged on their perception of themselves. For example, students
with high Self-Confidence who spent less time off task were still
likely to have lower searching values, ostensibly because they
may not feel the need to consult external resources.


Table 4. Top 8 interactions for predicting SMART facets


Feature 1                     Feature 2                         Predicted Value
Low Self-Confidence         + Low Successful Quizzes          = High Searching
High Self-Confidence        + Low Off Task Time               = Low Searching
Low Off Task Time           + Low Self-Efficacy               = High Monitoring
High Off Task Time          + High Self-Efficacy              = High Monitoring
Low Action Count            + Low Self-Efficacy               = High Assembling
High Action Count           + Low Ineffective Links           = Low Assembling
Low Procedural Strategy     + Low Self-Confidence             = Low Translating
High Procedural Strategy    + Low Motivational Strategy       = High Translating


4.2 Predicting Student Learning 

Next, we explored how the four SMART facets predicted student 
learning (operationalized as post-test minus pre-test) in both the
current scenario (from which all the data used in the models was 
collected) and then the future scenario (collected in a second 
round of data collection with the same students; see section 2). In 
this future scenario, the content was different (climate change vs. 
thermoregulation), but the software remained the same. 


We consider three feature sets: 1) the three feature sets used in 
section 4.1 combined; 2) the ground truth values for the SMART 
encodings (dependent variables in section 4.1); 3) predicted 
values for each of the SMART operations generated using the best 
models from section 4.1. 


For both learning outcomes, we tested both the Ground Truth 
values collected from the first scenario (i.e., the actual searching 
or monitoring behaviors from that scenario) and Predicted 
SMART values (as predicted by the Act + Int + Survey models 
from the current scenario). This allowed us to examine how data 
collected in the current scenario generalizes to a future learning 
session. 


As Table 5 shows, each learning model outperformed chance, 
demonstrating both predictive validity and generalizability. These 
results also present two findings of note. First, learning models
constructed from Predicted SMART values outperformed those
constructed from the Ground Truth SMART values for both
scenarios. It is possible that our models in fact smooth over some
of the noise that is present in the ground truth, thus presenting a
more robust measure than the raw encodings [6].


Second, we note that for the future scenario, the Predicted
SMART values outperform the model constructed directly from the
Act + Int + Survey variables, despite these being the values from
which the SMART predictions are made. The SMART values
may provide a latent encoding of this data that is more
generalizable to future occurrences than the raw values; however,
further study would be required to confirm this hypothesis.


Table 5. Spearman rho for models predicting learning gains.
All features are derived from the current scenario


Features              Current Scenario   Future Scenario
Chance Baseline       .01                .01
Act + Int + Survey    .45                St
Ground Truth SMART    .21                .29
Predicted SMART       .32                .43


4.2.1 Feature Interaction 

Using the same feature analysis methods described in section 
4.1.1, we again examined the interactions involved when
predicting learning gains. These results are shown in Table 6. 


We note the need for balance between SMART operations.
For example, high monitoring and low translating resulted in
lower learning on the current scenario, but so did high searching
with low monitoring, suggesting it would be insufficient to simply
increase monitoring activities; we must encourage more effective
combinations of operations. Similarly, these results imply the
need for a carefully structured approach to assembling.


The results shown for the future scenario focus on more 
transferrable features than results for the current scenario. This 
makes sense given that we are no longer considering the 
immediate context. We found that students who had low off task 
time and high persistence in the first scenario were more likely to 
perform well in the second. Students with lower monitoring but
high translating were likely to have lower learning, indicating that it is
not enough to simply test one’s knowledge; it is also important to
review feedback and compare work to standards.


5. DISCUSSION 


Adaptive learning technology that responds to students’ learning 
patterns can improve both immediate and long-term goals by 
supporting the internalization of appropriate  self-regulated 
learning behaviors. In this paper, we infer SRL using a 
combination of data mining and interviews/surveys. 


5.1 Main Findings 


Automated detection of SRL behaviors poses several challenges, 
as many of the processes it entails are highly internal [42]. In this 
work, we demonstrate that a combination of activity data, data 
from surveys, and student interviews provides a more robust 
prediction of SRL than any individual data stream. We find that 
predicted SRL behaviors (from students’ first system interactions) 
predict future performance. In fact, models based on our inferred 
SRL measures outperform models constructed from the original 
features used to train them (action, interview, and survey data) 
and the SMART ground truth values. This finding is important for 
environments where detailed trace analysis may not be possible, 
but coarser-grained activity can be distilled. 


Further, we show that a balanced combination of SRL behaviors 
is required for successful learning. For example, students with low 
learning are likely not spending enough time monitoring, but 
simply requiring them to check their work more often may not 
create improvement if they have not yet fully assembled the 
knowledge necessary to effectively examine their previous efforts. 
Future work should design scaffolds to create this balance.


Table 6. Top interactions for predicting learning. * indicates a predicted value


Scenario           Feature 1                   Feature 2                      Predicted Value
Current Scenario   High Monitoring*          + Low Translating*             = Low Learning
                   High Searching            + Low Monitoring               = Low Learning
                   High Successful Quizzes   + High Self-Efficacy           = High Learning
                   Low Successful Quizzes    + High Ineffective Links       = Low Learning
Future Scenario    High Monitoring*          + Low Assembling*              = High Learning
                   Low Monitoring*           + Low Assembling*              = Low Learning
                   Low Off Task Time         + High Positive Persistence    = High Learning
                   Low Monitoring            + High Translating             = Low Learning


These results demonstrate the importance of considering log data 
in the context of other measures when understanding student SRL. 
This, in turn, underscores the need for more automated measures 
of complex noncognitive measures such as self-efficacy and 
persistence. Our work shows that these codes collected from 
interview data boost SRL detection. In order to scale SRL 
detection, we must first consider how we might automate the 
detection of some of the constructs discussed here (see future 
work below). These results offer the potential for designing pre- 
emptive interventions, providing a more informed, asset-based 
intervention as opposed to responding to a negative event. 


5.2 Applications 

The key application of this work is to develop adaptive online 
learning environments that respond to student SRL. As SRL 
detection continues to improve, systems like Betty’s Brain might 
choose from a wide range of intervention strategies that have
already been shown to improve SRL (e.g., discussion in section 
1). For example, once students who are not employing optimal 
strategies have been identified, additional scaffolding tasks might 
be used to encourage new behaviors. Similarly, the software could 
deliver interventions to increase motivation or interest. 


It is important to note that the proposed intervention strategies 
rely on SRL detection, which is likely always to be imperfect. 
Self-regulation is highly internal [32], and as such, it is unlikely 
that we will ever be able to infer SRL perfectly. Any interventions 
should be designed to be “fail-soft” in that there are no damaging 
effects to student learning or future SRL if delivered incorrectly. 


In situations where computer-based learning is being used to 
augment classroom instruction, a further application of this work 
would be in providing feedback to teachers. Such feedback could
help them dynamically adapt their instruction, as outlined in [18],
for example by providing real-time feedback or an early warning
system.


5.3 Limitations and Future Work 


This work has limitations that should be addressed going
forward. First, the SMART features only characterize student
operations, and they do not give a complete SRL picture. Future 
work should look to combine the SMART framework with the 
broader COPES model [43]. The interview and survey measures 
used in this work may also capture aspects of the cognitive and 
task conditions referred to in the COPES model, but additional 
study would be required to confirm this hypothesis. 


A further limitation is the somewhat circular nature of using student
activity features derived from log data to predict SRL, which is also
derived from log data. While we made every effort to ensure that
our models were not confounded in some way, future work should
consider an external measure of SRL for additional validation
[44].


Finally, interview data is time-consuming to collect, limiting 
scalability. In the future, we will employ alternate measures for 
some of the interview codes measured in this work, such as 
student surveys. It is possible that voice recognition and natural 
language processing could be used in the future to support this 
type of data collection. 


5.4 Conclusions 

This paper investigates predicting student SRL behavior in a 
computer-based learning environment from a complex dataset of 
coarse-grained activity data, in-situ student interviews, and 
student surveys. Our analyses indicated that SRL was best 
predicted from a combination of the three feature sets. We found 
our predicted SRL operations were better at predicting future 
learning than their ground truth equivalents, suggesting the 
potential for a smoother latent encoding and better supporting 
students in future endeavors. We envision this paper contributing 
to future technologies that will track and respond to student SRL 
behaviors and create more positive learning experiences.


ABSTRACT


This study examines data from a field experiment investigating
the effects of a personalized recommendation algorithm
that proposes to students which videos to watch next, after
they complete mini-assessments for algebra that are available
on the Math Nation intelligent virtual learning environment
(IVLE). The end users of Math Nation are students enrolled
in an Algebra 1 course in middle and high schools of the 
state of Florida, and the IVLE is used both during and out 
of school time. The objective of the developed recommenda- 
tion algorithm is to increase student preparation to take the 
state-mandated End-of-Course (EoC) Algebra 1 assessment 
at the end of the school year. The algorithm is based on a 
Markov Decision Process framework that uses as input the 
students’ responses to a series of mini-assessment tests. The 
current study randomly assigned 16,406 students to either 
treatment or control conditions, which were blind to both 
students and teachers. The results indicate that the effects 
of the recommendation algorithm depend on the level of us- 
age of students, showing significant improvements on EoC 
test scores of students who have a moderate level of usage. 
However, there was no effect for low usage students. The
study also shows that practicing with the mini-assessments
available on Math Nation helps students improve their
performance on the End-of-Course test by a small margin,
irrespective of usage level. Finally, the study provides
insights on challenges posed for implementing personalized
recommendation algorithms at a large scale, related
both to student self-regulation and teacher orchestration of
technology use in the classroom.


1. INTRODUCTION


There is a growing trend in employing intelligent virtual
learning environments (IVLE) to aid students in improving
their math performance in K-12 education [8] [26] [23]. While
there is a robust body of literature that shows that students’
preparedness together with various demographic and school
characteristics are key factors for predicting students’ performance
in various math tests, IVLE have been viewed as
an especially promising way of improving students’ achievements
in mathematics. Given the investment of resources
into technology products and the time and effort needed to
integrate them into the curriculum, there has been considerable
interest in determining their effectiveness. A number
of studies have reported positive effects based both on small
randomized control trials and longer term interventions
[19] [16] [15], as well as based on observational data.
There have also been a series of meta-analysis studies
showing that IVLE have substantial effects on student
outcomes [27].


IVLE have the potential of offering personalized learning
experiences. The latter refer to instruction “in which the pace
of learning and the instructional approach are optimized for the
needs of each learner”, according to the United States National
Education Technology Plan 2017. IVLE that offer
some degree of personalization include Khan Academy at the
K-12 level and Newton at the higher education level. As discussed
in [3], at the core of personalized learning strategies
is a recommendation algorithm aiming to propose appropriate
learning materials and topics to the student at the right
time, leveraging the student’s prior history of interactions
with the IVLE.


Many personalized learning strategies leverage ideas and
tools from the field of Reinforcement Learning [9] [20].
The key components of a reinforcement learning based algorithm
are the triplet of state, reward, and action. The state
reflects information on the student’s knowledge and skill
set on the topic(s) under consideration, the reward relates
to the goals of the strategy (e.g., performance on tests, engagement
with the IVLE, etc.), and the action refers to an
activity (e.g., watching a video on a topic of interest, taking
an assessment test, etc.) that, based on the current state
information, aims to maximize the expected reward.


This study reports the results of a large-scale randomized 
field experiment that focuses on the impact of a simple per- 
sonalized strategy implemented on the Math Nation IVLE, 
on a high-stakes, state-mandated End-of-Course (EoC) al- 
gebra test. Although many evaluations of IVLE have been 
published, most of them rely on locally developed standardized
tests, rather than high-stakes statewide tests [10]. Math
Nation is an online video-based tutoring program aiming to
prepare students in the state of Florida for the EoC, which 
is required for high school graduation. The platform offers 
videos on various algebra topics recorded by different tu- 
tors, explaining the main concepts and walking the student 
through related examples, 3-question assessments for each 
topic and 10-question assessments for sets of related topics, 
with video explanations for each question. Therefore, stu- 
dents can assess their progress by taking both the short (3- 
question) and the long (10-question) tests. Further, the plat- 
form offers a monitored discussion area, wherein students 
can pose questions to peers and volunteer tutors. Hence, 
at launch time, it shared a number of characteristics with 
Khan Academy, both being self-guided and easy to use on 
an ad hoc basis, without the need for extensive professional 
development training for teachers. The content of the videos
and assessments is aligned with the curriculum adopted by
the state and also with the content and format of the EoC test.


A new feature of Math Nation is the introduction of an al- 
gorithm to recommend videos to students, leveraging infor- 
mation on their performance on the mini-assessments asso- 
ciated with each video. Specifically, Math Nation divides 
the whole Algebra 1 course materials into 10 sections. Each 
section is further divided into several topics, thus result- 
ing in a total of 93 topics for the entire course. For each 
topic, there is a tutorial video associated with it, recorded 
by different tutors. At the end of the video the student is 
presented with a 3-question assessment (henceforth called a 
mini-assessment) and based on the score obtained, a video 
recommendation (the action) is offered aiming to maximize 
the student’s expected score (the reward) on these mini- 
assessments. The student can follow the recommendation 
or decide to ignore it and select another video of her/his 
own choice by the same or another tutor. To compare the 
effectiveness of the recommendation algorithm, a “business- 
as-usual” competitor is implemented, which recommends the 
next video in a predetermined sequence related to the struc- 
ture of the algebra state curriculum, irrespective of the score 
achieved in the mini-assessment. 


The objectives of the study are twofold: (i) estimate the
average treatment effect of the recommendation algorithm
vis-a-vis its competitor together with its interactions with
previous achievement and level of usage of the algorithm,
and (ii) understand the relationship between performance in
the mini-assessments and the EoC test, after accounting for
math preparedness and school characteristics of the students
that participated in this randomized control study.


The remainder of the paper is structured as follows. Section 2
presents the developed personalized recommendation
strategy. Section 3 describes in detail the data recorded from
the algorithm, as well as other covariates used in the analysis.
Section 4 presents the statistical methods used in the
analysis and the main results of the study. Finally, Section 5
discusses the implications of our findings and suggestions
to modify the recommendation algorithm.


2. PERSONALIZED RECOMMENDATION 
STRATEGY 


Next, we describe the data-driven algorithm for recommend- 
ing a suitable tutoring video to each individual student. 
As previously mentioned, the content of the course is di- 
vided into 93 topics, with each topic accompanied by a video 
recorded by 5 tutors in English and 1 tutor in Spanish. Stu- 
dents can freely select the tutor for each video. 


To rigorously set the stage for the video recommendation
algorithm, fix a single student, and let $s_k(t)$ be the corresponding
“mini-score” for topic $k \in \{1, 2, \cdots, 93\}$, at time
$t = 0, 1, \cdots$. These mini-scores, representing the knowledge
level of the student, are obtained by assessing responses to
the mini-assessments consisting of 4-choice questions, with
a single correct choice. Thus, the set of possible outcomes
consists of $i$ correct answer(s), together with $3 - i$ wrong answer(s),
for $i = 0, 1, 2, 3$. Then, we center and normalize the
corresponding scores (henceforth referred to as mini-scores),
so that on average, simply guessing the answers leads to a
zero score. Thus, we have $s_k(t) \in \{-3, 1, 5, 9\}$, and if the
answers are selected completely at random, $E[s_k(t)] = 0$.
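

The centering can be checked with a short calculation. The affine map below (four points per correct answer, minus three) is an assumption: it is one map consistent with the stated score set and the zero-mean property, and the text above does not spell out the formula.

```python
from fractions import Fraction
from math import comb

def mini_score(correct, n_questions=3, n_choices=4):
    """Centered score for `correct` right answers out of three 4-choice questions."""
    return n_choices * correct - n_questions

# Under pure guessing, each question is answered correctly with probability 1/4,
# so the number of correct answers is Binomial(3, 1/4).
p = Fraction(1, 4)
expected = sum(
    comb(3, i) * p**i * (1 - p) ** (3 - i) * mini_score(i) for i in range(4)
)
scores = [mini_score(i) for i in range(4)]
```

Here `scores` reproduces the four possible mini-scores and `expected` is exactly zero, matching the guessing condition.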


With the above setting, the full state of the student at time
$t$ is given by $S(t) = [s_1(t), \cdots, s_{93}(t)]' \in \{-3, 1, 5, 9\}^{93}$,
while $\|S(t)\| = \sum_{k=1}^{93} s_k(t)$ reflects the (total) score of the
student under consideration at time $t$. The dynamical model
for topic $k$ consists of a Markov chain for which the state
is $s_k(t)$. For the time being, suppose that the parameters
of the Markov chain, consisting of $4 \times 4$ tables of transition
probabilities among the states $\{-3, 1, 5, 9\}$, are available. We
will shortly discuss a statistical method, leveraging transfer
learning techniques, for estimating the Markov transition
kernels according to the observed data.


The recommendation strategy is to propose to the student 
the tutoring video corresponding to the topic with the largest 
predicted growth in the mini-score. Formally, at time t, the 
IVLE recommends the student to watch the tutoring video 
of topic k*, wherein 


$$k^* = \arg\max_{k} E\left[ s_k(t+1) - s_k(t) \mid S(t) \right],$$


where the notation “$\mid$” is used to indicate a conditional
probability distribution. The student can either accept the
recommendation, watch the video and take the mini-assessment,
or can ignore the recommendation and select another video
to watch (by possibly another tutor).


Note that in order to compute the above expected values,
for every topic $i \in \{1, \cdots, 93\}$ it suffices to have only the 4
probabilities corresponding to the transition of the Markov
chain from the current state $s_i(t)$ to the next one $s_i(t+1)$.
Intuitively, the difference $s_k(t+1) - s_k(t)$ reflects
the predicted growth of the student in topic $k$. Therefore,
the high level idea of the recommendation strategy is to
propose to the student to work on the topic in which (s)he is capable
of improving her/his knowledge level the most. In this sense,
the recommendation aligns with Vygotsky’s theory of the
zone of proximal development by providing a video that is
neither too easy nor too challenging. Further, the recommended
topic is totally personalized to the student, since
the state $S(t)$ at time $t$ is unique to each student.
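

The recommendation rule can be sketched in a few lines. This is an illustration under assumptions, not the deployed implementation: the function names, the three-topic example, and the transition matrices are all hypothetical, and only the row of each matrix that corresponds to the current mini-score is used.

```python
import random

STATES = (-3, 1, 5, 9)

def expected_growth(current, row):
    """E[s_k(t+1) - s_k(t)] given the transition row out of the current state."""
    return sum(p * (nxt - current) for nxt, p in zip(STATES, row))

def recommend(state, kernels, rng=random):
    """Pick the topic with the largest expected growth; ties are broken at random.

    state:   dict topic -> current mini-score
    kernels: dict topic -> 4x4 matrix, rows and columns ordered as STATES
    """
    growth = {
        k: expected_growth(s, kernels[k][STATES.index(s)]) for k, s in state.items()
    }
    best = max(growth.values())
    ties = [k for k, g in growth.items() if g == best]
    return rng.choice(ties)

# Hypothetical transition matrices (each row sums to 1).
improving = [
    [0.40, 0.30, 0.20, 0.10],
    [0.10, 0.40, 0.30, 0.20],
    [0.05, 0.15, 0.40, 0.40],
    [0.00, 0.10, 0.20, 0.70],
]
static = [[1, 0, 0, 0], [0, 1, 0, 0], [0, 0, 1, 0], [0, 0, 0, 1]]

state = {1: -3, 2: 9, 3: 1}
kernels = {1: improving, 2: improving, 3: static}
topic = recommend(state, kernels)
```

In this example topic 1 has the most room to grow, topic 2 is already at the maximum score and can only decline, and topic 3 never changes, so topic 1 is recommended.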


Finally, we describe the statistical learning procedure for estimating
the Markov transition probabilities. For this purpose,
the students are clustered into 12 different groups, based
on their demographic and other background data, so that
students of similar learning abilities will be assigned to the
same group (cluster). The details of the clustering procedure
are provided in Section 3. We assume that students in
each group share the Markov transition probabilities reflecting
their cognitive responses to watching the tutoring video
of a specific topic. Thus, in order to estimate the transition
probabilities for students in a fixed group, we divide the
total number of transitions between every pair of the possible
states $\{-3, 1, 5, 9\}$ in the group by the total number
of transitions in the group. We emphasize the following
points. First, while the Markov transition probabilities are
the same for all students in one demographic/background
group, the states are uniquely personalized to each student.
Second, the estimates of the transition probabilities change
over time as the platform collects more data from the responses
of the students to the mini-assessments. Further,
when Math Nation starts being used by the students, the
initial estimates of the transition probabilities are selected
randomly, and are updated throughout the academic year
as the students continue to use it. Finally, if there is more
than one $k^*$ maximizing the predicted growth, one will be
selected at random.
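

A count-based estimate of one cluster's transition matrix might look like the sketch below. Two choices here are assumptions rather than details from the text: counts are normalized per starting state, so that each row is a proper conditional distribution, and a uniform row is used as a placeholder when a starting state has not been observed. The observed pairs are hypothetical.

```python
from collections import Counter

STATES = (-3, 1, 5, 9)

def estimate_kernel(transitions):
    """Estimate a 4x4 transition matrix from (score_before, score_after) pairs.

    The pairs are pooled over all students in one cluster for one topic.
    """
    counts = Counter(transitions)
    kernel = []
    for src in STATES:
        row_total = sum(counts[(src, dst)] for dst in STATES)
        if row_total == 0:
            # No data for this starting state yet: uniform placeholder (assumption).
            kernel.append([1 / len(STATES)] * len(STATES))
        else:
            kernel.append([counts[(src, dst)] / row_total for dst in STATES])
    return kernel

# Hypothetical pooled observations for one cluster and topic.
observed = [(-3, 1), (-3, 1), (-3, -3), (-3, 5), (1, 5), (1, 9), (5, 9), (5, 9)]
kernel = estimate_kernel(observed)
```

Re-running the estimate as new mini-assessment responses arrive gives the time-varying estimates described above.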


Before the algorithm was deployed within the Math Nation
platform, it was extensively tested on synthetic data generated
based on data collected in previous years from the platform.
Specifically, students who had used the platform in previous
years were clustered into 12 groups (see also Section 3) based
on their demographic and background information. Note 
that the distributions of such data are very similar to those 
in the academic year that the recommendation algorithm 
was launched and evaluated in the current study. Subse- 
quently, the response data to the mini-assessment tests of 
the students within each cluster were used to estimate the 
corresponding Markov transition probabilities. The latter 
were then used to initialize the recommendation algorithm 
and to generate synthetic data for students in different clus- 
ters. The upshot of this analysis was that the algorithm 
required adequate engagement (t > 45) to show significant 
improvement in performance in the mini-assessments. We 
revisit this point in the Discussion section. 


Table 1: Distribution of the students across different Math
Achievement Levels, School Grades and Student Grades


Achievement Level  No. of Students | School Grade  No. of Students | Student Grade  No. of Students
1                  473             | A             4,377           | 5              3
2                  1,453           | B             2,001           | 6              1,463
3                  3,487           | C             4,580           | 7              3,599
4                  2,711           |                               | 8              5,893
5                  2,834           |                               |
Total              10,958          | Total         10,958          | Total          10,958


Note: Data based on previous school year performance


Figure 1: Distribution of Pre-Score


3. DATA DESCRIPTION 


In this study, we randomly assign a sample of 16,406 middle 
and high school students enrolled in Algebra 1 in a large 
school district in the state of Florida, to a treatment (pro- 
posed recommendation strategy) or a control (business as 
usual recommendation strategy) group. The assignment was 
blind to students and teachers. The treatment group re- 
ceived video recommendations as described in the previous 
section, while the control group received a recommendation 
to watch the next video in the curriculum sequence. To 
initialize the recommendation, a randomized cluster design 
was employed. Specifically, students were first matched according
to their grade, school characteristics and math preparedness
test scores from the previous school year and then
randomly assigned to the two groups. The variables used
for matching purposes were the scores on the state standardized
mathematics test, called the Mathematics Florida
Standards Assessment¹ (henceforth referred to as Pre-Score,
with the corresponding test referred to as Pre-Test), as well
as an achievement level assigned to them by their schools,
while the quality of each school is reflected by a grade assigned
to it by the state Department of Education². The
latter grades are based on several components and have five
different levels (‘A’ being the highest level and ‘F’ being the
lowest one). Due to lack of data for many of these variables,
5,448 students were removed from any further analysis; hence
Table 1, which shows the distributions of the students
across different Achievement Levels, School Grades and Student
Grades, and Figure 1, which depicts the distribution of the
Pre-Score, are based on the remaining 10,958 students.


¹ http://www.fldoe.org/accountability/assessments/k-12-student-assessment/fsa.stml

² http://www.fldoe.org/accountability/accountability-reporting/school-grades/


Figure 2: Boxplots of clusters hierarchically ordered based
on Pre-score: For each cluster, School Grade, Achievement
Level, Student Grade and Cluster Size are reported.


Using the above information, students were assigned to clusters 
(groups). This cluster assignment is used as a categorical 
variable in the analysis presented in Section 4. The clusters 
are designed so that each of them corresponds to a group with 
a unique combination of math preparedness and school grade. 
In summary, the following four variables were considered by 
the clustering algorithm: Pre-Score, Math Achievement Level, 
School Grade and Student Grade. An agglomerative hierarchical 
clustering algorithm with Gower’s distance metric was employed 
for this task and, using the dendrogram along with silhouette 
values [7], the number of clusters was chosen to be 12. Figure 2 
provides a pictorial representation of the key features of the 
clusters. Specifically, for each cluster the figure depicts the 
boxplot of the Pre-Score and also the corresponding Student 
Grade, Math Achievement Level, School Grade and size of the 
cluster. For ease of comparison, the clusters are ordered 
according to the distribution of the Pre-Score. Hence, cluster 1 
corresponds to the group of students having the lowest 
Pre-Score, while cluster 12 is the group with the highest 
Pre-Score. As Figure 2 shows, the size of cluster 2 was very 
small and hence it was merged with cluster 1 for the subsequent 
analyses. 
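As an illustration of the distance underlying this clustering step, the following minimal sketch computes Gower's distance between student records with one numeric variable and three categorical ones. The field names, the example values, and the decision to treat Achievement Level, School Grade and Student Grade as unordered categories are assumptions made for illustration; the text does not specify these details.

```python
def gower_distance(a, b, numeric_ranges):
    """Gower distance between two records given as dicts.

    Numeric fields contribute |a - b| / range; categorical fields
    contribute 0 if equal and 1 otherwise. The result is the mean
    contribution over all fields, so it lies in [0, 1].
    """
    total = 0.0
    for key in a:
        if key in numeric_ranges:
            value_range = numeric_ranges[key]
            total += abs(a[key] - b[key]) / value_range if value_range else 0.0
        else:
            total += 0.0 if a[key] == b[key] else 1.0
    return total / len(a)


# Hypothetical student records (values are made up for illustration).
students = [
    {"pre_score": 310, "achievement_level": 2, "school_grade": "C", "student_grade": 8},
    {"pre_score": 350, "achievement_level": 4, "school_grade": "A", "student_grade": 8},
    {"pre_score": 330, "achievement_level": 2, "school_grade": "C", "student_grade": 9},
]

scores = [s["pre_score"] for s in students]
ranges = {"pre_score": max(scores) - min(scores)}

d01 = gower_distance(students[0], students[1], ranges)
d02 = gower_distance(students[0], students[2], ranges)
print(d01, d02)
```

A matrix of such pairwise distances is the kind of input an agglomerative clustering routine would typically consume.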


The number of times a particular student takes the mini-assessment 
after watching a video is defined as the usage by that student. 
Figure 3 depicts the average usage per student for each of the 
clusters, for both the control and treatment groups. The overall 
average across the study population is 2.88, with many clusters 
exhibiting considerably lower usage. There are also a few 
clusters exhibiting high usage, e.g., cluster 5 for the control 
group and cluster 9 for the treatment group. 
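The per-cluster averages described above amount to a grouped mean. The sketch below shows one way to compute them; the record layout and the usage counts are hypothetical and serve only to illustrate the definition of usage.

```python
from collections import defaultdict


def average_usage(records):
    """Mean number of mini-assessment attempts per (cluster, group)."""
    totals = defaultdict(lambda: [0, 0])  # (cluster, group) -> [sum, count]
    for cluster, group, usage in records:
        totals[(cluster, group)][0] += usage
        totals[(cluster, group)][1] += 1
    return {key: s / n for key, (s, n) in totals.items()}


# Hypothetical records: (cluster, group, usage count for one student).
records = [
    (5, "control", 6), (5, "control", 2),
    (9, "treatment", 7), (9, "treatment", 3), (9, "treatment", 2),
]
avg = average_usage(records)
print(avg)
```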


4. METHODS AND RESULTS 


Figure 3: Average usage per student for different clusters, 
for both treatment and control groups 


The analyses described below aim to address the two objectives 
outlined in Section 1. In our first analysis, we estimate the 
average treatment effect of the recommendation algorithm on EoC 
scores, using a simple linear regression model with the following 
two categorical variables and the interaction between them: 
(i) the first categorical variable, TC, comprises two levels: the 
first represents the Treatment group that watched the personalized 
recommended videos and took the corresponding mini-assessments, 
and the second corresponds to the Control group; (ii) the second 
categorical variable, Previous Achievement Level, comprises five 
categories, each corresponding to a different level of achievement 
in the Pre-Test. Level 1 stands for the lowest achievement, whereas 
level 5 codes the highest. The linear regression model with the 
above two predictors and their interaction is given by: 


$$y = \mu + \beta_1\,\mathrm{TC} + \beta_2\,\mathrm{Achievement} + \beta_3\,(\mathrm{TC} \times \mathrm{Achievement}) + \epsilon \tag{1}$$


where $y$ represents the EoC score and we further assume 
that $\epsilon \sim N(0, \sigma^2)$. Based on this model, the 
estimate of the average treatment effect of the personalized 
recommendation on the EoC score is the coefficient $\beta_1$ 
corresponding to the variable TC. Further, estimates of the 
standard errors of the regression coefficients are based on 
cluster-robust estimators [2]. To answer the first research 
question discussed in Section 1, we test $H_0: \beta_1 = 0$ vs. 
$H_1: \beta_1 \neq 0$. The coefficient $\beta_1$ is the difference 
between the mean EoC score of the Treatment and the Control group, 
after accounting for the effect of all the other covariates. The 
estimated coefficients (scaled) and corresponding p-values are 
reported in Table 2. Table 2 shows that the achievement levels are 
statistically significant, while the treatment effect (i.e., the 
impact of the developed recommendation algorithm) is not. Further, 
there is a small positive effect, borderline significant 
(p = 0.05), for the interaction of the treatment with Achievement 
level 2. However, as shown in Figure 3, usage patterns vary widely 
across different groups (clusters) of students. 
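To make the roles of the coefficients in model (1) concrete, the sketch below recovers them from cell means. It assumes dummy (treatment) coding with the Control group and Achievement level 1 as reference categories; under that assumption the model is saturated, so least-squares fitted values equal the cell means and the coefficient on TC is the Treatment-minus-Control difference within the reference level. The coding scheme, the function name and the scores are illustrative assumptions, not details taken from the study.

```python
from statistics import mean


def saturated_coefficients(scores):
    """Recover model (1) coefficients from cell means.

    `scores` maps (group, level) -> list of EoC scores, with group in
    {"control", "treatment"}. Assumes dummy coding with "control" and
    the lowest level as reference categories, so that fitted values
    equal the cell means.
    """
    m = {cell: mean(values) for cell, values in scores.items()}
    levels = sorted({level for _, level in m})
    base = levels[0]
    coef = {
        "mu": m[("control", base)],
        "beta1_TC": m[("treatment", base)] - m[("control", base)],
    }
    for level in levels[1:]:
        # Main effect of the level, measured in the control group.
        coef[f"beta2_level{level}"] = m[("control", level)] - m[("control", base)]
        # Interaction: how the treatment gap at this level differs
        # from the treatment gap at the reference level.
        coef[f"beta3_TCxlevel{level}"] = (
            (m[("treatment", level)] - m[("control", level)])
            - (m[("treatment", base)] - m[("control", base)])
        )
    return coef


# Made-up scores for two achievement levels (illustration only).
scores = {
    ("control", 1): [470, 474],
    ("treatment", 1): [471, 475],
    ("control", 2): [480, 484],
    ("treatment", 2): [484, 486],
}
coef = saturated_coefficients(scores)
print(coef)
```

Dividing each such coefficient by its standard error gives the scaled coefficients reported in the tables; the standard errors themselves require the cluster-robust estimator and are not sketched here.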


To account for this, and in order to gain a deeper understanding 
of how the average treatment effect behaves across different 
IVLE usage levels, we fit model (1) separately on groups of 
students exhibiting different usage levels. After some initial 
exploratory analysis, we divided the students into approximately 
evenly distributed usage groups. The groups and the results are 
summarized in Table 3, whose first column specifies the usage 
level of the group. As an example, students who have taken at 
least 10 mini-assessment tests are categorized as a group with 
usage level 10 or higher. 
Table 2: Estimated coefficients and corresponding p-values 
for Model (1) 


Variable               | Scaled Coefficient | p-value 
Intercept              | 348.45             | <0.001 
Treatment              | -0.99              | 0.32 
Achievement level 2    | 11.28              | <0.001 
Achievement level 3    | 23.14              | <0.001 
Achievement level 4    | 35.31              | <0.001 
Achievement level 5    | 50.02              | <0.001 
TC*Achievement level 2 | 1.94               | 0.05 
TC*Achievement level 3 | 0.94               | 0.34 
TC*Achievement level 4 | 1.25               | 0.21 
TC*Achievement level 5 | 1.09               | 0.27 

Note: The scaled coefficients are obtained by dividing the 
estimated coefficients by their standard error. 


The second and third columns contain the p-values corresponding 
to the test of $\beta_1 = 0$ and the scaled version of the 
estimated coefficients, respectively. As is evident from Table 3, 
the first few rows, which correspond to lower usage groups, have 
high p-values and thus the average treatment effect is not 
statistically significant for them. The treatment effect becomes 
significant for students who used the platform more extensively 
(usage level 48 and above, where the p-values range from 0.01 
to 0.08). 
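The link between a scaled coefficient and its p-value can be sketched as follows. The example uses a standard normal reference distribution, which is an assumption: the cluster-robust inference used in the study may rely on a different reference distribution, so not every row of Table 3 need be reproduced exactly by this approximation.

```python
import math


def two_sided_p_value(scaled_coefficient):
    """Two-sided p-value for H0: beta = 0 under a standard normal reference."""
    return math.erfc(abs(scaled_coefficient) / math.sqrt(2))


p_low_usage = two_sided_p_value(0.85)   # scaled coefficient at usage level 13
p_high_usage = two_sided_p_value(2.41)  # scaled coefficient at usage level 48
print(round(p_low_usage, 2), round(p_high_usage, 2))
```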


The model also controls for the level of achievement of students. 
Table 4 presents the results for the $\beta_2$ regression 
coefficients for different usage levels. The corresponding 
p-values are given in parentheses. It can be seen that the effect 
is statistically significant (marked in bold font) across almost 
all Previous Achievement levels and usage levels, as expected 
based on the overall results presented in Table 2. This result is 
in accordance with a large body of literature that has found a 
positive association between level of math preparation and test 
scores (see, e.g., [16] and references therein). Further, the 
magnitude of the coefficient is larger for higher achievement 
levels. 


Model (1) also estimates the interaction effect between the 
treatment and the Previous Achievement level. Table 5 summarizes 
the scaled estimates of the interaction effects and the p-values 
(given in parentheses). Since the Previous Achievement level has 
5 categories, we obtain estimates for all the levels except the 
baseline category, i.e., Previous Achievement level 1, which is 
absorbed in the intercept of the model. As usage increases, 
Table 5 displays more significant interaction effects (in bold 
font) between treatment and achievement level than for the low 
usage groups. Note that due to lack of data in selected 
categories, some of the interaction effects could not be 
estimated and are hence left blank. 


Note that most of the interaction effects are not statistically 
significant. There are selected ones with a positive coefficient, 
corresponding to higher achievement levels (3 and above) for high 
usage groups (e.g., 33 and 65). Analogously, there are selected 
interaction effects with a negative coefficient, corresponding to 
the lower achievement level 2 at relatively high usage levels. 


To answer the second research question on the relationship 
between the performance of the students in mini-assessments 
and in the EoC test, we obtain the Average Mini-Assessments 


Table 3: Usage-wise effect of the recommendation: p-values 
and scaled coefficients for different usage levels 


Usage Level | p-value | Scaled Coefficient | Sample size 
9           | 0.64    | 0.46               | 1097 
13          | 0.40    | 0.85               | 932 
27          | 0.29    | 1.07               | 515 
33          | 0.17    | 1.39               | 411 
48          | 0.02    | 2.41               | 254 
52          | 0.01    | 2.56               | 230 
55          | 0.05    | 1.93               | 207 
59          | 0.06    | 1.91               | 183 
65          | 0.02    | 2.31               | 140 
74          | 0.08    | 1.78               | 92 


Table 4: Usage-wise effect of the Previous Achievement 
level: scaled coefficients (p-values) for different usage levels 


Usage Level | Level 2               | Level 3               | Level 4        | Level 5 
9           | [illegible] (<0.001)  | [illegible] (<0.001)  | 10.67 (<0.001) | 15.76 (<0.001) 
13          | 4.94 (<0.001)         | 7.5 (<0.001)          | 10.74 (<0.001) | 15.37 (<0.001) 
27          | 1.87 (0.06)           | 2.64 (0.008)          | 4.61 (<0.001)  | 6.76 (<0.001) 
33          | 2.45 (0.01)           | 3.02 (0.002)          | 4.72 (<0.001)  | 6.19 (<0.001) 
48          | 3.23 (0.001)          | 3.19 (0.001)          | 4.18 (<0.001)  | 5.41 (<0.001) 
52          | 3.16 (0.002)          | 3.29 (0.001)          | 4.16 (<0.001)  | 0.42 (<0.001) 
55          | 2.97 (0.003)          | 3.07 (0.002)          | 3.97 (<0.001)  | 5.41 (<0.001) 
59          | 2.05 (0.04)           | 1.92 (0.05)           | 2.65 (0.008)   | 3.08 (<0.001) 
65          | -                     | 0.13 (0.89)           | 1.83 (0.06)    | 4.15 (<0.001) 
74          | -                     | 0.50 (0.62)           | 1.34 (0.18)    | 2.70 (0.008) 


Table 5: Usage-wise interaction effect of treatment and Previous 
Achievement level: scaled coefficients (p-values) for 
different usage levels 


Usage Level | TC * Level 2 | TC * Level 3 | TC * Level 4 | TC * Level 5 

“UAT “U-20 “00 “0-06 

9 (0.64) (0.84) (0.69) (0.94) 
-T.08 “0.68 081 “0.34 

B (0.27) (0.49) (0.42) (0.74) 
0.62 19 05 T.60 

27 (0.54) (0.23) (0.29) (0.11) 
0.78 AT 732 TIS 

33 (0.43) (0.14) (0.19) (0.03) 
“ZBI =T-21 ~Z.05 

48 (0.005) (0.22) (0.04) 
“IBS ~Ta7 “ZS 

52 (0.004) (0.14) (0.03) 
~ZAY -LI8 -L73 

55 (0.01) (0.24) (0.08) 
~ZAU “U-71 ~T.68 

59 (0.02) (0.48) (0.09) : 

ZG ZL TBD 
65 E (0.01) (0.04) (0.005) 
“U.98 “1.28 
74 - (0.32) (0.20) 
Score for each of the ~11,000 registered students, wherein 
the average is computed over the mini-scores for all the mini- 
assessments the student has completed. 


Then, the following Analysis of Covariance model is fitted to 
the data. To control for the students’ math preparedness and 
school characteristics, we include the cluster information as 
a factor in the model. 


$$y_{ij} = \mu + \alpha_i + \beta x_{ij} + \epsilon_{ij}, \tag{2}$$


wherein $y_{ij}$ is the EoC score and $x_{ij}$ is the Average 
Mini-Assessments Score for the $j$-th student in the $i$-th 
cluster. Further, $\mu$ is the overall mean effect and $\alpha_i$ 
is the additional effect due to the assignment of the student to 
the $i$-th cluster that accounts for prior math knowledge, grade 
and school characteristics of the students. 
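A minimal sketch of how a model of this form can be fitted by least squares is given below. It relies on the standard result that, in a one-way Analysis of Covariance with a common slope, the slope estimate equals the pooled within-cluster slope and each cluster's intercept follows from its means. The function name and the data are hypothetical, and the sketch does not compute standard errors.

```python
from statistics import mean


def ancova_common_slope(data):
    """Least-squares fit of y_ij = mu + alpha_i + beta * x_ij + error.

    `data` maps cluster id -> list of (x, y) pairs. The common slope
    beta is the pooled within-cluster slope; each cluster's intercept
    (mu + alpha_i) is its mean of y minus beta times its mean of x.
    """
    sxy = sxx = 0.0
    centers = {}
    for cluster, pairs in data.items():
        x_bar = mean(x for x, _ in pairs)
        y_bar = mean(y for _, y in pairs)
        centers[cluster] = (x_bar, y_bar)
        sxy += sum((x - x_bar) * (y - y_bar) for x, y in pairs)
        sxx += sum((x - x_bar) ** 2 for x, _ in pairs)
    beta = sxy / sxx
    intercepts = {c: y_bar - beta * x_bar for c, (x_bar, y_bar) in centers.items()}
    return beta, intercepts


# Made-up data: (average mini-assessment score, EoC score) per cluster.
data = {
    1: [(1.0, 466.0), (2.0, 468.0), (3.0, 470.0)],
    3: [(1.0, 496.0), (2.0, 498.0), (3.0, 500.0)],
}
beta, intercepts = ancova_common_slope(data)
print(beta, intercepts)
```

With a reference cluster as baseline, the difference between a cluster's intercept and the baseline intercept plays the role of the cluster coefficient in the fitted model.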


Table 6 reports the estimated regression coefficients and their 
standard errors, together with the value of the test statistic 
and the p-value of the significance test for each of the 
coefficients. All p-values are far below the nominal 0.05 (or 
0.01) level, indicating that each predictor has a significant 
effect on the EoC test score. The estimated coefficient for the 
mini-assessment is 1.15 (the corresponding scaled regression 
coefficient is 7.46). This small but statistically significant 
coefficient indicates that an increase of one point in the 
average student score on the mini-assessment corresponds to an 
expected improvement of 1.15 points in the EoC score. At first 
glance, this relationship between the average mini-score 
performance and the EoC test seems of limited practical 
significance. However, when examining the distribution of EoC 
scores across all students (~90,000) that used the Math Nation 
platform at some point in time (not necessarily participants in 
the current study), we find that about 1.9% are within 1 point of 
the passing threshold. Hence, in light of this information, it is 
reasonable to posit that the recommendation algorithm would have 
been beneficial for a good number of students, if it were adopted 
and used by all platform participants. 
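The scale of this argument can be checked with a back-of-the-envelope calculation using the two figures quoted above. The result is only the approximate number of students near the threshold; it is not an estimate of how many would actually have passed.

```python
platform_students = 90_000      # approximate count reported in the text
share_near_threshold = 0.019    # about 1.9% within 1 point of passing

students_near_threshold = round(platform_students * share_near_threshold)
print(students_near_threshold)
```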


Table 6: Results of the Analysis of Covariance model: Response 
EoC Score; categorical predictor cluster and numerical 
predictor Average Mini-Assessments Score 


Coefficients          | Estimate | Std. Error | t-value | p-value 
Intercept             | 464.29   | 2.61       | 178.18  | <2e-16 
Cluster 3             | 29.79    | 3.63       | 8.21    | 3.9e-16 
Cluster 4             | 29.61    | 2.77       | 10.70   | <2e-16 
Cluster 5             | 37.06    | 2.92       | 12.70   | <2e-16 
Cluster 6             | 38.99    | 3.13       | 12.46   | <2e-16 
Cluster 7             | 36.84    | 2.77       | 13.29   | <2e-16 
Cluster 8             | 49.74    | 2.92       | 17.02   | <2e-16 
Cluster 9             | 53.47    | 2.87       | 18.62   | <2e-16 
Cluster 10            | 73.48    | 2.90       | 25.33   | <2e-16 
Cluster 11            | 68.89    | 3.21       | 21.49   | <2e-16 
Cluster 12            | 75.34    | 2.88       | 26.13   | <2e-16 
Avg. Mini-Assessments | 1.15     | 0.15       | 7.46    | 1.3e-13 


5. DISCUSSION 


The analysis of the data from the randomized control study 
provides a number of useful insights for designing recommendation 
strategies for IVLE. Firstly, the recommendation algorithm holds 
a lot of promise but, as is well known in reinforcement learning, 
it requires an adequate amount of usage to “explore” various 
possibilities in order to maximize expected reward. The adequate 
usage requirement is also discussed in the literature evaluating 
recommendation strategies for Massive Open Online Courses; see 
[6] and references therein. As mentioned in Section 3, an initial 
evaluation of the proposed algorithm during its development 
phase, based on synthetic data, indicated that it starts yielding 
satisfactory results, in terms of students improving their 
performance on the mini-assessments, once students follow its 
recommendations more than 45 times. The results of the analysis 
in Section 4 are in line with the aforementioned finding. As 
Table 3 indicates, the recommendation strategy shows significant 
impact starting from a usage level of 48. Further, note that in 
our study the primary outcome under consideration is the EoC test 
that takes place at the end of the academic year, as opposed to a 
more direct outcome related to the recommendation algorithm, such 
as performance over time on the mini-assessment tests. In many 
studies in the literature (e.g., [22]), assessment of a 
recommendation algorithm was based on more immediate outcomes 
(e.g., the mini-assessments in our setting), as opposed to a more 
distal outcome, such as the EoC. Nevertheless, the results of our 
experiment indicate that with stronger student engagement the 
developed algorithm could be more widely beneficial. 


To address the issue of low usage, a new experiment has 
been designed, wherein the teachers are directly involved 
in the implementation of the recommendation system in the 
classroom, which is expected to yield higher levels of engage- 
ment of students with the IVLE platform. This experiment 
is under way at the time of this publication. 


It is also worth mentioning that our first analysis was of the 
“Intent-to-Treat” type, because it evaluated the effect of being 
randomly assigned to the treatment or control group without 
considering the extent to which students used the recommendation 
strategy. In contrast, traditional Complier Average Causal Effect 
analysis is based on the “Treatment-on-the-Treated” principle, 
wherein one estimates the treatment effect for those who complied 
with the treatment. The latter constitutes a direction of future 
research. 
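The distinction between the two estimands can be sketched with the textbook Wald-type estimator, in which the complier average causal effect equals the Intent-to-Treat effect divided by the compliance rate. This holds under one-sided noncompliance and the usual instrumental-variable assumptions, which are assumptions of this sketch and not claims about the study design; in particular, how "compliance" would be defined (for example, reaching some usage level) is left open in the text, and the data below are made up.

```python
from statistics import mean


def itt_and_cace(records):
    """Intent-to-Treat effect and Wald-type complier average causal effect.

    `records` is a list of (assigned, complied, outcome) tuples, where
    `assigned` and `complied` are booleans. Assumes one-sided
    noncompliance: nobody assigned to control receives the treatment.
    """
    treated = [r for r in records if r[0]]
    control = [r for r in records if not r[0]]
    itt = mean(r[2] for r in treated) - mean(r[2] for r in control)
    compliance_rate = mean(1.0 if r[1] else 0.0 for r in treated)
    cace = itt / compliance_rate
    return itt, cace


# Hypothetical outcomes: half of the assigned students comply.
records = [
    (True, True, 480.0), (True, False, 470.0),
    (True, True, 482.0), (True, False, 472.0),
    (False, False, 471.0), (False, False, 473.0),
]
itt, cace = itt_and_cace(records)
print(itt, cace)
```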


Another issue of broader interest is that many IVLE recommendation 
algorithms are designed to assign test problems in an adaptive 
way, as opposed to assigning videos, as Math Nation does. However, 
in the modified implementation of the algorithm currently under 
evaluation, the student can skip watching the recommended video 
and take the mini-assessment directly; if the student answers 
fewer than two of the questions correctly, the algorithm 
recommends watching the segment of the video that covers the 
corresponding material and then retaking the mini-assessment. 
This modification aims to strengthen the emphasis of the 
recommendation algorithm on solving problems, while still enabling 
students to review material relevant to the questions that they 
answered incorrectly. 
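The rule described above can be sketched as a small decision function. The function name, the step labels, and the "continue" branch are hypothetical; the text only specifies what happens when fewer than two questions are answered correctly.

```python
def next_step(num_correct, min_correct=2):
    """Decide what to recommend after a direct mini-assessment attempt.

    Mirrors the rule described in the text: fewer than `min_correct`
    correct answers triggers a review of the relevant video segment
    followed by a retake. The "continue" outcome is a placeholder.
    """
    if num_correct < min_correct:
        return ["watch_video_segment", "retake_mini_assessment"]
    return ["continue"]


print(next_step(1))
print(next_step(3))
```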
ABSTRACT 


There are a number of novel exercise types that students can 
utilize while learning Computer Science, each with its own level 
of complexity and interaction as outlined by the ICAP Frame- 
work [10]. Some are Interactive, like solving coding problems; 
Constructive, like explaining code; Active, like retyping source 
code; and Passive, like reviewing slides. To date, there has been 
little research on how students vary their study and engagement 
habits by exercise type and when they do so. In this paper, we 
present our findings on student activity sequences from an online 
professional development course. We isolated student activities 
into sessions and then produced activity transition visualizations 
to compare the behavior of students who complete the course 
to those who do not. We then used multiple factor analyses to 
examine how students transition from one type of activity to the 
next. From this analysis we identified platform silos in students’ 
work. We further extend this concept to activity silos, in which 
sessions group by activity type. We find that this siloing behavior is 
consistent in both completers and non-completers but is weaker 
for the latter group. Finally, we discuss our findings and how 
instructors and researchers may use this information to ensure 
that students show persistence through practice. 


Keywords 
novel exercises, ICAP framework, study sessions, activity se- 
quences, platform silos, activity silos, student modeling 


1. INTRODUCTION 


Computer Science (CS) education researchers have introduced a 
number of novel exercise types to better scaffold students’ 
experiences. These include retyping source code [14], arranging 
scrambled code fragments (Parsons Puzzles) [21], debugging 
provided code [8], predicting output [26], fill in the blanks [4], 
self-explanation [4], and small scale coding exercises [2, 12]. 
Each of these exercise types can also be mapped onto the ICAP 
framework [10]. This framework defines four categories of 
instructional activities based upon students’ level of 
engagement: Interactive, Constructive, Active, and Passive. 
Passive learning includes reading static course materials or 
watching lecture videos. Active learning is described as 
rehearsing or copying solution steps. Constructive learning 
includes self-explanation of content or creation of novel 
externalized outputs like summaries. Finally, Interactive 
learning involves directly engaging with a peer, agent, or 
instructor to explore information and receive feedback which can 
be expanded upon. 


While these exercise types have made their way into classrooms, 
there is little evidence of how these types of engagement interact 
with one another. Traditional intervention studies have focused on 
the overall impact of one or more exercise types [20, 19, 14, 8], or 
on the automated selection/recommendation of future exercises 
based upon a student model [27], but not on how students work 
with or across them in the absence of guidance. Nor has this 
recommendation work been extended to nontraditional learning 
contexts. Absent an understanding of how students orchestrate 
multiple interaction modes, we face challenges in scaffolding 
effective learning opportunities and in evaluating the impact of 
novel learning environments. Providing students with ineffective 
or overly complex learning opportunities risks trapping them in a 
fail/skip practice cycle that would inhibit any functional 
learning gains [17]. 


In this paper we report our investigation of how students di- 
rect their practice of CS concepts when presented with a set of 
options. Our study was conducted in the context of an online 
professional development course for Python programming. This 
course is part of a research study funded by the Department of 
Labor to create novel learning pathways for existing technical 
professionals to move into AI and Data-Science areas. We ex- 
tracted students’ practice/study sessions and analyzed the activity 
transitions within each session. We then analyzed these activity 
sequences to answer the following research questions (RQs): 


RQ1 Can we replicate the existence of platform silos introduced 
in [1] with a new dataset? 

RQ2 Are there common activity transitions between students? 

RQ3 How do activity sequences connect with the ICAP Frame- 
work? 

RQ4 How do the practice sessions of completers of the course 
differ from non-completers? 


To answer these questions, we first produced and analyzed 
a set of activity transition diagrams for students in the course, 
comparing those who completed the course to those who did not. 
Through this analysis, we confirmed the presence of platform silos 
and extended this notion to include activity silos, where students 
primarily focus on a single mode of engagement (consistent 
with ICAP) during a given practice session. When students did 
transition between modes, it was only to move up the ICAP chain 
and never to ‘downgrade’ to a lower level of engagement. We 
support our findings through two different factor analyses, which 
explain 42-62% of the variance between the sessions. From 
our results, we found ICAP categories isolated into individual 
sessions, as well as LMS content consumption and quiz taking. 


2. BACKGROUND 


2.1 ICAP Framework 

The ICAP Framework seeks to classify the modes of student 
engagement in learning activities [10]. The categories are 
Interactive, Constructive, Active, and Passive. Passive 
engagement includes activities such as reading a text or 
observing a video. Active engagement includes rehearsing 
steps or copying solutions. Constructive engagement includes 
self-explanation or comparing and contrasting materials. Finally, 
Interactive engagement includes responding and interacting with 
an agent, system, or another person. The framework is 
hierarchical, suggesting I > C > A > P; that is, activities with 
higher levels of engagement promote greater levels of learning. 


Chi and Wylie present a literature review of several empirical 
studies supporting their ICAP hypothesis [10]. The first study 
consisted of all four modes of engagement in materials science 
and showed learning improved significantly at a rate of 8-10% 
per mode. They then present two studies which used active, 
passive, and constructive modes in evolutionary biology and plate 
tectonics. Finally, Chi and Wylie present comparisons of two 
modes across note taking, concept mapping, and self-explanation. 
In each of these studies, the results again showed that higher 
modes of engagement had higher learning gains. 


Chi and Wylie’s work, as well as our own personal communica- 
tions with Chi [9], note that identifying the mode of a particular 
activity can be a non-trivial task. For example, a toy example 
task could be presenting steps toward making a peanut butter and 
jelly sandwich given randomly shuffled segments of instructions. 
This task could be construed as an active exercise if the student 
already knows the recipe and the task is simply picking the appro- 
priate sequence from the list of steps. However, if the student had 
not learned the appropriate order, then the task of figuring out 
the right sequence would be construed as a constructive exercise. 
Computer Science has a similar task, known as Parsons Puzzles, 
that mirrors this toy example, discussed in more detail in the 
following section. 


2.2 Novel Exercise Types 

In this section, we describe eight different exercise types studied 
in Computer Science education, provide some research background 
on each exercise type, and justify the ICAP mode we assign to it 
for our study. 


2.2.1 Typing Exercises 

Typing Exercises (TE) require students to retype source code 
that has been presented to them [14]. Typing Exercises can be 
used as active learning activities under the ICAP Framework, 
as they require students to retype verbatim the code presented 
to them. Previously, we presented images of source code and 
showed that self-selected students who completed optional typing 
exercises earned higher course grades and submitted less code 
with build failures. Leinonen et al. also presented optional 
typing exercises to students before programming tasks [19], but 
were not able to replicate our results. However, their study 
lasted only two weeks and many of their selected participants did 
not attempt the exercises at all. 


2.2.2 Fill in the Blank 


Fill in the Blank (FitB) exercises remove a small portion of code 
from a snippet and ask students to ‘fill in’ the blank. Students 
need to have an understanding of the snippet as a whole to 
deduce what needs to be included at a particular blank location. 
Reviewing incomplete worked examples reduces ineffective 
self-explanations and enhances the transfer of learned materials 
[4, 5]. Based on the results from Atkinson et al., we consider 
FitB-style exercises to require lower levels of engagement than 
self-explanation, and thus classify them as an active ICAP mode. 


2.2.3 Parsons Puzzles 

Parsons Puzzles (PP) present snippets of code that have been 
separated into segments and then shuffled in order [21]. Students 
are then tasked with placing the segments back into the correct 
order. While Parsons Puzzles are helpful for learning how to 
structure code, performance on Parsons Puzzles has not been 
shown to correlate with students’ ability to read or trace code 
[11, 20] and ‘distractor’ variants are not beneficial to young learn- 
ers [13, 16]. These findings further support our research goals, 
as not every exercise type may be beneficial for learning all the 
technical skills necessary for Computer Science. 


Chi and Wylie’s definition of constructive modes of engagement 
includes “learners generate or produce additional externalized 
outputs... beyond what is provided”. As mentioned in the previous 
section, Parsons Puzzles are similar to our toy peanut butter and 
jelly exercise. While Parsons Puzzles could be construed as active, 
students may not fully comprehend the appropriate order of code 
syntax and must figure out the right sequence as part of the 
exercise. In our communications with Chi about Parsons Puzzles, Chi 
stated that determining the particular ICAP mode for an activity 
can be non-trivial [9]. ‘If a student already [knows] the recipe, then 
re-ordering it is just picking out the sequencing, guided by the 
sequence information [they] already know’; in that case, the 
exercise is active. However, ‘if the student [has] not learned the 
order from some other source, and you are asking her to figure out 
the right sequence’, it is a constructive exercise. Since novices 
may not have proficiency at the time of the exercise, we elect to 
use the upper bound ICAP mode and consider Parsons Puzzles a 
constructive learning activity. 


2.2.4 Output Prediction 

Output Prediction (OP) exercises, also known as variable tracing, 
ask students to analyze code and then state the expected outputs 
of code execution or the expected value of a variable as the code 
progresses. Often, variable tracing is done during instruction on 
conditionals and loops to demonstrate how the values of the 
variables change after each iteration. Like Parsons Puzzles, 
OP-style exercises require students to process code snippets and 
externalize their expected outputs. Thus, we consider output 
prediction as another constructive learning activity. 


2.2.5 Self-Explanation 

Self-explanation (SE) exercises present students with source code 
and ask them to explain how the code operates, describe 
the overall efficiency of the code, or create a documentation string 
to appear as a comment for the program or function. These 
are open-ended exercises that are subjective in nature and are 
considered to be constructive [10]. However, Chi and Wylie do 
note that students may treat the self-explanation activity as 
active if “the student’s self-explanation is verbatim to what was 
read”. Moreover, novices may struggle with reading and evaluating 
programming code in a linear fashion, focusing more on what 
each line of code does than on how the lines interact with 
each other, or in general produce poor explanations [25, 5]. While 
constructive SE activities may produce higher learning gains than 
lower-level modes, they may also not be the most appropriate 
activity for students who are struggling. Thus, similar to our 
decision for Parsons Puzzles, we use the upper bound to classify 
self-explanation as a constructive learning activity. 


2.2.6 Find and Fix the Bug 

Resolving errors, or debugging, is one of the first hurdles students 
encounter when learning to program [3]. Once they find that 
an error has occurred, it will be necessary for them to resolve 
it before addressing any remaining subgoals for their solution. 
For the purposes of our study, we separated debugging into two 
separate activities: Find the Bug (FnB) and Fix the Bug (FxB). 
Find the Bug exercises present students with code that contains a 
common misconception for novices. Instead of resolving the error, 
students are asked to highlight the area of code where this error 
exists. Though the ICAP framework considers highlighting text 
as an active learning activity, Chi and Wylie define constructive 
behaviors as requiring some level of “inference”, or adding 
additional detail or qualification. Since students must assess 
whether a line of code is ‘correct’ or not, they are producing 
qualifications; therefore, we elect to label FnB exercises as 
constructive. 


Fix the Bug exercises follow the natural progression of debugging 
tasks by requiring students to resolve broken code [8]. FxB 
activities could be considered constructive or interactive, 
depending on the context. Similar to FnB exercises, Chi and Wylie 
include ‘repairing’ as a constructive behavior. However, FxB-style 
exercises can also be interactive because students often rely on the 
code interpreter’s feedback during the debugging process. Fixing 
one error may produce new errors with new feedback, or the 
repair made by the student could be incorrect. Thus, we again 
choose to use the upper bound and label FxB exercises as interactive. 


2.2.7 Coding Exercises 

The final exercise type we used in our study is the de facto 
standard of introductory CS courses: the Coding Exercise (CE). 
While Computer Science is more than programming, coding 
exercises are often used by instructors as graded course material 
for students to demonstrate their understanding of the current 
course topic. There has been work on the use of ‘many-small 
programs’ and ‘simple syntax exercises’, which simply require 
students to complete small-scale coding exercises to become 
familiar with a particular implementation before utilizing it as 
part of larger-scale problems [2, 12]. In both cases, completing 
these smaller-scale programming exercises improved student 
performance and yielded happier students. Students often rely on 
feedback from the interpreter as they construct their solutions 
and more than likely need to debug their own work during this 
process. Thus, we consider Coding Exercises as an interactive 
learning activity. 
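The classification decisions made in Sections 2.2.1 through 2.2.7 can be summarized in a small lookup table. The sketch below encodes the ICAP hierarchy as ordered integers and records the mode assigned to each exercise type; the class, dictionary and function names are hypothetical and introduced only for illustration.

```python
from enum import IntEnum


class ICAP(IntEnum):
    """ICAP modes ordered by level of engagement (I > C > A > P)."""
    PASSIVE = 1
    ACTIVE = 2
    CONSTRUCTIVE = 3
    INTERACTIVE = 4


# Mode assigned to each exercise type in this section.
EXERCISE_MODE = {
    "TE": ICAP.ACTIVE,         # Typing Exercises
    "FitB": ICAP.ACTIVE,       # Fill in the Blank
    "PP": ICAP.CONSTRUCTIVE,   # Parsons Puzzles (upper bound)
    "OP": ICAP.CONSTRUCTIVE,   # Output Prediction
    "SE": ICAP.CONSTRUCTIVE,   # Self-Explanation (upper bound)
    "FnB": ICAP.CONSTRUCTIVE,  # Find the Bug
    "FxB": ICAP.INTERACTIVE,   # Fix the Bug (upper bound)
    "CE": ICAP.INTERACTIVE,    # Coding Exercises
}


def is_upgrade(previous, current):
    """True if a transition moves to a strictly higher ICAP mode."""
    return EXERCISE_MODE[current] > EXERCISE_MODE[previous]


print(is_upgrade("TE", "CE"), is_upgrade("CE", "PP"))
```

A helper such as `is_upgrade` is one way to check whether a transition between two exercises moves up or down the ICAP hierarchy.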


2.3 Student Modeling and Activity Mining 

Seshadri et al. analyzed how student study sessions operated 
across multiple platforms for three separate courses [1]. They 
found that, given multiple educational platforms, students 
will often operate within platform silos, that is, utilize only one 
educational platform during an individual study session. More 
than 90% of the student sessions included only one platform. In a 
follow-up study, they compared the activities of higher performing 
students to those of the lower performing group and showed that both 
groups were most likely to stay within platform silos [15]. Their 
work serves as the motivation for our RQ1. Our hypothesis is 
that the presence of these ‘platform silos’ will continue to hold 
across other educational platforms not studied in their research. 
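The notion of a platform silo can be made concrete with a short sketch that measures the share of sessions confined to a single platform. The session representation and the example logs are hypothetical; the platform names are used only as labels.

```python
def single_platform_share(sessions):
    """Fraction of sessions whose interactions all use one platform."""
    if not sessions:
        return 0.0
    siloed = sum(1 for platforms in sessions if len(set(platforms)) == 1)
    return siloed / len(sessions)


# Each session is the sequence of platforms touched (hypothetical logs).
sessions = [
    ["Moodle", "Moodle", "Moodle"],
    ["TYPOS", "TYPOS"],
    ["Moodle", "TYPOS", "Moodle"],
    ["TYPOS"],
]
share = single_platform_share(sessions)
print(share)
```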


3. STUDY 
3.1 Design 


We studied problem solving in the context of an online professional 
development course in Python programming. This is a preparatory 
course for a series in AI that is aimed at non-traditional students 
making a career transition. The course used the Moodle Learning 
Management System and TYPOS, a CS exercise platform [14], and 
is organized into 10 modules. Each module includes a set of static 
reading material, lecture slides, prerecorded videos, optional 
practice exercises, and a module assessment. There were 24 optional 
exercises per module, 3 exercises for each of the 8 types previously 
described. Students were free to work on the practice exercises or 
assessments as much as they liked. The only requirement for 
progressing to the next module was to earn a passing grade (80% or 
higher) on the prior module’s assessment. In order to complete the 
course, students needed to earn passing grades on all assessments. 


We had 69 students consent to the study. Of those students, 37 
successfully completed the course. Student interactions on both 
platforms were logged. We omitted some Moodle interactions, 
such as “file downloading” and “viewing the course”, which did 
not pertain to the explicit learning actions we were focused on. 


The resulting dataset contained a total of 29,190 interactions 
from all the students. We then used a similar strategy as [1] 
to extract user sessions from these interactions based upon an 
exploratory analysis of the gaps between interactions. The time 
deltas between course interactions were measured. If a delta 
between interactions exceeded a predefined cutoff threshold, that 
session was considered over, and a new session was created. This 
was repeated for all interactions a student had for the course. 
While Seshadri et al. used a 40-minute cutoff to establish the end 
of one session and the start of the next, we chose 60 minutes as it was our most 
frequently observed delta between interactions. We extracted 1,313 
sessions in total. Students that completed the course accounted 
for 71%, or 20,748, of the course interactions, with 1,041 sessions 
total, at an average of 28.1 (±17.9) sessions per completer. 
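
The session extraction described above can be sketched as follows. This is a minimal illustration, not the code used in the study: the function name `split_sessions` and the sample timestamps are hypothetical, and only the 60-minute cutoff comes from the text.

```python
from datetime import datetime, timedelta


def split_sessions(timestamps, cutoff=timedelta(minutes=60)):
    """Group one student's interaction timestamps into sessions.

    A new session starts whenever the gap to the previous interaction
    exceeds `cutoff` (60 minutes in this study; [1] used 40 minutes).
    """
    sessions = []
    for ts in sorted(timestamps):
        if sessions and ts - sessions[-1][-1] <= cutoff:
            sessions[-1].append(ts)  # gap within the cutoff: same session
        else:
            sessions.append([ts])    # first interaction, or gap too long
    return sessions


# Hypothetical timestamps for a single student (gaps of 20, 55 and 105 minutes)
log = [
    datetime(2021, 1, 4, 9, 0),
    datetime(2021, 1, 4, 9, 20),
    datetime(2021, 1, 4, 10, 15),
    datetime(2021, 1, 4, 12, 0),
]
print([len(s) for s in split_sessions(log)])  # [3, 1]
```

With a 40-minute cutoff, the same log would split into three sessions, which shows how sensitive the session count is to the chosen threshold.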


3.2 Activity Session Transition Probabilities 
The route that students take through online materials can be 
modeled as discrete Markov processes, in which each state rep- 
resents an activity within the session. For example, a student 
may transition from reviewing lecture slides to viewing lecture 
videos on Moodle, or MS → MV. Jeffries et al. [17] used a 
similar process to analyze success and help seeking behaviors with 
students in an introductory CS course. 
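
As an illustration of this modeling step, the sketch below estimates first-order transition probabilities from session sequences. The `start` and `end` pseudo-states and the sample sessions are assumptions made for the example; the activity labels are those used in Figure 1.

```python
from collections import Counter, defaultdict

START, END = "start", "end"  # pseudo-states marking session boundaries


def transition_probabilities(sessions):
    """Estimate first-order Markov transition probabilities.

    `sessions` is a list of activity-label sequences, one per session.
    Returns {source: {target: probability}}; each row sums to 1.
    """
    counts = defaultdict(Counter)
    for activities in sessions:
        path = [START, *activities, END]
        for src, dst in zip(path, path[1:]):
            counts[src][dst] += 1
    return {
        src: {dst: n / sum(dsts.values()) for dst, n in dsts.items()}
        for src, dsts in counts.items()
    }


# Hypothetical sessions, not taken from the study's dataset
sessions = [["MS", "MV"], ["MS", "TE"], ["SE", "CE"], ["MA"]]
probs = transition_probabilities(sessions)
print(probs["MS"])   # {'MV': 0.5, 'TE': 0.5}
print(probs[START])  # {'MS': 0.5, 'SE': 0.25, 'MA': 0.25}
```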


Figure 1 visualizes the transition probabilities for completers: 
transitions within TYPOS appear as dashed blue lines, transitions 
within Moodle are solid red lines, and transitions between TYPOS 
and Moodle are solid black lines. For visibility, only transitions 
involving the start/end of a session or those with a frequency 
above 5% are presented. Module assessment (MA) accounted 
for 39% of starting session behavior, TYPOS practice accounted 
for 36%, and lecture slides and videos (Content Consumption) 
accounted for 26%. This figure shows that students’ practice was 
largely siloed by platform, with each session taking place within a 
single mode of interaction. With the exception of the MS → TE 
transition, students interacted with either TYPOS or Moodle, 
but rarely both together. Since MA showed the highest starting 
session probability, one possible explanation is that students who enrolled in the 
course with prior coding experience may have reviewed the course 
material to become familiar with Python syntax before going on 
to complete the module assessment. 

Platform | Label | Activity Description 
Moodle   | MS    | Lecture Slides 
Moodle   | MV    | Lecture Videos 
Moodle   | MA    | Module Assessment 
TYPOS    | TE    | Typing Exercises 
TYPOS    | FitB  | Fill in the Blank 
TYPOS    | PP    | Parsons Puzzles 
TYPOS    | OP    | Output Prediction 
TYPOS    | FnB   | Find the Bug 
TYPOS    | FxB   | Fix the Bug 
TYPOS    | SE    | Self-Explanation 
TYPOS    | CE    | Coding Exercise 


Figure 1: Completer activity transition probabilities during sessions. 


One interesting observation from Figure 1 is that TYPOS 
practice sessions involved only one or a small subset of the 
exercise types available to the students. For example, the TE, PP, 
and FitB exercises were typically completed alone. Among pairs of 
exercises, SE → CE, OP → FnB, and FnB → FxB showed 
higher probabilities than ending the session. 


We can further classify the activities by their respective ICAP 
modes. For example, the SE → CE transition can also be viewed 
as a Constructive → Interactive transition, MS → TE can be 
viewed as Passive → Active, and so on. From this perspective, 
transitions between activities rarely ‘downgraded’ to a lower mode. 
With the exception of MS → MA and MV → MA, mode changes 
primarily shifted by one level of engagement. 


3.3 Activity Session Factor Analysis 

The probabilities shown in Figure 1 raise new questions about 
students’ practice behaviors. Not only were we able to confirm 
the existence of platform silos from our RQ1, but based on the 
observed transition probabilities, we found that students may also 
operate within what we term an activity silo, focusing primarily 
on one ICAP mode per session. In this section we report on two 
factor analyses which we use to identify the latent variables for 
each practice session in order to strengthen our claim. 


Factor analysis is a dimension reduction method that describes 
the variability of observed variables in terms of a potentially smaller 
number of latent variables, or factors. To prepare our dataset for factor analysis, 
each activity was converted into a binary value, representing the 
presence or absence of the activity in the session. For example, if 
a session only involves passive content consumption, the resulting 
vector for the session would be [1,1,0,0,0,0,0,0,0,0,0], where 
the 1s represent the presence of lecture slides and videos and the 0s 
represent the absence of all other activities. 
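
A minimal sketch of this encoding is shown below. The ordering of `ACTIVITIES` is an assumption that follows the row order of the loading tables; the helper name `session_vector` is hypothetical.

```python
# Activity labels in the (assumed) column order used for the binary vectors
ACTIVITIES = ["MS", "MV", "MA", "TE", "FitB", "PP", "OP", "FnB", "FxB", "SE", "CE"]


def session_vector(session_activities):
    """Return a 0/1 vector marking which activities occur in a session."""
    present = set(session_activities)
    return [1 if activity in present else 0 for activity in ACTIVITIES]


# A session that only involves passive content consumption
print(session_vector(["MS", "MV", "MS"]))  # [1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0]
```

Repeated activities within a session collapse to a single 1, so the vector records presence only, not frequency.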


In order to evaluate the appropriateness of our data for factor 
analysis, we used Bartlett’s test and the Kaiser-Meyer-Olkin (KMO) test. 
Bartlett’s test compares our correlation matrix against an identity 
matrix to test whether the observed variables are uncorrelated. 
The result was statistically significant 
(χ² = 3360.05, p = 0.0), and thus we can continue with our factor 
analysis. The Kaiser-Meyer-Olkin test checks the sampling adequacy of our 
variables to determine the suitability of factor analysis. Our KMO 
score was 0.83, which again shows our dataset is adequate. 
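
For reference, both adequacy statistics have standard closed forms, sketched below. The symbols are introduced here for illustration only and are assumptions about the setup: n is the number of sessions, p the number of binary activity variables, R the sample correlation matrix with entries r_ij, and u_ij the partial correlations between variables i and j.

```latex
% Bartlett's test of sphericity (null hypothesis: R is the identity matrix)
\chi^2 = -\left(n - 1 - \frac{2p + 5}{6}\right) \ln\lvert R \rvert ,
\qquad \mathrm{df} = \frac{p(p-1)}{2}

% Kaiser-Meyer-Olkin measure of sampling adequacy
\mathrm{KMO} = \frac{\sum_{i \neq j} r_{ij}^{2}}
                    {\sum_{i \neq j} r_{ij}^{2} + \sum_{i \neq j} u_{ij}^{2}}
```

With p = 11 activities, Bartlett’s test has 11 · 10 / 2 = 55 degrees of freedom, and a KMO value close to 1 indicates that partial correlations are small relative to the raw correlations.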


The next step for our analysis was to determine the appropriate 
number of factors. Table 1 shows the eigenvalues for each factor 
and their cumulative variance. Based on these results, we carried out 
two separate factor analyses. The first analysis uses 3 factors, 
corresponding to the 3 eigenvalues greater than 1, as suggested by 
Kaiser [18]. The second analysis increases to 5 factors based on 
the variance extraction rule, which specifies a 0.7 threshold for 
eigenvalues [6, 24, 23]. 
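
The two selection rules can be checked directly against the eigenvalues reported in Table 1. The helper `n_factors` below is a hypothetical name used for this sketch; only the eigenvalues and the two thresholds come from the text.

```python
# Eigenvalues for completers, as reported in Table 1
EIGENVALUES = [
    3.860695, 1.390921, 1.170659, 0.916401, 0.810977, 0.665925,
    0.649612, 0.486275, 0.446064, 0.391094, 0.211376,
]


def n_factors(eigenvalues, threshold):
    """Count the factors whose eigenvalue exceeds `threshold`."""
    return sum(1 for value in eigenvalues if value > threshold)


print(n_factors(EIGENVALUES, 1.0))  # 3 (Kaiser criterion)
print(n_factors(EIGENVALUES, 0.7))  # 5 (0.7 eigenvalue threshold)
```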


Table 1: Eigenvalues for Factor Analysis of Completers 


Factor | Eigenvalue | Cumulative Variance 
1 3.860695 30.51% 
2 1.390921 37.21% 
3 1.170659 41.79% 
4 0.916401 52.26% 
5 0.810977 62.27% 
6 0.665925 60.91% 
7 0.649612 69.59% 
8 0.486275 70.05% 
9 0.446064 55.23% 
10 0.391094 55.42% 
11 0.211376 55.42% 


Our next task was to identify loading thresholds for our 
latent variables. While there is no universal standard for loading 
thresholds, and choosing one is a non-trivial process [22], the goal is to 
retain only variables that share a strong association with each other. In 
this paper, we focus our attention on variables above a 0.50 threshold 
(or 25% of the variable’s variance), highlighting them 
in our tables in green. Since the difference between a 0.50 and a 
0.49 loading is minimal, we also highlight values greater than 
0.4 (or 16% variance) in yellow for additional reference. 


Table 2 shows the factor loadings for our 3-factor analysis. 
F1 has high loadings for OP, FnB, FxB, SE, and CE. If 
we consider the ICAP modes for this factor, this indicates a 
transition of Constructive → Interactive TYPOS Practice. F2 
has high loadings for FitB and PP, or Active → Constructive 
TYPOS Practice. Finally, F3 has a high loading for MS, or Passive 
Moodle Interaction. From Table 2, we can once again confirm the 
presence of platform silos; however, we expand our factor analysis 
in order to see the presence of activity silos. Moreover, 3 factors 
account for only 41.79% of the cumulative variance, whereas using 
the 0.7 eigenvalue threshold allows us to account for 62.27%. 
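
The highlighting rule can be illustrated with the F1 column of Table 2. The function name `highlight` is hypothetical; the thresholds and loadings are those reported in the paper, and the share of variance explained by a factor is the squared loading.

```python
# F1 loadings for completers, copied from Table 2
TABLE2_F1 = {
    "MS": 0.0306, "MV": -0.1006, "MA": -0.0269, "TE": 0.0193,
    "FitB": 0.4703, "PP": 0.3300, "OP": 0.6257, "FnB": 0.8215,
    "FxB": 0.8399, "SE": 0.6802, "CE": 0.5019,
}


def highlight(loading):
    """Map a loading to its table colour: green (> 0.5), yellow (> 0.4) or None."""
    if loading > 0.5:
        return "green"
    if loading > 0.4:
        return "yellow"
    return None


green = [name for name, value in TABLE2_F1.items() if highlight(value) == "green"]
yellow = [name for name, value in TABLE2_F1.items() if highlight(value) == "yellow"]
print(green)   # ['OP', 'FnB', 'FxB', 'SE', 'CE']
print(yellow)  # ['FitB']

# Squared loadings give the variance shares quoted for the two thresholds
print(round(0.5 ** 2, 2), round(0.4 ** 2, 2))  # 0.25 0.16
```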


Table 2: Loadings for 3 Factor Analysis for Completers. Values 
greater than 0.4 are in yellow and greater than 0.5 are in green. 


Activity F1 F2 F3 
MS 0.0306 -0.0294 0.5073 
MV -0.1006 -0.0999 0.3891 
MA -0.0269 -0.2862 0.2177 
TE 0.0193 0.4336 -0.1345 
FitB 0.4703 0.5471 -0.0286 
PP 0.3300 0.6726 0.0392 
OP 0.6257 0.3885 0.0216 
FnB 0.8215 0.2149 0.0307 
FxB 0.8399 0.1546 0.0059 
SE 0.6802 0.1020 -0.0933 
CE 0.5019 0.0135 -0.1306 


Finally, Table 3 shows the factor loadings for our 5-factor 
analysis. Similar to Table 2, we see a separation between 
TYPOS activity and Moodle activity. Further, the activities for 
each factor are confined to a single ICAP mode, or to modes within 
one level of each other. F1 contains mostly Constructive activities (as well as 
FxB), F2 contains Active → Constructive activities, F3 contains 
Passive activities, and F4 and F5 contain Interactive activities. In 
addition, F3 and F4 show a separation between Passive Moodle 
content consumption and Interactive assessment taking. Thus, 
from the results of our 5-factor analysis, we confirm the presence 
of activity silos within our students. 


Table 3: Loadings for 5 Factor Analysis for Completers. Values 
greater than 0.4 are in yellow and greater than 0.5 are in green. 


Activity F1 F2 F3 F4 F5 
MS 0.0224 -0.0065 0.2411 0.1061 -0.0150 
MV -0.0836 -0.1009 0.9838 -0.0982 -0.0264 
MA -0.0438 -0.1664 0.1175 0.9759 -0.0124 
TE 0.0364 0.3448 -0.0843 -0.2002 -0.0106 
FitB 0.4344 0.5629 -0.0451 -0.0445 0.0772 
PP 0.2740 0.7467 0.0004 -0.0193 0.0369 
OP 0.5520 0.4568 -0.0008 0.0282 0.1792 
FnB 0.8419 0.2321 0.0035 -0.0110 0.0829 
FxB 0.8759 0.1568 0.0058 -0.0255 0.0980 
SE 0.5963 0.1562 -0.0318 -0.0450 0.2602 
CE 0.3227 0.0549 -0.0631 -0.0088 0.8891 


3.4 Comparing Completers to Non-Completers 
Having shown the basic activity structures and identified relevant 
factors, we then chose to explore the difference between completer 
and non-completer students. We used the same methods for 
the non-completer group for comparison. There were 32 students 
that failed to complete our course. Non-completers made 8,442 
course interactions across 341 sessions, with an average of 10.7 (±7.8) 
sessions per non-completer. 


We first produced the same transition probabilities diagram 
for non-completer activity sessions, seen in Figure 2. Similar to 
completers, module assessment accounted for 31% of starting 
session behavior, TYPOS practice accounted for 49%, and lecture 
slides and videos (Content Consumption) accounted for 22%. 
Non-completers primarily operated within a single platform, though 
there were more interactions between Moodle and TYPOS. For 
example, 12% of MV’s transitions migrated to TYPOS exercises and 
5% of SE transitions migrated to MA. While completer students 
separated the SE → CE and OP → FnB → FxB transitions, these 
two sequences were combined for non-completers. However, this 
could potentially be due to differences in the amount of data. Both groups 
had similar population sizes, but non-completers did not complete 
each module assessment and so did not produce as many sessions. 


We then carried out the same factor analyses for non-completers. 
The result of Bartlett’s test was statistically significant 
(χ² = 997.8, p < 0.0001) and our KMO score was also 
adequate for analysis (0.77). Similar to Table 1, we found support 
for 3- and 5-factor analysis, seen in Table 4. We note that a 
6-factor analysis is also possible, but to mirror the factor analysis 
for completers, we elected not to pursue it. 


Table 4: Eigenvalues for Factor Analysis of Non-completers 


Factor | Eigenvalue | Cumulative Variance 
1 3.480694 26.26% 
2 1.503161 34.57% 
3 1.286916 41.39% 
4 0.888718 47.44% 
5 0.821608 54.51% 
6 0.719444 62.12% 
7 0.657629 66.31% 
8 0.520019 63.79% 
9 0.499220 57.46% 
10 0.375677 57.69% 
11 0.246915 57.69% 


Table 5 shows our 3-factor analysis for non-completers. The 
same activities have high loadings on F1 and F2 as in the 3-factor 
analysis for completers (Table 2), and they show similar platform 
silos. Likewise, the ICAP mode considerations are similar for 
each factor. F1 shows Constructive → Interactive behaviors, F2 
shows Active → Constructive behaviors, and F3 shows Passive 
Moodle interaction. 


Table 5: Loadings for 3 Factor Analysis for Non-completers. 


Activity F1 F2 F3 
MS 0.0362 0.0014 0.4167 
MV -0.0334 0.0106 0.6616 
MA -0.0136 -0.2524 0.2999 
TE 0.0756 0.4637 -0.0719 
FitB 0.3176 0.6024 -0.0202 
PP 0.1082 0.7404 0.0221 
OP 0.5310 0.3499 0.1105 
FnB 0.5573 0.4837 -0.0910 
FxB 0.6636 0.3369 -0.1218 
SE 0.8012 0.0568 0.0573 
CE 0.5900 0.0201 0.0056 


Table 6 shows our 5-factor analysis for non-completers. 
Non-completers maintained the Constructive → Interactive connection 
for F1, and F2 also maintains the Active → Constructive connection. 
The remaining factors do differ: F3 separated the FnB → FxB 
exercises from F1, and F4 focuses primarily on TE. The absence 
of MA was expected since course progression requires 
passing module assessments. From our analysis, we conclude that 
non-completers still operated within activity silos. 


Table 6: Loadings for 5 Factor Analysis for Non-completers. 


Activity F1 F2 F3 F4 F5 
MS 0.0271 0.0234 0.0080 -0.0129 0.4040 
MV -0.0191 0.0204 -0.0417 0.0373 0.7123 
MA 0.0049 -0.1705 -0.0558 -0.1637 0.2922 
TE 0.0541 0.2612 0.0605 0.9570 -0.0716 
FitB 0.2259 0.6333 0.1698 0.1275 -0.0557 
PP 0.0224 0.6734 0.1463 0.1925 -0.0155 
OP 0.4707 0.4584 0.1711 0.0080 0.0735 
FnB 0.2516 0.4546 0.6255 0.0378 -0.0543 
FxB 0.3618 0.2049 0.8444 0.0752 -0.0706 
SE 0.8072 0.0883 0.2525 0.0740 0.0496 
CE 0.6201 0.1006 0.1092 -0.0054 -0.0206 


Figure 2: Non-completer activity transition probabilities during sessions. 


4. DISCUSSION 


The results from our probability transition diagrams confirm the 
presence of both platform silos and activity silos in student work. 
They also serve to highlight areas where educators and researchers 
can tailor more appropriate learning paths for students and in par- 
ticular those students that may be struggling with course material. 


Our study allowed students to self-select which activities they 
wanted to focus their attention on. While this style of course 
design could be adopted, it does still present limitations. In particular, 
students that primarily focus on lower-level ICAP mode activities 
may be reluctant to move into higher-level modes. Instructors or 
systems that can identify this stagnant practice behavior could 
encourage students to move into higher-level ICAP modes. There is 
growing interest in the concept of nudge theory to “alter behavior 
without incentives or banning alternatives” [7] to encourage progression. 


Similarly, we presented students with a number of different 
activities, at different complexities, for their learning experience. 
Based on our results, students were more than willing to complete 
each type of exercise. Some students even asked for more 
activities in our post-course survey. While adding activities increases the 
workload for students and learning material creators, many of the activities 
we used are not overly complex and required a minimal amount 
of time to create or, from the students’ perspective, to complete. 
Activities like typing exercises or Parsons puzzles can be created 
from existing course materials and offer little incentive for students 
to cheat. They simply allow students an opportunity to practice 
the concepts they learned, rather than having those concepts passively 
given to them, refining their understanding before needing to apply it 
to problem-solving activities like coding exercises. 


5. LIMITATIONS 


We acknowledge some limitations with our study. First, our course 
ran during the COVID-19 pandemic, which has altered many 
individuals’ habits. Our population also contained non-traditional 
students who were balancing their studies with working from home 
and supporting other family members. Thus, non-completion may 
have been driven by external constraints that are not reflected 
in our dataset, and the observed habits may change somewhat 
during non-COVID times. 


Second, the exercise types were presented in a consistent man- 
ner for each module. Thus they were implicitly sequenced with 
lower-level ICAP modes appearing on the top. As we mentioned in 
our introduction, discerning the appropriate order for 11 different 
activities is a non-trivial matter and measuring the appropriate 
order of exercise types was not a part of our study. Thus, we 
presented exercises in an order that progressively increased the 
level of engagement. This may have influenced next practice 
selections by students. 


Finally, we acknowledge that the ICAP modes associated with 
each exercise type are somewhat subjective and open for discussion. 
Moreover the exact evaluation of exercises like Parsons Puzzles 
or self-explanation may require additional research and context. 
For the purposes of this study, when faced with uncertainty we 
classified exercises according to a higher level mode of interaction. 


6. CONCLUSIONS 


In this work, we extracted the practice and study session behaviors 
of non-traditional students learning Python. Both completers 
and non-completers of the course primarily focused on a single 
platform within a session. The activities within these platforms were mapped 
to the ICAP framework. Further, we used factor analyses to 
identify the presence of activity silos within practice sessions. 
Completers and non-completers shared similar behaviors during 
these practice sessions, primarily focusing on one or two modes 
of engagement and rarely ‘downgrading’ to lower-level modes. 


We can utilize these activity sequences to help shape our overall 
course designs for ensuring student learning. Lower-level activities 
can provide students with the foundational knowledge necessary 
as a part of the technical skills for the content, while higher-level 
activities can refine and encourage additional learning gains. As 
the research in this area expands, we hope the information pre- 
sented in this study encourages educators and researchers alike 
to provide practice in both levels and can serve as a guide for 
recommendations on how to best build long-term proficiencies. 


ABSTRACT 


In educational applications, Knowledge Tracing (KT) has 
been widely studied for decades as it is considered a funda- 
mental task towards adaptive online learning. Among pro- 
posed KT methods, Deep Knowledge Tracing (DKT) and 
its variants are by far the most effective ones due to the 
high flexibility of the neural network. However, DKT often 
ignores the inherent differences between students (e.g. mem- 
ory skills, reasoning skills, ...), averaging the performances 
of all students, leading to the lack of personalization, and 
therefore was considered insufficient for adaptive learning. 
To alleviate this problem, in this paper, we proposed Leveled 
Attentive KNowledge TrAcing (LANA), which firstly uses a 
novel student-related features extractor (SRFE) and pivot 
modules to distill and distinguish students’ unique inherent 
properties from their respective interactive sequences. Moreover, 
inspired by Item Response Theory (IRT), the interpretable 
Rasch model was used to cluster students by their 
ability levels, thereby utilizing leveled learning to assign 
different encoders to different groups of students. With the pivot 
module reconstructing the decoder for individual students 
and leveled learning specializing encoders for groups, 
personalized DKT was achieved. Extensive experiments conducted 
on two real-world large-scale datasets demonstrated that our 
proposed LANA improves the AUC score by at least 1.00% 
(i.e. EdNet ↑1.46% and RAIEd2020 ↑1.00%), substantially 
surpassing the other State-Of-The-Art KT methods. 


1. INTRODUCTION 


Knowledge Tracing (KT) aims to accurately retrieve a student’s 
knowledge state at a certain time from his or her past sequential 
exercising interactions. To evaluate KT’s performance, 
it is asked to predict the correctness of students’ future 
exercises with the retrieved knowledge states, as represented in Equation 1: 

$$P\left(r_{t+1}^{s_i} \mid e_{t+1}^{s_i,q_j},\, c^{q_j},\, I_{t}^{s_i},\, I_{t-1}^{s_i},\, \ldots,\, I_{1}^{s_i}\right), \qquad (1)$$

where $e_t^{s_i,q_j}$ denotes student $s_i \in \mathbb{N}^+$ answering 
question $q_j \in \mathbb{N}^+$ at discrete time step $t \in \mathbb{N}^+$, $c^{q_j}$ represents 
the contextual information of question $q_j$ (e.g. related 
concepts, part, etc.) [4], and $r_t^{s_i} \in \{0,1\}$ 
represents the correctness of student $s_i$’s answer to $q_j$ at time 
$t$. Additionally, the student’s interaction sequence is 
defined as $S_{t_0,t_1}^{s_i} = \{I_t^{s_i} \mid t_0 < t < t_1\}$, and $K$ is defined as 
$K_t^{s_i} = \{s_i, q_j, c^{q_j}, r_t^{s_i}\}$, referring to all features that 
participate in one interaction $I_t^{s_i}$, for later explanation. 


Traditionally, KT was regarded as a sequential behavior 
mining task [17], and therefore various methods established 
models with the theory of Bayesian probability (BKT [3]) 
and psycho-statistics (IRT [5]), providing excellent 
interpretability and good performance. Nevertheless, the recently 
proposed Deep Knowledge Tracing (DKT) and its variants 
significantly outperform other KT 
methods in metrics using Recurrent Neural Network (RNN) 
and Long Short Term Memory (LSTM [6]). However, DKT 
distinctly lacks personalization for students compared to 
BKT and IRT [15] [25], which are capable of separately train- 
ing unique models for each student, while DKT only trains 
a unified model for all students due to massive training data 
and abundant computing resources required by deep learn- 
ing. Hence, DKT weakly reflects the large inherent property 
(i.e. memory skills, reasoning skills, or even guessing skills) 
gaps between students. 


ASSUMPTION 1. For any interactive sequences satisfying 
$\|S_{t_0,t_1}^{s_i}\| > O \gg 1$, $\|K\| > U$ and $t_2 - t_1 > E$, $S_{t_0,t_1}^{s_i}$ can be 
distinguished from $S_{t_0,t_1}^{s_j}$ and $S_{t_2,t_3}^{s_i}$ respectively. 


Is it possible to bring personalization back to DKT? To an- 
swer this question, we observed that the proactive behavior 
sequence (i.e. interactive sequences) of each individual is 
unique and changeable over time. Hence, we argue that the 
minimal personalized unit in KT is “a student at a certain 
time $t_i$” instead of just “a student”, and a student’s inherent 
properties at time $t_i$ can be represented by his interactive 
sequences around time $t_i$ (Assumption 1). In such a way, 
these student-related features could tremendously help per- 
sonalize the KT process since they could be used to identify 
different students at different stages. Consequently, in our 
proposed Leveled Attentive KNowledge TrAcing (LANA), 
unique student-related features are distilled from students’ 
interactive sequence by a Student-Related Features Extrac- 
tor (SRFE). Moreover, inspired by BKT and IRT that assign 
completely different models to different students, LANA, as 
a DKT model, successfully achieves the same goal in a different 
manner. In detail, instead of separately training 
a model for each student like BKT and IRT, LANA learns to 
learn correlations between inputs and outputs by attending 
to the extracted student-related features, and thus becomes 
transformable for different students at different stages. More 
specifically, the transformation is accomplished using the pivot 
module and leveled learning, where the former is a model 
component that relies heavily on the SRFE, and the latter 
is a training mechanism that specializes encoders for 
groups whose ability levels are defined by the interpretable Rasch model. 
Formally, the LANA can be represented by: 


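$$\overbrace{r_t^{s_i} \sim \left(f(p_t^{s_i})\right)(h_t^{s_i})}^{\text{Adaptive by Pivot Module}};\quad \underbrace{p_t^{s_i} \sim k(h_t^{s_i}),\quad h_t^{s_i} \sim g(h_{t-1}^{s_i}, S_{0,t}^{s_i})}_{\text{Adaptive by Leveled Learning}} \qquad (2)$$

where $h_t^{s_i}$ refers to student $s_i$’s knowledge state at time $t$, 
and $f(\cdot)$ (decoder), $g(\cdot)$ (encoder) and $k(\cdot)$ (SRFE) 
are the three main modules that LANA seeks to learn. 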


2. METHODOLOGY 
2.1 Base Modifications 


There are mainly two base modifications in the LANA model 
(Figure 1) that were made to the basic transformer. Firstly, 
in the LANA model, the positional information (e.g. posi- 
tional encoding, positional embedding) was directly fed into 
the attention module with a private linear projection, in- 
stead of being added to the input embedding and shared 
the same linear projection matrix with other features in 
the input layer. Although prior experiments suggested 
that blending input embedding with positional information 
is effective, some recent work argued that when the 
model becomes deeper, it tends to “forget” the positional 
information fed into the first layer. Moreover, other 
work held that adding positional information to the 
input embedding and offering them to the attention module 
essentially makes them share the same linear projection 
matrix, which is not reasonable since the effects of the input 
embedding and the positional information are clearly 
distinct. For exactly the same reason, in the LANA model, 
multiple input embeddings (i.e. question ID embedding, 
student ID embedding, etc.) are concatenated instead of 
added, leading to the second base modification. Specifically, 
assume there are m input embeddings in total, each with 
a dimension of D/. Then after concatenating, the input 
embedding would have a total dimension of D™!. Hence, a 
D™! — D?! linear projection layer was used to map the con- 

Figure 1: The overall model architecture of LANA. 
There are mainly three differences compared to the vanilla 
transformer-based KT method [18]: I. modifications to 
the basic transformer model; II. the introduced SRFE; and III. 
the introduced PMA module and PC-FFN module, which are 
collectively referred to as the pivot module. 


catenated input embedding of dimension D™! to dimension 
DF, 


2.2 Student-Related Features Extractor (SRFE) 
Student-Related Features Extractor (SRFE) summarizes 
students’ inherent properties from their interactive sequences 
under Assumption 1 for the pivot module to personalize the 
parameters of the decoder. Specifically, SRFE contains an 
attention layer and several linear layers, where the 
attention layer is used to distill student-related features from 
the information provided by the encoder, and the linear 
layers are leveraged to refine and reshape these features. It is 
notable that in the LANA model there are primarily two 
SRFEs: memory-SRFE and performance-SRFE, where the 
former is utilized to derive students’ memory-related 
features for the PMA module (introduced later) and the 
latter is dedicated to distilling students’ performance-related 
features (e.g. logical thinking skill, reasoning skill, 
integration skill, etc.) for the PC-FFN module (also introduced 
later). The reshaping process is drawn in Figure 3 for 
better illustration, where bs, Nheads, seq and dpi refer 
to the model’s batch size, the number of attention 
heads [22], the length of the input sequence and the 
dimension of performance-related features, respectively. The intuition 
that memory-related features have a second dimension of 
Nheads comes from the theory that each attention head only 
pays attention to one perspective of the features. Thus it is 
reasonable that each student has different memory skills for 
different attention heads (e.g. for different concepts). 


2.3 Pivot Module 

Figure 2: The workflow of leveled learning: the interpretable 
Rasch model is leveraged to analyze students’ overall ability 
levels and then cluster students into multiple layers, 
where each layer respectively fine-tunes the LANA 
model with its own training data. 


Figure 3: The data shape transformation of the two SRFEs: 
Memory-SRFE and Performance-SRFE. 


Provided an ordinary input $x$, student-related features $p$ 
and a target output $y$, the pivot module learns the process of 
learning how to project $x$ to $y$ based on $p$, instead of simply 
learning to project $x$ to $y$ (i.e. the pivot module learns to learn), 
as shown in Equation 3: 

$$y = (f(p))(x), \qquad (3)$$

where $f(\cdot)$ is the function that the pivot module learns to 
learn. That is, the projection matrix of $x$ is adapted to $p$ 
instead of being fixed. To accomplish this dynamic mapping, 
the weight and bias applied to $x$ need to be a projection from $p$. 
Assuming $p \in \mathbb{R}^{D_p}$, $x \in \mathbb{R}^{D_x}$ and $y \in \mathbb{R}^{D_y}$, Equation 3 can 
be formally presented as Equation 4: 

$$y = W^x x + b^x, \qquad (4)$$

where $W^x \in \mathbb{R}^{D_y \times D_x}$ and $b^x \in \mathbb{R}^{D_y}$. Since $W^x$ and $b^x$ are 
derived from $p$, the detailed transformation is given 
in Equation 5, which is also depicted in Figure 4 for better 
illustration: 

$$W^x = W_1^p p + b_1^p, \qquad b^x = W_2^p p + b_2^p, \qquad (5)$$

where $W_1^p \in \mathbb{R}^{(D_y \times D_x) \times D_p}$, $b_1^p \in \mathbb{R}^{(D_y \times D_x)}$, $W_2^p \in \mathbb{R}^{D_y \times D_p}$ 
and $b_2^p \in \mathbb{R}^{D_y}$. 

By simplification, Equation 3 can be written as Equation 6, 
which we name PivotLinear$(x, p)$: 

$$y = (Wp)x + b = \mathrm{PivotLinear}(x, p), \qquad (6)$$

where $W \in \mathbb{R}^{D_y \times D_x \times D_p}$ and $b \in \mathbb{R}^{D_y}$. 
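
To make the PivotLinear mapping concrete, the sketch below evaluates y = (Wp)x + b with plain Python lists. It is an illustration only, not the authors’ implementation: the function name `pivot_linear`, the nested-list layout of W (indexed as output, input, feature) and all parameter values are assumptions chosen for the example.

```python
def pivot_linear(x, p, W, b):
    """Compute y = (W p) x + b for one input.

    x: list of length D_x, p: list of length D_p,
    W: nested list of shape D_y x D_x x D_p, b: list of length D_y.
    """
    y = []
    for W_i, b_i in zip(W, b):
        # Row i of the p-dependent weight matrix: sum_k W[i][j][k] * p[k]
        row = [sum(w * p_k for w, p_k in zip(W_ij, p)) for W_ij in W_i]
        y.append(sum(r * x_j for r, x_j in zip(row, x)) + b_i)
    return y


# Hypothetical parameters with D_y = D_x = D_p = 2
W = [[[1.0, 0.0], [0.0, 1.0]],
     [[0.0, 1.0], [1.0, 0.0]]]
b = [0.5, -0.5]
x = [2.0, 3.0]

print(pivot_linear(x, [1.0, 0.0], W, b))  # [2.5, 2.5]
print(pivot_linear(x, [0.0, 1.0], W, b))  # [3.5, 1.5]
```

The same input x is projected differently for the two feature vectors p, which is the behavior the pivot module relies on to adapt the decoder to individual students.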


In the LANA model, there are primarily two modules that 
pertain to the pivot module: Pivot Memory Attention (PMA) 
Module and Pivot Classification Feed Forward Network (PC- 
FFN) Module. In many methods [14], a Vanilla Memory 
Attention (VMA) module was employed to consider the 
“forgetting” behavior of students, which is pivotal in KT’s 
context since students are very likely to have done exercises similar 
to the one they are going to do, and if a student 
can remember the answers to previous similar exercises, 
the probability of correctly answering future related 
exercises increases greatly. Inspired by the Ebbinghaus 
Forgetting Curve and much previous work [4], 
the “forgetting” behavior of students is defined as exponentially 
decaying weights of the corresponding interactions in the timeline. 
In detail, in the original attention module, the weight 
of item $j$ on item $k$, i.e. $\alpha_{j,k}$, is determined by the sigmoid 
result of the similarity between item $j$ and item $k$: 

$$\alpha_{j,k} = \frac{\mathrm{sim}(j, k)}{\sum_{k'} \mathrm{sim}(j, k')}, \qquad (7)$$

where $\mathrm{sim}(\cdot)$ is a function that calculates the similarity between 
item $j$ and item $k$ by dot product. In order to take the “forgetting” 
behavior into account in $\alpha_{j,k}$ (e.g. the further away $k$ is 
from $j$, the lower the weight $\alpha_{j,k}$ should be), we replaced 
Equation 7 with Equation 8: 

$$\alpha_{j,k,m} = \frac{e^{-(\theta + m) \cdot \mathrm{dis}(j,k)} \cdot \mathrm{sim}(j, k)}{\sum_{k'} \mathrm{sim}(j, k')}, \qquad (8)$$

where $m$ is the student’s memory-related features extracted 
by the memory-SRFE, $\theta$ is a private learnable constant that 
describes all students’ average memory skill in the PMA module, 
and $\mathrm{dis}(\cdot)$ calculates the time distance between item 
$j$ and item $k$ (e.g. item $j$ is done $\mathrm{dis}(j,k)$ minutes after 
item $k$ is done). The reason for representing the memory 
skill with two learnable parameters is to reduce the difficulty 
of model convergence, since $m$ has a much longer 
back-propagation path than $\theta$. When $\theta$ is introduced to fit 
the average memory skill of all students, the distribution of 
$m$ becomes a Gaussian distribution, which makes the model 
much easier to learn. 


Figure 4: An illustration of the data transformation in the 
pivot module. 
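
The effect of the forgetting term can be sketched numerically. The code below assumes one possible reading of the forgetting-aware weight, namely the similarity multiplied by an exponential decay exp(-(θ + m) · dis) and normalized by the summed similarities; the function name `forgetting_weights` and all values are hypothetical.

```python
import math


def forgetting_weights(sims, dists, theta, m):
    """Attention weights of one item over earlier items.

    Assumed form: exp(-(theta + m) * dis_k) * sim_k / sum(sims),
    where theta is the shared average memory parameter and m is the
    student-specific memory feature.
    """
    total = sum(sims)
    return [math.exp(-(theta + m) * d) * s / total for s, d in zip(sims, dists)]


# Two equally similar earlier items, done 0 and 10 minutes before (hypothetical)
average = forgetting_weights([1.0, 1.0], [0.0, 10.0], theta=0.1, m=0.0)
forgetful = forgetting_weights([1.0, 1.0], [0.0, 10.0], theta=0.1, m=0.1)

print(average[0] == forgetful[0])  # True: no decay at distance 0
print(average[1] > forgetful[1])   # True: larger m means faster forgetting
```

Splitting the decay rate into θ and m means that m only has to capture how a student deviates from the average, which matches the convergence argument given above.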


On the other hand, PC-FFN is utilized to make the final 
prediction in reference to the performance-related features; 
it is essentially a PivotLinear module with dropout 
and an activation. The idea of this module comes from many 
investigations showing that the early layers in a deep neural network 
are often used as a feature extractor, while the later layers 
are often used as a decision-maker that decides which features 
are useful to the output of the model. As a result, these 
investigations point out that many models actually have 
similar early layers, and it is the later layers that make these 
models distinctive in usage. Consequently, PC-FFN in the 
LANA model is utilized as a personalized decision-maker 
to adaptively make the final prediction based on students’ 
distinctive inherent properties: 

$$\mathrm{PC\text{-}FFN}(x, p) = x + \mathrm{PivotLinear}(\mathrm{PivotLinear}(x, p), p), \qquad (9)$$

where $p$ is the student’s performance-related features extracted 
by the performance-SRFE. 


2.4 Leveled Learning 

While the pivot module enables the decoder to be transformable for
different students, the encoder and the SRFE of the LANA model, which
provide the necessary information for the pivot module, remain the
same for all students. This is not problematic if the input sequence
is long enough, since Assumption 1 assures that long sequences are
always distinguishable unless they belong to the same student in the
same time period. However, DKT, especially transformer-based DKT,
can only be fed the latest $n$ (commonly $n = 100$) interactions at
once due to the limited memory size and high computational
complexity. Consequently, it is possible for the encoder and SRFE to
output similar results for two different students, so that the
decoder fails to adapt. To alleviate this problem, it is natural to
think of assigning different students different encoders and SRFEs
that are highly specialized (sensitive) to their assigned students’
patterns. In practice, however, it is not feasible to train a unique
encoder for each individual student, considering both the limited
training time and the limited training data. As a result, a novel
leveled learning method (Figure 2) was proposed to address this
problem. It was initially inspired by the fine-tuning mechanism in
transfer learning [20]: we consider each student a unique task, and
we want to efficiently transfer a model that fits well on all
students to one student $s_j$.


Leveled learning holds the view that the earlier layers of a model
are similar for similar tasks. Thus, to save training time and
enlarge the training set, instead of training a unique encoder and
SRFE for each student on his private training data, students with
similar ability levels are grouped together, sharing their private
training data and having the same encoder and SRFE. Therefore, LANA
first uses an interpretable Rasch model to estimate the ability level
$a^{s_j}$ of each student $s_j$, and then groups students into
different independent layers $l_i$. Assuming that the ability
distribution of all students and that of the students at level $l_i$
are the Gaussian distributions $N(\mu_a, \sigma_a^2)$ and
$N(\mu_i, \sigma_i^2)$ respectively, we have Equation 10:


ps ae a (10) 


L 


In LANA, for simplicity, we consider that all layers share the same
variance $\sigma_i^2$ and that the difference of the mean $\mu_i$
between consecutive layers is a constant $\tau$. Hence, $\mu_i$ and
$\sigma_i^2$ are given by:

L-1 2 Oo Z 


[i = [a 5 xT+1iXT, w=: (11) 


where $L = \|l_i\|$ is the number of layers. With both $\mu_i$ and
$\sigma_i^2$ retrieved for every layer $l_i$, given a student’s
ability constant $a^{s_j}$, we can now calculate the probability of
$s_j$ being grouped into each layer by Equation 12:

$$p_i^{s_j} = \frac{\phi_i(a^{s_j})}{\sum_{i'} \phi_{i'}(a^{s_j})}, \qquad \phi_i(a^{s_j}) = \frac{1}{\sigma_i \sqrt{2\pi}} \, e^{-\frac{(a^{s_j} - \mu_i)^2}{2\sigma_i^2}}, \tag{12}$$

where $p_i^{s_j}$ is the probability of student $s_j$ being grouped
into layer $l_i$. As can be seen from Equation 12, students with
high ability levels are not necessarily grouped into layers with
high expected ability levels $\mu_i$. Rather, these high-ability
students only have a higher probability of being grouped into
high-ability layers than low-ability students do, which matches
reality (e.g. high-ability students may also come from normal
schools).
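
The grouping step can be illustrated with a few lines of code. The sketch below evaluates one Gaussian density per layer and normalizes the densities into membership probabilities, following the reading of Equation 12 given above. The three layer means, the shared standard deviation and the ability value are hypothetical numbers chosen only to make the example concrete.

```python
import math

def gaussian_pdf(a, mu, sigma):
    """Density of N(mu, sigma^2) evaluated at ability a."""
    z = (a - mu) / sigma
    return math.exp(-0.5 * z * z) / (sigma * math.sqrt(2.0 * math.pi))

def layer_probabilities(ability, layer_means, sigma):
    """Normalize the per-layer densities of one student into probabilities."""
    densities = [gaussian_pdf(ability, mu, sigma) for mu in layer_means]
    total = sum(densities)
    return [d / total for d in densities]

# Hypothetical setup: three layers whose means are spaced by a constant tau
# around the overall mean mu_a, all sharing the same standard deviation.
mu_a, tau, sigma = 0.0, 1.0, 1.0
layer_means = [mu_a - tau, mu_a, mu_a + tau]

probs = layer_probabilities(ability=0.8, layer_means=layer_means, sigma=sigma)
```

A student with ability 0.8 is most likely to be placed in the highest layer, but the other two layers keep a non-zero probability, which is the soft assignment described in the text.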


Then, the LANA model that has been pre-trained on all students was
duplicated $L$ times, and each cloned model $m_i$ was assigned to a
layer $l_i$ to be fine-tuned on $l_i$’s private training data by
weighted back-propagation:

$$loss_i = p_i \times loss(predict_i, target), \tag{13}$$

where $predict_i$ is the prediction of the model $m_i$.


While the training phase of leveled learning seems promising, its
inference phase raises two problems. The first problem is how to make
a prediction using multiple specialized models. In LANA, the
prediction was made by top-$k$ model fusion. In detail, when student
$s_j$’s future responses need to be predicted, LANA first computes
$p_i$ and then feeds $s_j$’s interactive sequence to all models $m_i$
that satisfy $p_i \in \text{top-}k(p)$, where $k$ needs to be set
manually to control the prediction time. Then, the outputs of these
models are multiplied by sigmoid($p_i$) to form the final prediction.
The workflow of leveled learning’s inference step is described in
Equation 14:


$$r_i = \sum \left( m_i(x) \times \frac{p_i}{\sum_{i'} p_{i'}} \right), \quad i \in \{\, i \mid p_i \in \text{top-}k(p) \,\}, \tag{14}$$
where $r_i$ is leveled learning’s final prediction and $x$ is the
input of the model. This workflow seems similar to an ensemble, where
multiple models are utilized to generate the final answer.
Nonetheless, the weights of the models in LANA are probabilities that
come from an interpretable Rasch model, so it is clear which model is
dominant for $x$. Moreover, unlike in an ensemble, where the role of
each model is ambiguous, in LANA every model has an explainable
effect (e.g. $l_L$ is committed to high-ability students, and
therefore a student with a large $p_L$ must be similar to the
high-ability students in $l_L$), suggesting that leveled learning
significantly outperforms ensembles in interpretability. A detailed
comparison is shown in Table 1.

Table 1: Comparison between leveled learning and ensemble

                    Leveled Learning        Ensemble
Sub-set Select      Psycho-statistics       Random
Interpretability    Good                    Bad
Predicting Time     Controllable (top-k)    Uncontrollable

On the other hand, the second problem of leveled learning is how to
compute $p_i$ for students that LANA has never met in training,
namely the “cold start” problem [24]. In the vanilla KT context, we
can only initialize newly arrived students’ ability levels to the
average ability level of all students. In practice, however, we can
estimate their ability levels more accurately by asking them to do a
couple of sample exercises or by using their ranking at school.

Footnote: In practice, if the number of layers is small, the
variances of the layers need to be manually measured and tuned based
on the targets. If the number of layers is large, then multiple
layers can be regarded as one layer, and therefore sharing the same
variance across all layers should be fine.
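
The top-k fusion step of Equation 14 can be sketched as follows. The sketch selects the k layers with the highest membership probability and averages the outputs of their models. The weighting used here (membership probabilities renormalized over the selected layers) is an assumption; the text describes the weight as sigmoid($p_i$), and the function name and numbers are hypothetical.

```python
def top_k_fusion(model_outputs, layer_probs, k):
    """Fuse the predictions of the k layers with the highest probability.

    model_outputs[i] is the prediction of model m_i for the same input x,
    layer_probs[i] is p_i for the student being predicted.
    """
    ranked = sorted(range(len(layer_probs)),
                    key=lambda i: layer_probs[i], reverse=True)
    selected = ranked[:k]
    # Assumed weighting: renormalize the probabilities of the selected layers.
    norm = sum(layer_probs[i] for i in selected)
    return sum(model_outputs[i] * layer_probs[i] / norm for i in selected)

# Hypothetical numbers: three specialized models predict the probability of a
# correct answer for one student.
outputs = [0.40, 0.60, 0.80]
probs = [0.05, 0.30, 0.65]
fused = top_k_fusion(outputs, probs, k=2)
```

Only the two most probable layers are evaluated here, so k directly bounds the number of forward passes, which is the sense in which the prediction time is controllable.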


3. EXPERIMENTS 


3.1 Experimental Setup 

In order to evaluate the effectiveness of the proposed LANA, we
applied it to two real-world large-scale datasets and compared it
with many other state-of-the-art (SOTA) KT methods. Specifically,
EdNet and RAIEd2020 were employed in our experiments. EdNet is
currently the largest publicly available benchmark dataset in the
education domain, consisting of over 90,000,000 interactions and
nearly 800,000 students. RAIEd2020 is a recently published real-world
dataset of approximately the same size as EdNet, with nearly
100,000,000 interactions and 400,000 students. In particular, the
average number of exercising interactions per student in RAIEd2020 is
double that of EdNet. Moreover, six KT methods that had previously
achieved SOTA performance took part in the comparison: DKT [16],
DKVMN [26], SAKT [13], SAINT [1], SAINT+ [18], and AKT [4]. In terms
of the basic experimental environment, all experiments were conducted
with PyTorch 1.6 on a Linux server equipped with an Nvidia V100 GPU.
For the hyper-parameter setup, the learning rate was set to 5e-4 with
the AdamW optimizer, the length of the input sequence was set to 100,
the batch size was set to 256, and other detailed configurations are
listed in our source code. The input features $x$ in EdNet contain
Question ID, Question part, Students’ responses, Time interval
between two consecutive interactions, and Elapsed time of an
interaction, whereas in RAIEd2020 a new feature is additionally added
to $x$, which indicates Whether or not the student checked the
correct answer to the previous question.
Finally, the Area Under the receiver operating characteristic Curve
(AUC) was used in our experiments as the performance metric, which
has been widely used in many other KT-related proposals.

https://github.com/Soptq/LANA-pytorch
https://pytorch.org


Table 2: The AUC comparison of different methods tested on the
EdNet and RAIEd2020 datasets

Dataset      Model          AUC
EdNet        DKT            0.7638_r
EdNet        DKVMN          0.7668_r
EdNet        SAKT           0.7663_r
EdNet        SAINT          0.7816
EdNet        SAINT+         0.7913
EdNet        SAINT+ & BM    0.7935
EdNet        LANA           0.8059
RAIEd2020    SAKT           0.7832
RAIEd2020    AKT            0.7901
RAIEd2020    SAINT+         0.7956
RAIEd2020    SAINT+ & BM    0.7991
RAIEd2020    LANA           0.8056


Table 3: Investigation of the effectiveness of different
improvements in LANA (PMA and PC-FFN together form the Pivot Module)

Dataset      Enabled (of BM, PMA, PC-FFN, LL)    AUC       Boost
EdNet        none                                0.7913    -
EdNet        ✓                                   0.7935    +0.0022
EdNet        ✓                                   0.7997    +0.0084
EdNet        ✓                                   0.7923    +0.0010
EdNet        ✓                                   0.7933    +0.0020
EdNet        ✓ ✓                                 0.8029    +0.0116
EdNet        ✓ ✓                                 0.8015    +0.0102
EdNet        ✓ ✓ ✓                               0.8038    +0.0125
EdNet        ✓ ✓ ✓                               0.8050    +0.0137
EdNet        ✓ ✓ ✓ ✓                             0.8059    +0.0146
RAIEd2020    none                                0.7956    -
RAIEd2020    ✓                                   0.7991    +0.0035
RAIEd2020    ✓                                   0.8020    +0.0064
RAIEd2020    ✓                                   0.7965    +0.0009
RAIEd2020    ✓                                   0.7977    +0.0021
RAIEd2020    ✓ ✓                                 0.8031    +0.0075
RAIEd2020    ✓ ✓                                 0.8027    +0.0071
RAIEd2020    ✓ ✓ ✓                               0.8035    +0.0079
RAIEd2020    ✓ ✓ ✓                               0.8051    +0.0095
RAIEd2020    ✓ ✓ ✓ ✓                             0.8056    +0.0100
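
For readers unfamiliar with the metric, the AUC used throughout this section can be computed directly from its pairwise definition: the probability that a randomly chosen correct response receives a higher predicted score than a randomly chosen incorrect one. The sketch below is a minimal quadratic-time illustration with hypothetical labels and scores; it is not the evaluation code used in the experiments.

```python
def auc(labels, scores):
    """AUC as the share of (positive, negative) pairs ranked correctly.

    Ties between a positive and a negative score count as half a win.
    """
    positives = [s for y, s in zip(labels, scores) if y == 1]
    negatives = [s for y, s in zip(labels, scores) if y == 0]
    if not positives or not negatives:
        raise ValueError("AUC needs at least one positive and one negative")
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(positives) * len(negatives))

# Hypothetical responses (1 = answered correctly) and predicted probabilities.
labels = [1, 0, 1, 1, 0]
scores = [0.9, 0.3, 0.6, 0.4, 0.5]
value = auc(labels, scores)
```

Five of the six positive-negative pairs are ordered correctly in this toy example, so the AUC is 5/6.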


For ease of explanation, hereinafter Base Modification
(Section 2.1), Pivot Module (Section 2.3) and Leveled Learning
(Section 2.4) are abbreviated as BM, PM and LL respectively.


3.2 Results And Analysis 

The overall experimental results of the different KT methods on the
different datasets are given in Table 2. Because we had successfully
reproduced the performance of SAINT and SAINT+ previously reported in
SAINT+’s paper (with considerable precision), the AUCs of the other
models are directly cited from that paper (labeled with subscript r).

From the comparison table, it can be seen that on both the EdNet and
RAIEd2020 datasets, LANA outperforms the previous SOTA method,
SAINT+, by 1.46 and 1.00 percentage points of AUC respectively,
verifying the effectiveness of our proposed improvements. Moreover,
LANA also surpasses SAINT+ & BM by 1.24 and 0.65 percentage points
respectively, suggesting that adaptability contributes most to
LANA’s AUC increment. Considering that the datasets used are by far
the two largest knowledge tracing datasets in the world, these
results provide strong evidence of the validity of the proposed LANA
method.

Figure 5: The visualization of intermediate features in SAINT+ (a)
and in LANA (b). Compared to (a), students in (b) (different colors)
are notably clustered (marked with arrows). The learning process of
student #3 over time in SAINT+ (c) and in LANA (d). Compared to (c),
a clear learning path appears in (d).
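
The gains quoted above are absolute AUC differences, and they can be checked against Table 2 with a few lines of arithmetic. The dictionary below simply copies the relevant AUC values from the table; the helper name is hypothetical.

```python
# AUC values copied from Table 2.
auc_table = {
    "EdNet": {"SAINT+": 0.7913, "SAINT+ & BM": 0.7935, "LANA": 0.8059},
    "RAIEd2020": {"SAINT+": 0.7956, "SAINT+ & BM": 0.7991, "LANA": 0.8056},
}

def gain_in_points(dataset, baseline, model="LANA"):
    """Absolute AUC gain of `model` over `baseline`, in percentage points."""
    difference = auc_table[dataset][model] - auc_table[dataset][baseline]
    return round(difference * 100, 2)

gains = {
    (dataset, baseline): gain_in_points(dataset, baseline)
    for dataset in auc_table
    for baseline in ("SAINT+", "SAINT+ & BM")
}
```

The four entries of `gains` reproduce the 1.46, 1.24, 1.00 and 0.65 point improvements reported in the text.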


3.3 Ablation Studies 


In this section, we investigated the effectiveness of each of our
proposed improvements: BM, which customizes the basic transformer
architecture; PM, which enables the decoder to adapt to the students’
personal characteristics; and LL, which interpretably specializes
encoders and SRFEs for better prediction performance. The results of
the ablation study are shown in Table 3.

The table shows that on EdNet, applying BM alone already improves the
AUC by approximately 0.2 percentage points, verifying the importance
of both the positional embedding and the personalized linear
projection for each input feature in the KT context. Meanwhile,
applying LL alone also benefits the model performance, by about 0.2%
compared to 0.1% with the vanilla ensemble. Considering that without
PM, LL would just perform fitting on students with different ability
levels, the performance gain from LL alone could be interpreted as a
reduction of the gaps in students’ inherent properties. Moreover,
BM + PM boosts the model performance by nearly 1.25 percentage
points, suggesting that PM makes proper use of the student-related
features extracted by the SRFE to adaptively reparameterize the
model’s decoder for different students at different stages, and
therefore contributes most to the final performance gain. Finally,
by combining all improvements, BM + PM + LL (i.e. LANA) achieves a
final AUC of 0.8059, outperforming the previous SOTA by 1.46
percentage points.


3.4 Features Visualization 

To illustrate the validity of the student-related feature
distillation in LANA, the intermediate features of 20 students from
the PC-FFN module were sampled to generate Figure 5 by t-SNE. In
Figure 5 (a) and (b), each sample represents intermediate features,
with different students shown in different colors, in SAINT+ and LANA
respectively. It can be seen that in SAINT+, samples are almost
randomly distributed, indicating that the correlation between samples
of the same student is no stronger than that between samples of
different students, because students’ personalities are ignored. On
the other hand, in LANA, clusters of samples (marked with arrows)
have notably appeared in comparison to (a). Thus, we concluded that
LANA is capable of extracting student-related features from their
interactive sequences and summarizing the similarities and
differences, which eventually results in more distinguishable
features for the final classifier.


Furthermore, we individually visualized the samples of student #3
(randomly picked) along the time axis to investigate the
transitioning pattern of features in Figure 5 (c) (SAINT+) and (d)
(LANA). In (c), there is no clear pattern in the change of features
over time, while in (d), a clear transitioning path can be noticed.
Since many other students share the same pattern in LANA, we argue
that it represents the trajectory of the student’s ability changing
with more and more exercising; namely, it is the learning path of
the student. Consequently, we contend that it is potentially helpful
for other applications, such as learning stage transfer and learning
path recommendation.


4. CONCLUSION 


In this paper, we proposed a novel Leveled Attentive KNowledge
TrAcing (LANA) method that is committed to bringing adaptability
back to DKT. Instead of directly learning the model parameters of
different students, LANA distills students’ inherent properties from
their respective interactive sequences with a novel SRFE, and learns
a function to reparameterize the model with these extracted
student-related features. To this end, an innovative pivot module
was proposed to produce an adaptive decoder. Besides, a novel leveled
learning training mechanism was introduced to cluster students by the
ability level defined by an interpretable Rasch model, which not only
specializes the encoder and therefore enhances the significance of
students’ latent features, but also saves much training time.
Extensive experiments on the two largest public benchmark datasets in
the education domain support the feasibility and effectiveness of the
proposed LANA. Feature visualization also suggests additional uses of
LANA, such as learning stage transfer or learning path
recommendation.


ABSTRACT


Automatic discovery of information in educational data has been
broadening its horizons, opening new opportunities for its
application. A wide-open area to explore is the recommendation of
undergraduate programs to high school students. However, traditional
recommendation systems based on collaborative filtering require
large numbers of both items and users, and in this context those
numbers are too small to guarantee reasonable levels of performance.


In this paper, we propose a hybrid approach that combines
collaborative filtering with a content-based architecture, while
exploring the hierarchical information about how programs are
organized. This information is extracted from course programs
through natural language processing. Since programs share some
courses, we are able to present recommendations based not just on
the performance of students, but also on their interests and their
results in each of the courses that compose each program.


Keywords 
Recommendation systems, higher education programs, edu- 
cational data mining 


1. INTRODUCTION 


Nowadays, it is common to have teenagers applying to a 
higher education program after finishing their high school. 
Every year, new programs appear and thousands of candi- 
dates must choose which one is the best for them. 


This type of problem is well known in the Educational Data Mining
and Recommendation Systems communities [3, 11]. Over the past
decade, many studies have built engines that help students choose
the courses best suited to them, using different approaches such as
content-based or collaborative filtering recommendation systems. The
latter is the most widely used, owing to the large amount of data
the community can provide.

Although course recommendation is the more studied problem, we want
to apply these systems to program recommendation, which has not yet
been researched much. This brings an important challenge: course
recommenders already have the target user inside the system, rating
previous courses alongside the other students, whereas in our
problem candidates have not rated anything that could be compared
with other users in the first place.


Considering all of these aspects, our work aims to create a
recommendation system that receives candidates’ personal data and
high-school academic records, with their proper consent under the
General Data Protection Regulation (GDPR), and outputs the programs
that best fit their profile in comparison with the current student
community. The system considers the personal characteristics of the
students as a matching measure, and uses the programs’ courses,
objectives and descriptions to find keywords that define the
corresponding programs. These keywords allow us to compute ratings
for every program, considering the academic marks of the students in
their own program.


The rest of this paper is divided into four sections. The literature
review covers the basic aspects of recommendation systems, with a
special focus on their use for educational purposes. After this, we
present the architecture of our system, which can be applied to a
common university structure. Current results are then shown,
followed by the conclusions reached so far.


2. LITERATURE REVIEW 


Recommendation Systems (RS) are software tools and techniques that
provide suggestions for items likely to be of use to a user [10]. An
RS can be exploited for different purposes, such as increasing the
number of items sold, better understanding what the user wants or,
from another point of view, recommending a specific item to that
user.


There are two main types of recommendation engines: content-based
and collaborative filtering. The first focuses on item similarities,
and the second uses the past behavior of users to recommend items to
the active user [1].


There is also a third type of recommendation system, the
knowledge-based approach, where recommendations are given based on
an explicit specification of the kind of content the user wants.
These systems are very similar to content-based ones, but with
domain knowledge as input. Finally, a hybrid recommendation system
is constructed when two or more RS philosophies are combined in
order to improve the global performance.


Over the years, a large amount of educational data has been
generated, and collaborative filtering approaches have been applied
more often than content-based methods in this area.


Morsomme and Alferez proposed a collaborative recommen- 
dation system that outputs courses to the target users, by 
exploiting courses that other similar students had taken, 
through k-means clustering and K-nearest neighbors tech- 
niques [2]. 


A recommendation system for course selection was developed for the
Liberal Arts bachelor of University College Maastricht [6], using
two types of data: students and courses. Student data consisted of
anonymized course enrollments, and course data consisted of
catalogues with descriptions of all courses, from which the topics
of each course were found using the Latent Dirichlet Allocation
statistical model. Using regression models on the student data, the
authors could predict a student’s grade for each course. In the end,
the system outputs the 20 courses whose content best matches the
user’s academic interest in terms of Kullback-Leibler distance.
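
To make the last step concrete, the sketch below ranks courses by the Kullback-Leibler divergence between a student’s interest distribution over topics and each course’s topic distribution. The direction of the divergence, the smoothing constant, and the toy distributions are all assumptions made for illustration; the cited work may differ in these details.

```python
import math

def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) for two discrete distributions over the same topics."""
    return sum(pi * math.log((pi + eps) / (qi + eps))
               for pi, qi in zip(p, q) if pi > 0)

def rank_courses(interest, course_topics, top_n=20):
    """Return the course names closest to the interest profile."""
    ordered = sorted(course_topics,
                     key=lambda c: kl_divergence(interest, course_topics[c]))
    return ordered[:top_n]

# Hypothetical topic distributions over three topics.
interest = [0.7, 0.2, 0.1]
course_topics = {
    "Statistics": [0.6, 0.3, 0.1],
    "Painting": [0.1, 0.1, 0.8],
    "Algorithms": [0.7, 0.2, 0.1],
}
ranking = rank_courses(interest, course_topics)
```

A course whose topic distribution equals the interest profile has zero divergence and is ranked first.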


This content-based approach was applied as well in Dublin 
[8], where the authors used an information retrieval algo- 
rithm to compute course-course similarities, based on the 
text description and learning outcomes of each one. 


At the Faculty of Engineering of the University of Porto, an engine
was created to help students choose an adequate higher education
program for a specific future job [5]. The recommendation system
uses data from alumni and job offers, and outputs a ranking of
programs that could lead to the candidates’ desired careers. The
collaborative filtering approach matches the skills needed for that
job with the skills given to the students of a specific degree.


Fabio Carballo built an engine that predicts students’ marks in
master’s courses, using collaborative filtering methods, singular
value decomposition (SVD) and as-soon-as-possible (ASAP)
classifiers. With this work, he could recommend the most suitable
program for a student’s skills [4].


The topic of course and program recommendation has gained even more
attention recently, with several studies published in the last few
years following a variety of approaches [13, 14, 7, 9, 12].


3. RECOMMENDATION SYSTEM 


The proposed system shall inform candidates about the degrees that
are most compatible with their interests and that were successfully
completed by similar students, using a hybrid approach.


Our system must recommend higher education programs to a specific
high-school student who wants to enroll at university. Usually, the
candidate searches for information about each program on
universities’ web pages, such as courses or professional careers, or
talks with students who are already enrolled in the programs he or
she likes. The process of choosing a degree is very important to a
high school student, and it must be done by analysing all the
information available. Therefore, the main use case of our system
focuses on the candidate’s point of view.


As we can see in Figure 1, when the candidate uses our system, he or
she must be able to give personal data that will be considered
during the recommendation process. After that, the system must
output a ranking of the programs that are most suitable for the
candidate. The candidate’s personal data can include academic
interests, high school grades, and personal attributes such as age
or gender, among others. Since we are collecting data, this must be
done in accordance with the GDPR, applying anonymization techniques
when necessary.


Figure 1: Use cases of the system.

Looking at the system from the Admin’s point of view, there are
several tasks he or she must be able to do, as we can see in
Figure 1. The System Administrator is responsible for system
updates: uploading new students’ data every year, uploading
students’ grades at the end of each semester, and updating programs
and courses when necessary. All the data essential to relate the
candidate to current students and to make proper recommendations
must be entered before the system is launched.


Finally, analyst staff can use this system, when useful, to get a
summary of the student community and a characterization of new
students.


3.1 Architecture 

The overview of our system architecture can be seen in Figure 2,
where we can distinguish two main modules: Students Profiler and
Programs Recommender.


Candidates start using our system by inputting their personal data,
which is used to find their profile. Current students’ data allow us
to compute the candidate profiles that feed the second module.
Programs Recommender uses this output to estimate a program success
measure from estimated grades, returning in the end a ranking of the
most suitable programs for the candidate.


Candidate Data -> Students Profiler -> Candidate Profile -> Programs Recommender -> Recommended Programs


Figure 2: Proposed architecture.

3.1.1 Students Profiler


There is a major difference between our recommendation system and
common ones, where the target user is inside the system among the
other users. Here, the target user, the candidate, is not in the
system, since he or she is not yet enrolled in a higher education
degree and therefore cannot rate programs or take courses. Hence, a
strategy that allows us to compare users must be developed.


Students Profiler computes the candidate profile as if he or she
were inside the system, by comparing the candidate with current
students using shared personal variables. Therefore, the first step
was to collect these data and to build a student profiling model,
for which we performed a feature engineering study.


A simple choice for implementing Students Profiler is to apply the
K Nearest Neighbors (KNN) method, after choosing the best similarity
measure and number of neighbors, K. To compute the similarity, we
used five different measures, whose results were studied. We also
tuned the KNN process by searching for the optimal value of K, i.e.
the one with the minimum error rate.


In the end, Students Profiler returns the candidate profile that is
fed to the next module.
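
A minimal sketch of this profiling step is given below. It finds the K current students closest to the candidate under a chosen distance and averages their course grades, which is how the candidate profile is described later in the paper. The data layout, the function names and the toy values are hypothetical, and only three of the five measures are shown.

```python
import math

def euclidean(u, v):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))

def manhattan(u, v):
    return sum(abs(a - b) for a, b in zip(u, v))

def chebyshev(u, v):
    return max(abs(a - b) for a, b in zip(u, v))

def candidate_profile(candidate, students, grades, k, distance=euclidean):
    """Average the course grades of the k students closest to the candidate.

    students: {student_id: feature vector}
    grades:   {student_id: {course: grade}}
    A course is averaged only over the neighbours who actually took it.
    """
    neighbours = sorted(students,
                        key=lambda s: distance(candidate, students[s]))[:k]
    totals, counts = {}, {}
    for s in neighbours:
        for course, grade in grades[s].items():
            totals[course] = totals.get(course, 0.0) + grade
            counts[course] = counts.get(course, 0) + 1
    return {course: totals[course] / counts[course] for course in totals}

# Hypothetical toy data: two personal variables per student, 0-20 grades.
students = {"s1": [17.0, 15.0], "s2": [16.0, 14.0], "s3": [10.0, 9.0]}
grades = {
    "s1": {"Maths": 16.0, "Physics": 14.0},
    "s2": {"Maths": 14.0},
    "s3": {"Maths": 9.0, "Drawing": 18.0},
}
profile = candidate_profile([16.5, 14.5], students, grades, k=2)
```

Swapping the `distance` argument is enough to compare similarity measures, so the same function can be reused when tuning the pair (K, similarity measure).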


3.1.2 Programs Recommender 

The Programs Recommender module is more complex: it aims to find the
ranking of the best programs for the candidates, considering their
profile and interests.


In order to reach its goal, this module has to create two models:
the Grades Model, which estimates the candidate’s performance in
each possible academic unit, and the Ranking Model, which maps
students to programs.


As usual, the Grades Model follows a collaborative filtering
approach, meaning that it uses a singular value decomposition (SVD)
matrix factorization. This factorization performs a feature
extraction step, reducing the number of elements to the minimum
required for estimating students’ grades. Given the candidate
profile, the Grades Model is applied to estimate the candidate’s
grades. Using the candidate profile instead of the candidate’s
original data is the first difference in our approach; the second is
achieved through the use of a content-based approach.
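
The following sketch illustrates the general idea of a latent-factor Grades Model. It fits student and course factors to observed grades by stochastic gradient descent, a technique that is often labelled “SVD” in the recommender-systems literature. Whether the system described here uses this variant or an exact decomposition is not stated, so the algorithm, its hyper-parameters and the toy grades are all assumptions.

```python
import random

def fit_latent_factors(observed, n_factors=2, epochs=500, lr=0.01,
                       reg=0.02, seed=0):
    """Fit student and course factors to observed grades by SGD.

    observed: list of (student, course, grade) triples.
    Returns a function predict(student, course).
    """
    rng = random.Random(seed)
    mean = sum(g for _, _, g in observed) / len(observed)
    students = sorted({s for s, _, _ in observed})
    courses = sorted({c for _, c, _ in observed})
    p = {s: [rng.uniform(-0.1, 0.1) for _ in range(n_factors)] for s in students}
    q = {c: [rng.uniform(-0.1, 0.1) for _ in range(n_factors)] for c in courses}

    def predict(student, course):
        # Global mean plus the dot product of the two factor vectors.
        return mean + sum(a * b for a, b in zip(p[student], q[course]))

    for _ in range(epochs):
        for s, c, g in observed:
            err = g - predict(s, c)
            for f in range(n_factors):
                ps, qc = p[s][f], q[c][f]
                p[s][f] += lr * (err * qc - reg * ps)
                q[c][f] += lr * (err * ps - reg * qc)
    return predict

# Hypothetical grades on a 0-20 scale.
observed = [
    ("ana", "Maths", 16.0), ("ana", "Physics", 15.0),
    ("rui", "Maths", 11.0), ("rui", "Physics", 10.0), ("rui", "Drawing", 16.0),
    ("eva", "Maths", 15.0), ("eva", "Drawing", 12.0),
]
predict = fit_latent_factors(observed)
mean_grade = sum(g for _, _, g in observed) / len(observed)
baseline_mae = sum(abs(g - mean_grade) for _, _, g in observed) / len(observed)
fitted_mae = sum(abs(g - predict(s, c)) for s, c, g in observed) / len(observed)
```

Once fitted, `predict("ana", "Drawing")` gives an estimate for a course the student has not taken, which is the kind of output a grades model needs to provide for a candidate profile.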


RS usually deal with a very large number of items, but the number of
programs available at any university is small by comparison.
Additionally, each student is enrolled in just one program, which
means that our grades matrix would be very sparse and would not
contribute to a good recommendation. A third aspect is that programs
share some courses (for example, all engineering students study
Physics and Maths, while all art students study Drawing and
Geometry). We can go a step further and recognize that courses cover
topics present in different areas. For example, several engineering
courses study systems, their architecture and their dynamics.


The third proposal is the possibility of dealing with the academic
units at different levels of granularity: we can aggregate
everything to recommend programs, or we can simply identify a
ranking of topics that are recommended for the candidate. This
ability is very important for reaching a new level of
explainability, which is much needed in the field.


4. PRELIMINARY RESULTS 


Validating a recommendation system is a hard task. In contexts like
education, where these systems cannot be made available before being
proved ‘correct’, this task is even harder.


In our case, we made use of students’ data collected at the time of
their enrollment at the university to mimic candidates’ surveys. We
used students’ data from 2014 to 2018 for training and data from
2019 for evaluation. Moreover, every model of the system has to be
validated independently, in order to better estimate the performance
of each component; only after tuning each of them do we evaluate the
global quality of the system.


We started by evaluating the Students Profiler module, which is
based on KNN to estimate the candidate profile. As data sources for
this phase, we had personal data from 7918 students and grades from
7302 students, which resulted in a dataset of 7300 instances after
intersecting the two. This dataset is composed of 101 variables, of
which the enrolled program is the only categorical one; all the
others are numeric. Note that there were no missing values in the
dataset.


In this module, we wanted to find the K students that are most
similar to the candidate. Therefore, we carried out a study to find
the best pair (K, similarity measure), mimicking a KNN performance
study but without focusing on the classification task. First, we
needed to define which condition students must meet to be considered
successful in their program, based on their Grade Point Average
(GPA) on a 0-20 scale. Hence, a histogram was made; it is shown in
Figure 3.


Figure 3: Number of students by each GPA class. 


Since the average student GPA is 12.99, we labeled students whose
GPA was 13 or more as successful, and the others as not successful.
This way we obtained a balanced dataset. After the labelling, we ran
ten trials of train-test splits for five similarity measures
(chebyshev, correlation, cosine, euclidean, and manhattan) and for K
between 5 and 155 in multiples of 5. For each pair (K, similarity
measure), we computed the average accuracy of the KNN models over
the ten trials, since the 70% train and 30% test sets are random in
each trial. The results are shown in Figure 4, and zoomed in
Figure 5.


Figure 4: K and similarity measure study. (Legend: distance;
chebyshev, correlation, cosine, euclidean, manhattan.)


Figure 5: Zoom of K and similarity measure study.


Students Profiler module has five conditions that will be 
tested in the global system: (120, chebyshev); (100, correla- 
tion); (90, cosine); (30, euclidean) and (20, manhattan). 


We also implemented a simple recommendation system
that uses the candidate profile, composed of the average
grades of all neighbors for all courses taken by them,
to predict the candidate's grades for all available courses. In
this component, four conditions were used to test the
system's behaviour for all similarity measures: A) using SVD
as the matrix factorization technique with the K values mentioned
above and considering all the variables from the students
data; B) same as A), but using K equal to 5; C) using SVD
with the best K values predicted using a reduced students
dataset with only academic records; and D) same as C) but
using the Slope One prediction method.
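
The candidate profile mentioned above can be illustrated with a short sketch. It assumes that each neighbor is represented as a mapping from course name to grade and that a course's average is taken only over the neighbors who actually took it; both the data layout and the function name are hypothetical.

```python
def candidate_profile(neighbour_grades):
    """Average grade per course over the neighbours who took that course.

    neighbour_grades: list of dicts mapping course name -> grade (0-20 scale),
    one dict per neighbour returned by the KNN step.
    """
    totals, counts = {}, {}
    for grades in neighbour_grades:
        for course, grade in grades.items():
            totals[course] = totals.get(course, 0.0) + grade
            counts[course] = counts.get(course, 0) + 1
    return {course: totals[course] / counts[course] for course in totals}
```

The resulting profile is what a method such as SVD or Slope One would then take as input to predict grades for courses the neighbors did not cover.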


After that, we used 1509 candidates to test the system:
we computed the GPA that each of them would have
in each of the available programs using their predicted
course grades, and ranked the programs by GPA. Then, we computed
the mean absolute error, in terms of GPA, for those candidates
whose first recommended program coincides with their current
program; the results are shown in Table 1.


Table 1: Mean Absolute Errors for each prediction method
and for each similarity measure

Similarity Measure | A     | B     | C     | D
chebyshev          | 2.065 | 2.378 | 2.139 | 2.289
correlation        | 2.149 | 2.580 | 2.359 | 2.325
cosine             | 2.153 | 2.583 | 2.430 | 2.451
euclidean          | 2.538 | 2.497 | 2.153 | 2.289
manhattan          | 2.313 | 2.488 | 2.376 | 2.451


The next steps will consist of improving the way we recom- 
mend the programs and its ranking model, considering dif- 
ferent ensembles, namely random forests and gradient boost- 
ing. At this time, we are predicting GPA with almost 90% 
accuracy. 


5. CONCLUSIONS 


The current educational context, even more so since the beginning
of the pandemic, demands new educational
systems: systems able to address the difficulties inherent
to distance learning contexts, where students are far from
educators and often try to follow their path without
any guidance. Most of the time, online education tools
deal with students in a ‘one-size-fits-all’ approach that ignores
each student's preferences.


In this paper, we propose a new architecture for a recommendation
system, designed for suggesting programs to university
candidates. Our system benefits from a hybrid architecture
that combines collaborative filtering with a content-based
philosophy, exploring the full documentation of programs
and courses available. Additionally, we explored the
notion of feature stores to easily update the data repositories
that support our system.


The proposed architecture is adaptable to smaller contexts,
for example for suggesting learning resources at any abstraction
level, such as exercises.
ABSTRACT


Essays test student knowledge on a deeper level than short- 
answer and multiple-choice questions but are more labori- 
ous to evaluate. Automatic clustering of essays, or their 
fragments, prior to evaluation may reduce the manual effort 
required. Such clustering presents numerous challenges due 
to the variability and ambiguity of natural language. In this 
paper, we introduce two datasets of undergraduate student 
essays in Finnish, manually annotated for salient arguments 
on the sentence level. Using these datasets, we evaluate sev- 
eral deep-learning embedding methods for their suitability 
to sentence clustering in support of essay grading. We find 
the suitable method choice to depend on the nature of the 
exam question and the answers, with deep-learning methods 
being capable of, but not guaranteeing, better performance
over simpler methods based on lexical overlap. 


Keywords 
deep learning, essay clustering, text similarity, paraphrase, 
grading support 


1. INTRODUCTION 

Essay-type questions have been shown to help with the retention
of learned material but are time- and labour-consuming
to evaluate. Computational methods can be used
to grade essays, or to assist in their evaluation. Examples of 
the latter approach include pre-processing to show statistics 
of student answers such as average answer length and key- 
words fq], comparing student answers to a given text fr], 
generating word clouds of student answers [6], and grouping
student answers into clusters of similar answers [2]. Most of 
these systems target the pre-processing and analysis of short 
answers, and less effort has been dedicated to computer- 
aided assessment of longer essays. One approach to reducing 
human effort in fact-based student essay assessment compu- 
tationally would be to identify similar arguments in student 
essays. This approach draws inspiration from qualitative research
methods where interviews are first transcribed verbatim,
and categories are then formed and themes are created
[5]. By identifying recurring arguments across a cohort of 
essays, it is expected that human grading effort could be re- 
duced, much like the analysis of interviews is made simpler 
after forming categories. 


In this paper, we evaluate the applicability of several representative
deep learning methods to the task of identifying
distinctly phrased but semantically near-equivalent segments
of student essays.¹ We approach the task from two
angles. The first treats it as an information retrieval (IR) problem:
given a query text, e.g. a reference answer or an essay, the
task is to retrieve the matching essays from the cohort
and establish their mutual correspondence down to the sentence
level. The second approach is that of clustering, where the
objective is to discover groups of sentence-long segments
with the same meaning in the essay cohort. We test several
algorithms, including TF-IDF [7], LASER [1], BERT [4],
and Sentence-BERT [13]. To evaluate these algorithms, we
gather and annotate two sets of factual essays written in
exams by Finnish university students.


2. DATASETS 


We collected Finnish essays written by bachelor’s level stu- 
dents as answers to exam questions. Two sets of essays re- 
plying to questions from two courses were selected for man- 
ual annotation. The annotator was a PhD student from a 
different discipline than the domain of the essays. The goal 
of the annotation was to identify similar arguments in sep- 
arate essays. The data were annotated by cross-referencing 
the arguments found in every essay, and assigning textual la- 
bels to recurring arguments or concepts on a sentence level. 
Specifically, all essays were first segmented into sentences, 
and each sentence was then assigned zero or more textual 
labels representing its content. If an argument appears more 
than once, it is given a distinct label which is assigned to 
all sentences containing that argument. For an argument to 
be considered recurring, the two sentences are required to 
clearly aim to communicate the same information about a 
common subject matter. An example of two sentences that 
are considered to have the same argument (on the pros and 
cons of group interviews in research): “It is not the quieter 
and more timid individuals that come out, but the loudest 
ones come to the fore.” and “In a group interview, there is a 
danger that some will talk too much and some will not have 
a turn to speak at all.” Both of these sentences describe 


¹We refer readers to https://arxiv.org/abs/2104.11556
for a more detailed version of the paper.
Table 1: Dataset statistics

                                | Research methods | Accounting standards
No. of essays                   | 47               | 10
Total no. of sentences          | 486              | 158
No. of labels                   | 59               | 34
Avg. no. of labels per sentence | 1.29             | 0.82


the imbalance of expression of opinions in group interviews. 
In the next example, however, the two sentences are consid- 
ered to have different arguments, despite both of them being 
related to the role of trust in interviews. “In interviews, a 
trusting relationship must be established between the inter- 
viewee and the interviewer, which can be challenging.” and 
“If the interviewee remains anonymous, one can also openly
discuss more sensitive topics, especially when one is alone 
with the interviewer.” This is because the two sentences
make opposing arguments: the former views establishing
trust in interviews as a challenge, while the
latter takes a positive perspective on it. Clearly, these communicate
different information. For each dataset, the number of labels
thus depends on the number of recurring arguments in the 
essays, and the annotation scheme differs. We estimate that 
the development of the annotation scheme and the annota- 
tion effort required about two person-weeks in total. We 
note that we do not expect to annotate all sets of essays
that are to be evaluated. Instead, these two sets of annota- 
tions serve as benchmarks for testing ideas on automatically 
assisting essay evaluation. The two resulting datasets are 
introduced below. The key statistics of the two datasets are 
summarized in Table 1, and the distribution of the labels in
the two datasets is illustrated in Figure 1 in the Appendix.


2.1 Research methods dataset 


The first dataset is created from student essays from the 
course “Research process and qualitative research methods” 
(henceforth Research methods). The essays answer the ques- 
tion, “Consider the positive and negative aspects of inter- 
views”. Several main points are frequently mentioned by 
students: for example, almost all students discussed how 
time consuming interviews can be (label time_consuming). 
93% of the dataset sentences have at least one label, indi- 
cating that the great majority of sentences involve at least 
one argument repeated in other essays. 


2.2 Accounting standards dataset 

The second dataset consists of student essays from the course 
titled “IAS/IFRS accounting standards” (henceforth Account- 
ing standards). The essay prompt is “What are the compo- 
nents of IFRS financial statements? Consider the signifi- 
cance of the various components in the light of the qualita- 
tive criteria for the financial statement information”. The 
label distribution of this dataset is more even, and almost a
third of the sentences do not have a label. This may be because
there are fewer essays in this dataset: with fewer essays,
any given main argument is less likely to be
mentioned by somebody else as well.


3. SENTENCE REPRESENTATIONS 


To identify sentences with similar arguments, we consider a
set of methods for representing each sentence with a vector,
which allows efficient computation of sentence similarity via
the similarity of their vectors. As baselines, TF-IDF vectors
and the average of word embeddings are used for sentence
representation. For deep learning methods, the encoders LASER,
BERT, and Sentence-BERT are tested. The similarity measure
used is the cosine similarity between two sentence vectors,
a standard metric also applied in previous studies.
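
As a minimal sketch of this comparison step, the cosine similarity of two dense sentence vectors and the ranking of candidates against a query can be written as follows. The function names are hypothetical, and returning 0.0 for an all-zero vector is an assumption made only to keep the example total.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0  # assumption: a zero vector is treated as similar to nothing
    return dot / (norm_u * norm_v)


def most_similar(query, candidates):
    """Indices of candidate vectors, ordered from most to least similar to query."""
    return sorted(range(len(candidates)),
                  key=lambda i: cosine_similarity(query, candidates[i]),
                  reverse=True)
```

Because the measure depends only on the angle between vectors, it can be applied unchanged to every representation considered in this section.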


3.1 TF-IDF 


Term frequency-inverse document frequency (TF-IDF) is a
family of popular IR metrics that estimate the importance
of a given word in a document from a document collection
based on the number of times the word appears in the document
(term frequency) and the inverse of the number of
documents the word appears in (document frequency) [7].
TF-IDF can be applied to words or character sequences.
For this baseline, all the tokens in a sentence are first lemmatized
using the Universal Lemmatizer [3]. Character n-grams,
specifically bigrams, trigrams, 4-grams and 5-grams, are then created
from the text inside word boundaries. We note that the
TF-IDF encoding generates sparse high-dimensional vectors
in which there is no inherent similarity between words.
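
The character n-gram baseline can be sketched as below. This is an illustration under assumptions rather than the configuration used in the paper: lemmatization is skipped, whitespace is taken as the word boundary, and a smoothed IDF variant is used because the exact weighting scheme is not stated. All function names are hypothetical.

```python
import math
from collections import Counter


def char_ngrams(sentence, n_values=(2, 3, 4, 5)):
    """Character n-grams taken inside each whitespace-delimited word."""
    grams = []
    for word in sentence.lower().split():
        for n in n_values:
            grams.extend(word[i:i + n] for i in range(len(word) - n + 1))
    return grams


def tfidf_vectors(sentences):
    """One sparse vector (dict n-gram -> weight) per sentence."""
    counts = [Counter(char_ngrams(s)) for s in sentences]
    # Document frequency: number of sentences that contain each n-gram.
    doc_freq = Counter(gram for c in counts for gram in c)
    n_docs = len(sentences)
    vectors = []
    for c in counts:
        vectors.append({
            # Smoothed IDF (assumption): ln((1 + N) / (1 + df)) + 1.
            gram: tf * (math.log((1 + n_docs) / (1 + doc_freq[gram])) + 1.0)
            for gram, tf in c.items()
        })
    return vectors


def sparse_cosine(u, v):
    """Cosine similarity between two sparse dict vectors."""
    dot = sum(weight * v[gram] for gram, weight in u.items() if gram in v)
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0
```

Two sentences then score highly only when they share many character sequences, which is why inflected forms of the same Finnish word still overlap while paraphrases with different vocabulary do not.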


3.2 Average of word embeddings 

This baseline represents each sentence using the average
of the vector representations of the words in the sentence.
We use the Finnish word embeddings created by Kanerva
et al. [9] and refer readers to that paper for further details
of the embeddings. These embeddings were induced using
the implementation of the skip-gram algorithm in the
word2vec software package on Finnish Common Crawl data.
The average of word embeddings produces dense, comparatively
low-dimensional representations that can capture the
similarity between words, but the representation of words is
independent of the context they appear in.
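
A minimal sketch of this baseline follows. It assumes the embeddings are available as a mapping from token to vector and that out-of-vocabulary tokens are simply skipped; the paper does not state how such tokens were handled, and the function name is hypothetical.

```python
def average_embedding(tokens, embeddings):
    """Mean of the word vectors of the tokens that have an embedding.

    Returns None when no token is in the vocabulary (assumption).
    """
    vectors = [embeddings[token] for token in tokens if token in embeddings]
    if not vectors:
        return None
    dim = len(vectors[0])
    return [sum(vec[i] for vec in vectors) / len(vectors) for i in range(dim)]
```

Since each word contributes the same vector regardless of its neighbours, word order and context have no effect on the result, which is the limitation noted above.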


3.3 LASER


The Language-Agnostic SEntence Representations (LASER) 
released by Facebook is a sentence embedding method that 
aims to achieve universality with respect to language and 
NLP task. The encoder can encode 93 languages, all of 
which share a byte-pair encoding vocabulary. The encoder
consists of a BiLSTM with a max-pooling operation,
coupled with an LSTM layer during training on parallel corpora
[1]. LASER produces dense, low-dimensional representations
that can capture the contextual meaning of words.


3.4 BERT 


Bidirectional Encoder Representations from Transformers
(BERT), introduced by Google, is a deep contextual language
representation model [4]. The training objectives of BERT
make it a cross-encoder, i.e. the model takes in a pair of
sentences at a time. However, we encode one sentence at
a time and use the mean-pooling of the resulting outputs
as the sentence representation. We use the uncased variant
of FinBERT, a monolingual Finnish BERT Base model
that has been demonstrated to provide better performance
in Finnish text processing tasks than multilingual BERT
[16]. Like LASER, BERT produces dense, low-dimensional
representations that account for context.
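
The mean-pooling step can be illustrated independently of any particular model. The sketch below assumes the encoder output is a list of per-token vectors accompanied by an attention mask that marks padding positions with 0; whether padding was masked in the original experiments is not stated, and the function name is hypothetical.

```python
def mean_pool(token_vectors, attention_mask):
    """Average the token vectors whose mask entry is non-zero."""
    kept = [vec for vec, keep in zip(token_vectors, attention_mask) if keep]
    if not kept:
        raise ValueError("no unmasked tokens to pool")
    dim = len(kept[0])
    return [sum(vec[i] for vec in kept) / len(kept) for i in range(dim)]
```

The pooled vector has the same dimensionality as a single token vector, so sentences of any length map to fixed-size representations that can be compared with cosine similarity.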


Proceedings of The 14th International Conference on Educational Data Mining (EDM 2021) 615 


Table 2: Results of the IR evaluation

Accounting standards | Avg First | Avg Med | Avg Mean | Avg Last | MRR  | MAP
TF-IDF               | 4%        | 9%      | 11%      | 24%      | 0.47 | 0.48
word2vec             | 6%        | 17%     | 20%      | 40%      | 0.47 | 0.34
LASER                | 4%        | 13%     | 15%      | 33%      | 0.53 | 0.42
BERT                 | 5%        | 15%     | 17%      | 37%      | 0.53 | 0.41
SBERT                | 5%        | 11%     | 14%      | 31%      | 0.46 | 0.42

Research methods     | Avg First | Avg Med | Avg Mean | Avg Last | MRR  | MAP
TF-IDF               | 1%        | 18%     | 24%      | 72%      | 0.46 | 0.28
word2vec             | 2%        | 26%     | 31%      | 79%      | 0.34 | 0.19
LASER                | 2%        | 19%     | 26%      | 73%      | 0.42 | 0.23
BERT                 | 1%        | 17%     | 23%      | 70%      | 0.49 | 0.28
SBERT                | 2%        | 17%     | 22%      | 65%      | 0.43 | 0.28

3.5 Sentence-BERT 

Sentence-BERT (SBERT) trains BERT models using Siamese
and/or triplet networks to induce a single-sentence encoder
specialized for cosine-similarity comparison [13]. We obtain
machine translated versions of the SNLI and MNLI
corpora using the English to Finnish Opus-MT model [15].
Finnish SBERT is subsequently trained from FinBERT-base-uncased
using these two natural language inference corpora.
Specifically, the model is fine-tuned for an epoch with learning
rate 2e-5 and batch size of 16, with mean pooling as the
pooling method. The representations produced by SBERT
are dense, low-dimensional, and context-sensitive, like those
of LASER and BERT.


4. EVALUATION 


The sentence representations are evaluated from IR and clustering
perspectives. Six evaluation metrics are used for the
IR approach: two well-known metrics, mean reciprocal rank
(MRR) and mean average precision (MAP), and four metrics
tailored to our specific task setting. The latter four, average of highest
rank (Avg first), average of median rank (Avg med), average
of mean rank (Avg mean), and average of lowest rank (Avg
last), measure the highest, median, mean, and
lowest rank of the relevant items respectively, as a percentage
of the whole (0% first rank, 100% last rank), averaged
over all items. These four metrics give more insight into the
distribution of the relevant retrievals by measuring where,
on average, the first, median, mean, and last relevant items
are ranked. Since some sentences have more than one label,
sentences with at least one overlapping label are considered
relevant retrievals for all metrics.
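
The six IR metrics can be sketched for a single query as follows. The input is the ranked list of candidates reduced to relevance flags (best-ranked first). The exact normalization of ranks to percentages is an assumption here: rank 1 maps to 0% and the last rank to 100%. Function and key names are hypothetical.

```python
from statistics import mean, median


def query_metrics(ranked_relevance):
    """Metrics for one query; ranked_relevance is a list of booleans."""
    n = len(ranked_relevance)
    ranks = [i + 1 for i, relevant in enumerate(ranked_relevance) if relevant]
    if not ranks or n < 2:
        return None  # queries without relevant items are skipped (assumption)

    def as_percent(rank):
        return 100.0 * (rank - 1) / (n - 1)

    # Precision at each relevant item, averaged to give average precision.
    precisions = [(hits + 1) / rank for hits, rank in enumerate(ranks)]
    return {
        "first": as_percent(ranks[0]),
        "med": as_percent(median(ranks)),
        "mean": as_percent(mean(ranks)),
        "last": as_percent(ranks[-1]),
        "rr": 1.0 / ranks[0],
        "ap": mean(precisions),
    }


def evaluate(all_rankings):
    """Average every per-query metric; mean 'rr' is MRR and mean 'ap' is MAP."""
    per_query = [m for m in map(query_metrics, all_rankings) if m is not None]
    return {key: mean(m[key] for m in per_query) for key in per_query[0]}
```

For all four rank-based metrics lower percentages are better, whereas for MRR and MAP higher values are better, which matters when reading Table 2.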


The clustering evaluation measures how well the clustering
induced by the vector embeddings corresponds to the clustering
induced by the sentence labels. Cluster accuracy is
based on the most frequent label of a cluster: for each cluster,
the majority label is obtained from the ground truth annotations
of the sentences in the cluster. A sentence is considered
to be correctly clustered if it has the majority label of
its cluster as one of its labels. The number of correctly and
incorrectly clustered sentences can then be interpreted as
an accuracy percentage. We note that random baseline performance
varies drastically between different datasets with
this metric, so accuracy values are not directly comparable
between datasets. Adjusted Rand index and adjusted mutual
information are established clustering metrics. We use
sampling to work around the multi-label nature of the annotations:
for each sentence with multiple labels, one label
is randomly chosen. Then the clusters are evaluated against
these labels with the two metrics. This process is repeated
50 times and the values of the metrics are subsequently averaged.
The resulting scores are between -1 and 1, and they
are adjusted for chance, so that a random clustering has a
score close to zero. We use the agglomerative clustering algorithm
with Ward linkage. Sentences that have no labels,
i.e. those containing a unique argument, are each given a unique
label for the purposes of the clustering evaluation, effectively
each forming one singleton cluster. The resulting true number
of clusters (60 for the research methods dataset and 95
for the accounting standards dataset) is the clustering model
input.


Table 3: Results of the two clustering evaluation methods.
Average adjusted Rand (Avg adj. Rand), Average adjusted
mutual information (Avg adj. mutual info.), Cluster accuracy
(Clus. acc.), Standard deviation (Std dev).

Accounting standards | Avg adj. Rand | Std dev | Avg adj. mutual info. | Std dev | Clus. acc.
TF-IDF               | 0.31          | 0.02    | 0.33                  | 0.02    | 73%
word2vec             | 0.18          | 0.02    | 0.23                  | 0.02    | 69%
LASER                | 0.21          | 0.01    | 0.27                  | 0.01    | 72%
BERT                 | 0.21          | 0.01    | 0.27                  | 0.02    | 72%
SBERT                | 0.28          | 0.01    | 0.33                  | 0.02    | 73%

Research methods     | Avg adj. Rand | Std dev | Avg adj. mutual info. | Std dev | Clus. acc.
TF-IDF               | 0.12          | 0.01    | 0.22                  | 0.01    | 55%
word2vec             | 0.05          | 0.00    | 0.13                  | 0.01    | 41%
LASER                | 0.08          | 0.00    | 0.17                  | 0.01    | 46%
BERT                 | 0.11          | 0.01    | 0.23                  | 0.01    | 50%
SBERT                | 0.11          | 0.01    | 0.22                  | 0.01    | 51%
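
The cluster accuracy and the label sampling used for the two adjusted metrics can be sketched as follows. This is an illustration under assumptions: the data layout (one set of labels per sentence), the naming of the unique labels, and the alphabetical tie-break for the majority label are all hypothetical choices not specified in the paper.

```python
import random
from collections import Counter


def fill_unique_labels(label_sets):
    """Give every unlabelled sentence its own unique label."""
    return [set(labels) if labels else {f"unique_{i}"}
            for i, labels in enumerate(label_sets)]


def cluster_accuracy(cluster_ids, label_sets):
    """Share of sentences that carry the majority label of their cluster."""
    filled = fill_unique_labels(label_sets)
    members = {}
    for index, cluster in enumerate(cluster_ids):
        members.setdefault(cluster, []).append(index)
    correct = 0
    for indices in members.values():
        counts = Counter(label for i in indices for label in filled[i])
        # Most frequent label; ties are broken alphabetically (assumption).
        majority = min(counts, key=lambda label: (-counts[label], label))
        correct += sum(1 for i in indices if majority in filled[i])
    return correct / len(cluster_ids)


def sample_single_labels(label_sets, rng):
    """Draw one label per sentence, as done before each adjusted-metric run."""
    filled = fill_unique_labels(label_sets)
    return [rng.choice(sorted(labels)) for labels in filled]
```

Each sampled labeling could then be scored against the predicted clusters with an adjusted Rand index or adjusted mutual information implementation (for example, the ones provided by scikit-learn), repeating the draw 50 times and averaging the scores.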


5. RESULTS 

The IR evaluation results are shown in Table 2. We find
that there is no single method that systematically outperforms
the others. Surprisingly, for the accounting standards
dataset, the advanced methods fail to outperform the TF-IDF
baseline, which achieves the best results for all metrics
except MRR. This indicates that while TF-IDF is not
the most competitive in consistently ranking relevant items
at the highest ranks, it is able to concentrate relevant items
towards higher ranks in general. This is particularly evident
for the average of the lasts metric, where TF-IDF scores 7
percentage points better than the second best performer, SBERT. Here
the number 24% indicates that, for the accounting standards
dataset, TF-IDF on average ranks all the relevant items
within rank 24 out of 100. The high performance of TF-IDF
on this dataset may be attributed partly to the essay
prompt requiring students to list the correct keywords. The
elements of the IFRS financial statements are only so many,
and these items cannot be paraphrased. Methods that compare
strings directly thus outperform methods that use dense
vector representations that approximate their meaning.


The research methods dataset, however, does not have such
a strong emphasis on exact keyword matching: there is
no fixed number of keywords that have to be mentioned
in the answers. Rather, the pros and cons of interviews as
a research method are described, and thus sentences that
describe the same concept using different words are more
likely to occur. On this dataset, considering the retrieval
of the first relevant item, both TF-IDF and BERT perform
best on the average of the firsts metric, while BERT performs
best on the mean reciprocal rank. Since the average
of the firsts metric is more lenient on lower rankings of first
relevant items, we can infer that BERT performs more consistently
on the retrieval of the first relevant item. Overall,
BERT-based methods obtain better results, with SBERT in
particular outperforming the other methods by 5 percentage points on
the retrieval of the last relevant items. BERT and SBERT
both obtain the best results on four out of six metrics.


The results of the clustering evaluation are summarized in
Table 3. These results clearly favour the TF-IDF
baseline, while the word2vec-based approach is the weakest,
tallying with the IR evaluation. Of the neural methods,
SBERT is particularly strong in the accounting standards
dataset, while being in line with BERT in the research
methods dataset. Of the two sentence embedding methods,
SBERT outperforms LASER in all tests. We are surprised
to find that the TF-IDF model seems to be better suited to
the clustering objective than the neural methods, and will
examine this in future work.


Overall, we find that the comparative ranking of the meth- 
ods varies strikingly depending on the dataset, evaluation 
setting, and metric. The dataset dependence may partly be 
explained by the nature of the arguments made: if the argu- 
ment is required to contain certain specific terms, TF-IDF 
can be a very strong method. On the other hand, if the argu- 
ment involves more abstract concepts that can be expressed 
in many ways, neural methods may have an advantage over 
methods that are based on exact string matching. While 
deep neural methods have led to breakthroughs in many 
NLP tasks, the gain they show here over the simple TF-IDF 
baseline is quite small even in the cases where they outper- 
form it. This may indicate challenges specific to the task and 
domain beyond those we have identified here, and calls for 
further research into the topic. This includes searching for
more suitable encoding methods, improving evaluation methods,
and studying how data should best be annotated
to develop methods serving the needs of essay graders.


6. DISCUSSION 


Our annotation makes at least two assumptions that call for
further investigation: the sentence is the unit of annotation,
and the labels are categorical and non-overlapping. Figure 1
shows that approximately 57% and 64% of the sentences in
the Accounting standards and Research methods datasets
(respectively) have exactly one label. Another 33% and 7%
(resp.) of sentences do not have any labels. Since labels
are only assigned if a main argument appears more than
once, these sentences can be seen as singleton clusters with a
label that occurs exactly once. With the current annotation
granularity, the annotation is best applicable to cases where
each sentence conveys a single main argument. However, the
annotation statistics indicate that the sentence may not always
be the most suitable unit of annotation. Problematic cases include
those where an argument is made across several sentences, and
those where a sentence makes several arguments.


In addition to issues related to the sentence as a unit of annotation,
there is also a degree of subjectivity to their labeling.
For example, in the Research methods dataset, the two la- 
bels workload and time_consuming, which state that inter- 
views are labor-intensive and time-consuming respectively, 
could arguably be merged. For such boundary decisions to 
be helpful for essay graders, the marking criteria play a cen- 
tral role and there is no universal cut-off. As an alterna- 
tive to disjoint categorical labels, one could consider that 
the arguments (and the labels that represent them) can be 
organized hierarchically. For instance, in the research meth- 
ods dataset, the label interviewer_influence represents 
the argument that the stance of the interviewer may affect 
the research results, and the label unnatural_performance 
describes the effect of the interview situation on the performance
of interviewees. On a higher level, both of the
labels convey the research results being negatively affected 
by artificial factors. For these two datasets, the boundary 
decisions also depend on the sample size: if there are more 
essays, chances are that a small number of students make 
the exact same argument, in which case the boundary is 
unambiguous, or could be seen as a subcluster of a bigger 
cluster. We hope to address these and related challenges in 
future work. 


One focus of our ongoing work is the practical use of the clus- 
ters. An approach to capitalizing on these clusters would be 
to make them manually adjustable, i.e. examiners can adjust 
the contents of the clusters, create new clusters, and delete 
clusters. These clusters can then be color-coded or anno- 
tated with text, indicating whether the presence of a certain 
cluster is desirable in an essay. In addition, if reference an- 
swers are available, essays with more overlapping clusters 
with the reference answers can be automatically identified. 


7. CONCLUSIONS AND FUTURE WORK 


We focused on the task of computer-assisted assessment of 
comparatively long essays through the perspectives of IR 
and clustering. We have created two datasets based on 
two exam questions from different fields, on which we tested 
several deep-learning methods with respect to their ability 
to retrieve and cluster sentences containing the same argu- 
ments paraphrased. We found no method to be universally 
best; rather, the results depend on the nature of the es- 
says under assessment. Overall, the difference between the 
state-of-the-art deep learning methods and the much sim- 
pler TF-IDF baseline is not numerically large, leaving clear 
room for further development and application of more ad- 
vanced methods for embedding meaning. Developing such 
methods, as well as further practical testing of the approach 
constitute our future work. 
ABSTRACT


The extraction of sentiment from text requires many method- 
ological decisions to make inferences about mood, opinion, 
and engagement in informal learning contexts. This study 
compares sentiment software (SentiStrength, LIWC, tidy- 
text, VADER) on N = 1,382,493 tweets in the context of the 
Next Generation Science Standards reform (N = 546,267) 
and U.S. State Educational Twitter Hashtags (N = 836,226). 
Automated sentiment classifications were validated on N = 
300 hand-coded tweets. Additionally, we developed a dis- 
crepancy measure to identify tweet features associated with 
scale inconsistency. Results indicated that binary sentiment 
classifications (positive/neutral vs. negative) were more ac- 
curate than trinary classifications (positive, neutral, neg- 
ative). Combined tidytext dictionaries and VADER out- 
performed LIWC for negative sentiment, which was overall 
difficult to classify reliably while positive sentiment was clas- 
sified with high accuracy across all four dictionaries. Thus, 
researchers are encouraged to (a) consider employing overall 
sentiment scales or positive/neutral to negative ratios based 
on binary classification to characterize their sample, (b) ag- 
gregate multiple dictionaries or use domain-specific senti- 
ment dictionaries, and (c) be aware of the current limitations 
of detecting negativity through dictionary-based sentiment 
analysis in educational contexts. 


Keywords 


Social media data, sentiment analysis, online communities 


1. INTRODUCTION 


Sentiment analysis extracts positive and negative emotions 
from text. Its many applications include stock market prediction
[22], marketing research [29], and, recently, investigating
public sentiment on educational reforms on Twitter
[32, 38]. Sentiment analysis typically requires numerous
methodological decisions, such as deciding whether to use a 
dictionary-based or a supervised machine learning approach 
and determining how sentiment measures are suited to the 
investigation of a particular domain (e.g., VADER for social 
media data) [13, 30]. 


User-defined sentiment dictionaries (UDDs) rely on matches 
of word occurrences with a value in their dictionary, with lit- 
tle overlap often yielding less valid results. Whereas many 
sentiment measure validation studies investigate binary (i.e., 
positive and negative) sentiment classifications [1, 25, 28], 
less research has systematically compared trinary classifica- 
tions (i.e., positive, neutral, negative) and sentiment scales. 
Furthermore, as sentiment classifiers do not generalize well 
across domains [2], sentiment validation studies are needed 
to inform educational researchers utilizing the increased avail- 
ability of big data in education [7]. This study examines the 
performance of popular sentiment analysis methods in the 
context of a particular, social media-based data source: large 
education-related Twitter communities. 


The motivation of this study is two-fold. First, sentiment 
measures can give insight into how and why teachers en- 
gage in online communities on Twitter, a potentially novel 
form of informal teacher learning [6, 33]. Second, public 
opinion and sentiment can be viewed as a proxy for success- 
ful reform implementation [4, 27]. Wang and Fikis applied 
SentiStrength on more than 660,000 tweets related to the 
Common Core State Standards, finding sentiment, includ- 
ing that expressed by teachers, to be largely negative [38]. 
In contrast, Rosenberg et al. found largely positive sen- 
timent in 570,000 NGSS-related tweets through the same 
SentiStrength algorithm [32]. However, the validity of the 
utilized sentiment methods was not examined. 


2. RESEARCH BACKGROUND 
2.1 Sentiment Analysis Methods and Tools 


Sentiment analysis is frequently carried out through user- 
defined dictionaries (UDDs) [9]. UDDs contain sets of la- 
beled words that are rated on affect dimensions (e.g., va- 
lence, potency, activity) and matched to word occurrences 
in texts [23]. Researchers can either use pre-defined dic- 
tionaries or create their own dictionaries [9]. UDD methods 
examine words individually, potentially neglecting figurative
language and ambiguous phrases [36]. This study examines 
four popular examples of UDD software: (a) SentiStrength, 
(b) Linguistic Inquiry and Word Count (LIWC), (c) the R- 
package tidytext, and (d) the social-media attuned software 
VADER. 
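
To make the dictionary-matching idea concrete, the following toy sketch scores a text against a small word list. The lexicon, its weights, and the function names are entirely hypothetical and do not reproduce any of the four tools examined here; the sketch only illustrates how word-level matching yields positive and negative scores and a trinary label.

```python
import re

# Hypothetical toy lexicon: word -> valence weight (not taken from any real tool).
TOY_LEXICON = {
    "great": 2, "love": 3, "helpful": 2,
    "confusing": -2, "hate": -3, "boring": -2,
}


def score_text(text, lexicon=TOY_LEXICON):
    """Return (positive, negative) strength from word-level dictionary matches."""
    tokens = re.findall(r"[a-z']+", text.lower())
    hits = [lexicon[token] for token in tokens if token in lexicon]
    positive = sum(value for value in hits if value > 0)
    negative = sum(-value for value in hits if value < 0)
    return positive, negative


def trinary_label(text, lexicon=TOY_LEXICON):
    """Positive, negative, or neutral, based on which strength is larger."""
    positive, negative = score_text(text, lexicon)
    if positive > negative:
        return "positive"
    if negative > positive:
        return "negative"
    return "neutral"
```

Because each word is matched in isolation, a phrase such as "not helpful at all" is labelled positive by this sketch, which illustrates the limitation with ambiguous or negated phrases noted above.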


SentiStrength outputs two truncated five-point scales [37],
which differs from many other UDD implementations.
It offers feature selection options and measures sentiment 
weight, which is the intensity or strength of positivity or 
negativity in a text, as opposed to simply comparing the 
frequency of sentiments in a text [14]. 


LIWC is possibly the most popular text analysis software 
[17]. It uses a well-validated default dictionary [16] and con- 
tains around eighty subdictionaries of topics for which it 
outputs individual scales [26]. LIWC has been used to in- 
fer psychological processes and constructs (e.g., emotional 
expressions) from text [36]. 


Tidytext [34] does not provide its own default dictionary.
At its core, it strives to pre-process input text which is then 
analyzed through any input dictionary [35]. Tidytext pro- 
vides functions for converting text into a “one-token-per- 
document-per-row” format which may ease text analysis. 


VADER (Valence Aware Dictionary for Sentiment Reason- 
ing) features multiple subdictionaries and considers word 
order and degree modifiers (e.g., “very”, “slightly”, “somewhat”) 
[5]. It performs well in sentiment analyses of social 
media content (including from Twitter) while remaining ap- 
plicable to other contexts [5, 13]. That said, we found the R 
implementation of VADER to take around 80 times longer 
to compute compared to the other methods. 


2.2 Research Questions 

This study examined the validity of SentiStrength, LIWC, 
tidytext, and VADER in the context of educational Twitter 
data with the following research questions (RQs): 


RQ1: How valid are the employed sentiment measures with 
respect to human coding of sentiment? 


RQ2: How discrepant are sentiment scales and are correla- 
tions among scales consistent with these discrepancies? 


RQ3: Which features of texts (i.e., the number of words, 
likes, retweets, and context) account for scale discrepancy? 


3. METHOD 

3.1 Sample 

The study utilized tweets related to the Next Generation 
Science Standards reform and large educational state-wide 
hashtags (1,382,493 tweets, 156,446 users) posted between 
July 2008 and October 2020. Search terms included the 
#NGSSchat hashtag (N = 175,094 tweets, N = 67,060 of 
which were posted inside designated chat sessions), the terms 
“ngss” (without #NGSSchat, N = 312,167 tweets) or “next 
gen[eration] science standard[s]” (N = 59,006 tweets). In addition, 
we included tweets from 47 State Educational Twitter 
Hashtags (N = 836,226). Tweets not recognized as 
English by the Twitter API were omitted (5.0%). 


3.2 Sentiment Measures 

To investigate the validity of different sentiment measures, 
we used SentiStrength [37], LIWC [26], tidytext [34], and 
VADER [13] to obtain (a) binary and trinary classifications, 
(b) unidimensional (positive and negative) sentiment scales, 
and (c) a bidimensional sentiment scale rating for all tweets. 
SentiStrength provides its own binary and trinary classification 
methods. For LIWC and tidytext, we subtracted negativity 
ratings from positivity ratings to obtain overall scores 
and defined a tweet as neutral if that overall rating was 0 
(positive if above 0, negative if below 0). For tidytext, we used 
the NRC [24], Loughran-McDonald [20], AFINN [10], and 
Bing [12] dictionaries, standardizing ratings by the number 
of words in each tweet and averaging across available ratings. 
The remaining non-matches were assigned a 0. For VADER, 
we used its internal compound score as overall scale and 
classified tweets as neutral if that score was between -0.05 
and 0.05 (instead of 0) [13]. Binary classification combined 
positive and neutral tweets, such that neutral tweets were 
coded as positive, as done in previous validation studies [8] 
and since we observed that SentiStrength always classified 
tweets rated neutral in its trinary method as positive in its 
binary method. Additionally, we defined ambiguity measures 
for all sentiment dictionaries as the sum of the absolute val- 
ues of their positivity and negativity ratings. 
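
The rules above can be summarized in a short sketch. The function names, the `neutral_band` parameter, and the handling of scores that fall exactly on the band boundary are assumptions made for illustration; ratings are treated as magnitudes, so a negativity rating reported as -4 is handled like 4.

```python
def overall_score(positivity, negativity):
    """Overall scale: positivity minus negativity, both taken as magnitudes."""
    return abs(positivity) - abs(negativity)


def trinary_label(score, neutral_band=0.0):
    """Classify an overall score as positive, neutral, or negative.

    neutral_band=0.0 mirrors the rule described for LIWC and tidytext;
    neutral_band=0.05 mirrors the rule described for VADER's compound score.
    Scores exactly on the boundary are treated as neutral (an assumption).
    """
    if score > neutral_band:
        return "positive"
    if score < -neutral_band:
        return "negative"
    return "neutral"


def binary_label(score, neutral_band=0.0):
    """Binary classification: neutral tweets are coded as positive."""
    if trinary_label(score, neutral_band) == "negative":
        return "negative"
    return "positive/neutral"


def ambiguity(positivity, negativity):
    """Ambiguity: sum of the absolute positivity and negativity ratings."""
    return abs(positivity) + abs(negativity)
```

For example, a tweet rated 2 on positivity and 2 on negativity has an overall score of 0 and an ambiguity of 4, so it is neutral in the trinary scheme and positive/neutral in the binary scheme.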


3.3 Additional Variables 


Continuous predictor variables included the number of likes, 
retweets, and words (excluding links and user mentions) of 
each tweet. To account for some features of the specific data 
sets we analyzed, we created a categorical predictor variable 
indicating whether a tweet was from the NGSS or SETHs 
data set (and, for the NGSS data set, whether the tweet 
was posted inside of #NGSSchat, designated chat-sessions 
of #NGSSchat, or included the term “ngss”). 


3.4 Data Analysis 
3.4.1 Hand-coding and validation 


To provide a validation set of tweets to investigate how 
UDDs compare to human-evaluated sentiment (RQ1), two 
raters hand-coded 300 randomly sampled tweets on two 1-5 
scales for positivity and negativity, similar to SentiStrength. 
Our two raters reached an agreement of κ = 0.728 for positivity 
and κ = 0.689 for negativity after coding 70 tweets 
independently, fulfilling common thresholds for satisfactory 
agreement [21]. After discussing and resolving any disagree- 
ments, an additional 230 tweets were coded independently. 
The binary and trinary sentiment classifications of human 
coders were assigned analogously to how they were created 
for the other UDDs. We calculated accuracy, precision, 
recall, and F-score for each category in each classification 
method (binary and trinary). 
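
As an illustration of these metrics, the sketch below computes accuracy and per-category precision, recall, and F-score from paired human and dictionary labels. The function names and the toy labels are hypothetical; the F-score is taken to be the usual harmonic mean of precision and recall, which is an assumption about the exact variant used.

```python
def f_score(precision, recall):
    """Harmonic mean of precision and recall (0.0 if both are zero)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


def class_metrics(gold, predicted, label):
    """Precision, recall, and F-score for one category of a classification."""
    pairs = list(zip(gold, predicted))
    tp = sum(1 for g, p in pairs if g == label and p == label)
    fp = sum(1 for g, p in pairs if g != label and p == label)
    fn = sum(1 for g, p in pairs if g == label and p != label)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall, f_score(precision, recall)


def accuracy(gold, predicted):
    """Share of items on which the method agrees with the human label."""
    return sum(1 for g, p in zip(gold, predicted) if g == p) / len(gold)


# Toy example with hypothetical labels (not the study's data).
human = ["neg", "neg", "pos", "pos", "pos"]
method = ["neg", "pos", "pos", "pos", "neg"]
neg_precision, neg_recall, neg_f = class_metrics(human, method, "neg")
```

Applied to rounded table values, `f_score(0.36, 0.40)` rounds to 0.38 and `f_score(0.50, 0.20)` rounds to 0.29, matching the negative-category F-scores reported for SentiStrength and LIWC.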


3.4.2 Scale consistency and discrepancy index 

To quantify scale discrepancy for RQ2, we normalized the 
sentiment scales to M = 0 and SD = 1, accounting for Sen- 
tiStrength’s truncation of scales at |5| (contrasting LIWC, 
tidytext and VADER). As a discrepancy index, we calcu- 
lated the absolute difference between normalized scales for 
positivity, negativity, and overall scales for all six pairs of 
sentiment measures. For each tweet and scale type, the to- 
tal scale discrepancy was summed up and divided by the 
number of comparisons. As a robustness check for our discrepancy 
measure, we calculated pairwise scale correlations 
between methods. 
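
A minimal sketch of the discrepancy index follows. It standardizes each method's scale and averages the absolute pairwise differences per tweet. The function names are hypothetical, the population standard deviation is an assumption, and the adjustment for SentiStrength's truncated scales is not reproduced here.

```python
from itertools import combinations
from statistics import mean, pstdev


def standardize(values):
    """Rescale a list of scores to M = 0 and SD = 1."""
    m = mean(values)
    s = pstdev(values)
    if s == 0:
        return [0.0 for _ in values]
    return [(v - m) / s for v in values]


def discrepancy_index(scales):
    """Mean absolute difference between standardized scales, per tweet.

    scales maps a method name to its list of scores (one score per tweet,
    same tweet order for every method). With four methods there are six
    pairwise comparisons.
    """
    z = {name: standardize(values) for name, values in scales.items()}
    pairs = list(combinations(z, 2))
    n_tweets = len(next(iter(z.values())))
    return [
        sum(abs(z[a][i] - z[b][i]) for a, b in pairs) / len(pairs)
        for i in range(n_tweets)
    ]
```

Two methods that rank three tweets in exactly opposite order, for example `{"A": [1, 2, 3], "B": [3, 2, 1]}`, yield a discrepancy of zero for the middle tweet and a large discrepancy for the two outer tweets.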


3.4.3 Predictive modeling of scale discrepancy 

To examine RQ3, we fitted three ordinary least squares 
linear regression models to predict discrepancy in the (a) 
positivity, (b) negativity, and (c) overall scales through var- 
ious tweet properties. Model assumptions (normal distri- 
bution of residuals, homoscedasticity, linearity assumptions 
and leverage) were investigated through graphical model 
tests in R. Robust standard errors (HC3 estimator [19]) were 
used to address residual heteroscedasticity for discrepancy in 
the positivity scales. Independent variables included tweet 
context and the number of words, likes, and retweets of a 
tweet. We also included binary classifications (0: negative, 
1: positive or neutral) to investigate whether scale discrep- 
ancies varied with sentiment polarity and ambiguity ratings 
to estimate whether tweets being high in positivity and nega- 
tivity were less consistently rated than tweets with less emo- 
tional valence. All independent variables had a generalized 
variance inflation factor (GVIF) of less than 5 [3]. 


4. RESULTS 
4.1 Validation of Sentiment Measures (RQ1) 


4.1.1 Dictionary coverage 

Coverage characterizes the fit of user-defined dictionaries 
with the data. Coverage is the relative frequency of texts 
that had at least one match inside a specific dictionary. We 
observed a coverage of 58.91% for SentiStrength, 55.7% for 
LIWC, and 67.7% for VADER. The combined tidytext dic- 
tionaries had a coverage of 84.9%. As subdictionaries, cover- 
age was 70.5% for NRC, 62.0% for AFINN, 56.9% for Bing, 
and 34.9% for the Loughran-McDonald dictionary. 
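
Coverage as defined above can be sketched as follows. The tokenizer (lowercased alphabetic tokens) and the toy dictionaries are assumptions for illustration, and each tool may match words differently (e.g., via stems or wildcards).

```python
import re


def tokenize(text):
    """Split a text into lowercase word tokens (a simplifying assumption)."""
    return re.findall(r"[a-z']+", text.lower())


def coverage(texts, dictionary):
    """Relative frequency of texts with at least one dictionary match."""
    matched = sum(
        1 for text in texts if any(token in dictionary for token in tokenize(text))
    )
    return matched / len(texts)


# Toy example with hypothetical dictionaries (not the study's data).
tweets = ["Great lesson today", "Meeting at noon", "This is awful"]
dictionary_a = {"great"}
dictionary_b = {"awful"}
combined = dictionary_a | dictionary_b
```

Here `coverage(tweets, dictionary_a)` is one third, while the combined dictionary covers two thirds of the texts, which illustrates why aggregating several dictionaries can only keep or raise coverage.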


4.1.2 Hand-coded tweets and validation 

Comparing human coders with SentiStrength’s scale ratings 
(LIWC, tidytext and VADER do not output 1-5 scales; we 
describe these later in this section), we found a moderate 
two-way random effects ICC for absolute agreement [18] for 
both the positivity scale, ICC2k = 0.690 [0.57, 0.77] and the 
combined, overall scale, ICC2k = 0.683 [0.59, 0.75]. The 
negativity scale exhibited worse agreement, ICC2k = 0.448 
[0.31, 0.56]. Notably, Cohen’s kappa ratings were not satisfactory 
with κ = 0.301 for positivity, κ = 0.270 for the 
overall scale, and κ = 0.183 for negativity [21]. 


Tables 1 and 2 describe the validity of the binary and tri- 
nary classifications for SentiStrength, LIWC, tidytext, and 
VADER. We found binary classifications to have higher accuracy 
scores than trinary classifications (ranging from 85.00% 
to 88.33% and 56.33% to 67.00%, respectively). Notably, 
we found classifications of negative tweets to be less accurate 
than for positive tweets, with F-scores of tidytext and 
VADER (0.45 and 0.44, respectively) being higher compared 
to SentiStrength and LIWC (0.38 and 0.29, respectively). To 
test whether these differences were significant or random, we 
ran permutation tests with 250,000 simulations [39]. Compared 
to LIWC, tidytext and VADER improved (albeit marginally) 
significantly (p = .058 and p = .047, respectively); compared 
to SentiStrength, they did not. Note, however, that only 11.67% 
of tweets were rated as negative by human coders. 
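
One common way to run such a test is paired approximate randomization: the two methods' predictions are randomly swapped tweet by tweet, and the F-score difference is recomputed each time. The sketch below follows that scheme; whether it matches the exact procedure used here is an assumption, and all names and the default number of simulations are hypothetical.

```python
import random


def f_score_for(gold, predicted, label):
    """F-score of one category given gold and predicted labels."""
    pairs = list(zip(gold, predicted))
    tp = sum(1 for g, p in pairs if g == label and p == label)
    fp = sum(1 for g, p in pairs if g != label and p == label)
    fn = sum(1 for g, p in pairs if g == label and p != label)
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


def randomization_test(gold, pred_a, pred_b, label, n_simulations=10000, seed=0):
    """Two-sided p-value for the F-score difference between two methods."""
    rng = random.Random(seed)
    observed = abs(
        f_score_for(gold, pred_a, label) - f_score_for(gold, pred_b, label)
    )
    at_least_as_extreme = 0
    for _ in range(n_simulations):
        shuffled_a, shuffled_b = [], []
        for a, b in zip(pred_a, pred_b):
            # Under the null hypothesis the two methods are exchangeable,
            # so each pair of predictions is swapped with probability 0.5.
            if rng.random() < 0.5:
                a, b = b, a
            shuffled_a.append(a)
            shuffled_b.append(b)
        difference = abs(
            f_score_for(gold, shuffled_a, label)
            - f_score_for(gold, shuffled_b, label)
        )
        if difference >= observed:
            at_least_as_extreme += 1
    # Add-one correction keeps the estimate strictly above zero.
    return (at_least_as_extreme + 1) / (n_simulations + 1)
```

With few negative tweets in the validation set, only a handful of swaps can change the negative-category F-score, so p-values from such a test should be read with caution.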


Table 1: Binary validation results 


            SentiStr.            LIWC 
Accuracy    85.50                88.33 
            Pos/Neut   Neg       Pos/Neut   Neg 
Precision   0.92       0.36      0.90       0.50 
Recall      0.91       0.40      0.97       0.20 
F-Score     0.91       0.38      0.94       0.29 

            tidytext             VADER 
Accuracy    87.00                88.33 
            Pos/Neut   Neg       Pos/Neut   Neg 
Precision   0.93       0.44      0.93       0.50 
Recall      0.92       0.46      0.94       0.40 
F-Score     0.93       0.45      0.93       0.44 


Note: Positive Tweets are either positive or neutral in binary 
classification. Support: 265 Pos/Neut, 35 Neg 


Table 2: Trinary validation results 


            SentiStr.               LIWC 
Accuracy    66.00                   67.00 
            Pos     Neut    Neg     Pos     Neut    Neg 
Precision   0.66    0.75    0.36    0.65    0.71    0.50 
Recall      0.77    0.63    0.40    0.78    0.69    0.20 
F-Score     0.71    0.69    0.38    0.71    0.70    0.29 

            tidytext                VADER 
Accuracy    56.33                   65.33 
            Pos     Neut    Neg     Pos     Neut    Neg 
Precision   0.50    0.88    0.44    0.59    0.79    0.50 
Recall      0.90    0.33    0.46    0.84    0.57    0.40 
F-Score     0.64    0.48    0.45    0.69    0.66    0.44 


Note: Support: 115 Pos, 150 Neut, 35 Neg 


4.2 Consistency of Sentiment (RQ2) 


4.2.1 Positivity scale 

For positivity scales, LIWC and VADER were the most con- 
sistent with each other based on scale correlation (r = .83) 
and mean discrepancy (0.41 SDs) followed by tidytext and 
VADER (r = .71, 0.53 SDs) and LIWC and tidytext (r = 
.63, 0.61 SDs). On average, positivity scales yielded pair- 
wise correlations of r = .63 and scale discrepancies of 0.60 
SDs (Table 3). 


4.2.2 Negativity scale 

For negativity scales, LIWC and VADER were the most con- 
sistent with each other based on scale correlation (r = .68) 
and mean discrepancy (0.33 SDs) followed by SentiStrength 
and LIWC if based on scale correlation (r = .61, 1.14 SDs) 
and LIWC and tidytext if based on scale discrepancy (r = 
.09, 0.59 SDs). On average, negativity scales yielded pair- 
wise correlations of r = .35 and scale discrepancies of 0.83 
SDs (Table 3). 


4.2.3 Overall scale 

For overall scales, LIWC and VADER appeared to be closest 
based on scale correlation (r = .69) and mean discrepancy 
(0.54 SDs) followed by LIWC and tidytext (r = .65, 0.59 
SDs), SentiStrength and VADER (r = .64, 0.65 SDs), and 
SentiStrength and LIWC (r = .56, 0.64 SDs), respectively. 
On average, overall scales yielded pair-wise correlations of r 
= .61 and scale discrepancies of 0.64 SDs (Table 3). 


Table 3: Pairwise scale correlations (Corr) and discrepancy 
(Disc) for positivity, negativity, and overall scales of SentiStrength 
(SS), LIWC (LI), tidytext (TT), and VADER (VA) 


          Pos             Neg             Overall 
          Corr    Disc    Corr    Disc    Corr    Disc 
SS, LI    0.54    0.64    0.61    1.14    0.56    0.64 
LI, TT    0.63    0.61    0.09    0.59    0.65    0.59 
SS, TT    0.48    0.78    0.13    1.04    0.52    0.74 
SS, VA    0.59    0.66    0.58    1.19    0.64    0.65 
LI, VA    0.83    0.41    0.68    0.33    0.69    0.54 
TT, VA    0.71    0.53    -0.01   0.69    0.60    0.65 
Mean      0.63    0.60    0.35    0.83    0.61    0.64 


Table 4: Linear models predicting aggregated scale discrepancy 
measures; N = 1,382,493 


Predictor                Pos         Neg         Overall 
(Intercept)              -0.56***    1.5?***     0.23*** 
Number of Words          -0.00***    0.01***     0.01*** 
Number of Likes          0.00        0.00        0.00 
Number of Retweets       0.00***     0.00***     0.00*** 
Context [#NGSSchat]      0.02***     -0.01***    0.03*** 
Context [SETHs]          -0.01***    0.05***     -0.02*** 
Context [Chat Hour]      0.00        -0.03***    0.00 
Ambiguity [SentiStr.]    0.07***     0.21***     0.07*** 
Ambiguity [LIWC]         0.1?        0.14        0.11 
Ambiguity [tidytext]     0.08***     0.19***     0.10*** 
Ambiguity [VADER]        0.04***     0.06***     0.04*** 
SentiStr. Binary [1]     0.16***     -0.64***    0.00 
LIWC Binary [1]          0.29***     -0.28***    0.30*** 
tidytext Binary [1]      0.30***     -0.47***    0.11*** 
VADER Binary [1]         0.14***     -0.48***    -0.03*** 
R²                       0.22        0.75        0.24 


Note: ***p<0.001 **p<0.01 *p<0.05. 


4.3 Understanding Scale Discrepancies (RQ3) 
Linear models for evaluating scale discrepancies included 
four notable associations between tweet properties and scale 
discrepancies (Table 4). First, all four ambiguity measures 
were positively associated with scale discrepancy measures 
across all three models, most notably SentiStrength’s ambiguity 
measure with negativity scale discrepancy, β = 0.21, 
t(1382478) = 229.30, p < .001. Second, for binary classifications 
(i.e., positive/neutral vs. negative), negative tweets 
tended to have higher negativity discrepancy and vice versa. 
For example, discrepancy in negativity scales was negatively 
associated with tweets classified as positive/neutral by SentiStrength, 
β = -0.64, t(1382478) = -281.13, p < .001. Meanwhile, 
tidytext classifying tweets as positive was associated 
with increased positivity discrepancy, β = 0.30, t(1382478) 
= 129.71, p < .001. Third, text- and tweet-specific variables 
(e.g., number of words, likes, and retweets) did not seem to 
be associated with scale discrepancy, while tweet context 
had a small effect size. For instance, tweets from State Educational 
Twitter Hashtags were positively associated with 
negativity scale discrepancy, β = 0.05, t(1382478) = 52.86, 
p < .001. Fourth, the explained variance in scale discrep- 
ancy was highest for negativity scales at 75.3%, followed by 
overall scales (23.5%) and positivity scales (22.4%). 


5. DISCUSSION 
5.1 Key Findings 
This study evaluates sentiment analysis methods on educa- 
tional Twitter data. Our three main findings are as follows: 


First, negative sentiment is difficult to reliably detect with 
dictionary approaches. This could be due to nuanced lin- 
guistic markers (e.g., sarcasm) that require advanced algo- 
rithms to be detected [31]. While this finding aligns with 
previous work [30], it contrasts with initial validations of 
SentiStrength on a set of around 1,000 MySpace comments 
[37]. Nonetheless, this highlights the importance of validating 
commonly used sentiment analysis tools across multiple 
contexts. Thus, we encourage researchers to carefully exam- 
ine how negativity may be expressed in their study context. 


Second, in the context of educational Twitter data, binary 
sentiment classifications that combine positive and neutral 
sentiment are substantially more robust than trinary clas- 
sifications. Thus, researchers may consider computing the 
ratio of negative to positive/neutral tweets, similar to a re- 
cent Twitter study on the Common Core State Standards 
[38]. For a continuous variable, our findings suggest using an 
overall scale, as discrepancy in negativity was substantially 
associated with ambiguity and binary classifications. 


Third, in the context of educational Twitter data, tidytext 
and VADER produce more accurate classifications of neg- 
ative sentiment than LIWC. Notably, tidytext also has the 
highest dictionary coverage. Thus, educational researchers 
are encouraged to aggregate multiple dictionaries or to cre- 
ate domain-specific sentiment dictionaries for more reliable 
measures of negative sentiment, for instance, similar to a 
previous study investigating political expression [11]. 


5.2 Limitations 

This study has two notable limitations. First, the sample 
size of the hand-coded validation data is relatively small (N = 300). That 
said, it is comparable to sample sizes of previous sentiment 
measure validation studies [28]. Similarly, the scarcity of negative 
tweets in our validation data (11.67%) may limit inferences 
about that particular type of sentiment. The number 
of negative tweets is considerably smaller compared to pre- 
vious validation studies utilizing text data from sources such 
as MySpace, Twitter, BBC forums, and YouTube that in- 
clude up to 86.84% negative sentiment [8]. Therefore, future 
validation studies should deliberately sample more negative 
tweets [13] or sample from contexts with higher expected 
negativity, such as Common Core State Standards hashtags 
[38]. Second, this study focuses on dictionary-based senti- 
ment analysis, while future studies might also consider fea- 
ture extraction and word co-occurrence methods [15]. 


5.3 Implications 

This study highlights the importance of coverage, validity, 
and scale discrepancy in sentiment analysis, specifically for 
negative sentiment. For educational Twitter data, this study 
recommends using binary classifications or overall scales, 
preferably derived from tidytext or VADER, and encourages 
replication studies¹ across more educational contexts. 


¹Code: https://github.com/jrosen48/comparing-sentiment 




REFERENCES 

[1] A. Agarwal, B. Xie, I. Vovsha, O. Rambow, and 
R. Passonneau. Sentiment analysis of Twitter data. In 
Proceedings of the workshop on language in social 
media (LSM 2011), pages 30-38, 2011. 

[2] A. Aue and M. Gamon. Customizing sentiment 
classifiers to new domains: A case study. In 
Proceedings of recent advances in natural language 
processing (RANLP), volume 1, pages 2-1. Citeseer, 
2005. 

[3] T. A. Craney and J. G. Surles. Model-dependent 
variance inflation factor cutoff values. Quality 
Engineering, 14(3):391-403, 2002. 

[4] A. Edgerton. Learning from standards deviations: 
Three dimensions for building education policies that 
last. American Educational Research Journal, 
57(4):1525-1566, 2020. 

[5] S. Elbagir and J. Yang. Twitter sentiment analysis 
using natural language toolkit and vader sentiment. In 
Proceedings of the International MultiConference of 
Engineers and Computer Scientists, volume 122, 
page 16, 2019. 

[6] C. Fischer, B. Fishman, and S. Y. Schoenebeck. New 
contexts for professional learning: Analyzing high 
school science teachers’ engagement on Twitter. 
AERA Open, 5(4), 2019. 

[7] C. Fischer, Z. Pardos, R. Baker, J. Williams, 
P. Smyth, R. Yu, S. Slater, R. Baker, and 
M. Warschauer. Mining big data in education: 
Affordances and challenges. Review of Research in 
Education, 44(1):130-160, 2020. 

[8] P. Gonçalves, M. Araújo, F. Benevenuto, and M. Cha. 
Comparing and combining sentiment analysis 
methods. In Proceedings of the first ACM conference 
on Online social networks, pages 27-38, 2013. 

[9] J. Grimmer and B. Stewart. Text as data: The 
promise and pitfalls of automatic content analysis 
methods for political texts. Political analysis, 
21(3):267-297, 2013. 

[10] L. Hansen, A. Arvidsson, F. Nielsen, E. Colleoni, and 
M. Etter. Good friends, bad news-affect and virality in 
twitter. In Future information technology, pages 
34-43. Springer, 2011. 

[11] M. Haselmayer and M. Jenny. Sentiment analysis of 
political communication: combining a dictionary 
approach with crowdcoding. Quality & quantity, 
51(6):2623-2646, 2017. 

[12] M. Hu and B. Liu. Mining and summarizing customer 
reviews. In Proceedings of the tenth ACM SIGKDD 
international conference on Knowledge discovery and 
data mining, pages 168-177, 2004. 

[13] C. Hutto and E. Gilbert. Vader: A parsimonious 
rule-based model for sentiment analysis of social 
media text. In Proceedings of the International AAAI 
Conference on Web and Social Media, volume 8, 2014. 

[14] M. Ibrahim. Extracting weight in Twitter 
SentiStrength dataset to determine sentiment polarity. 
Journal of Information Systems Research and 
Innovation, 10(3):245-265, 2016. 

[15] R. Iliev, M. Dehghani, and E. Sagi. Automated text 
analysis in psychology: Methods, applications, and 
future developments. Language and Cognition, 
7(2):265-290, 2015. 

[16] J. Kahn, R. Tobin, A. Massey, and J. Anderson. 
Measuring emotional expression with the linguistic 
inquiry and word count. The American journal of 
psychology, pages 263-286, 2007. 

[17] M. Kern, G. Park, J. Eichstaedt, A. Schwartz, M. Sap, 
L. Smith, and L. Ungar. Gaining insights from social 
media language: Methodologies and challenges. 
Psychological methods, 21(4):507, 2016. 

[18] T. K. Koo and M. Y. Li. A guideline of selecting and 
reporting intraclass correlation coefficients for 
reliability research. Journal of chiropractic medicine, 
15(2):155-163, 2016. 

[19] J. S. Long and L. H. Ervin. Using heteroscedasticity 
consistent standard errors in the linear regression 
model. The American Statistician, 54(3):217-224, 
2000. 

[20] T. Loughran and B. McDonald. When is a liability not 
a liability? Textual analysis, dictionaries, and 10-ks. 
The Journal of Finance, 66(1):35-65, 2011. 

[21] M. McHugh. Interrater reliability: the kappa statistic. 
Biochemia medica, 22(3):276-282, 2012. 

[22] A. Mittal and A. Goel. Stock prediction using Twitter 
sentiment analysis. Stanford University, CS229, 15, 
2012. 

[23] S. Mohammad. Sentiment analysis: Detecting valence, 
emotions, and other affectual states from text. In 
Emotion Measurement, pages 201-237. Woodhead 
Publishing, 2016. 

[24] S. Mohammad and P. Turney. NRC emotion lexicon. 
Technical report, National Research Council, Canada, 
2013. 

[25] I. Mozetič, L. Torgo, V. Cerqueira, and J. Smailović. 
How to evaluate sentiment classifiers for Twitter 
time-ordered data? PloS one, 13(3):e0194317, 2018. 

[26] J. Pennebaker, M. Francis, and R. Booth. Linguistic 
inquiry and word count: LIWC 2001. Mahwah: 
Lawrence Erlbaum Associates, 71, 2001. 

[27] M. Polikoff, T. Hardaway, J. Marsh, and D. Plank. 
Who is opposed to Common Core and why? 
Educational Researcher, 45(4):263-266, 2016. 

[28] R. Prabowo and M. Thelwall. Sentiment analysis: A 
combined approach. Journal of Informetrics, 
3(2):143-157, 2009. 

[29] M. Rambocas, J. Gama, et al. Marketing research: 
The role of sentiment analysis. Technical report, 
Universidade do Porto, Faculdade de Economia do 
Porto, 2013. 

[30] F. Ribeiro, M. Araújo, P. Gonçalves, M. A. Gonçalves, 
and F. Benevenuto. Sentibench - a benchmark 
comparison of state-of-the-practice sentiment analysis 
methods. EPJ Data Science, 5(1):1-29, 2016. 

[31] E. Riloff, A. Qadir, P. Surve, L. De Silva, N. Gilbert, 
and R. Huang. Sarcasm as contrast between a positive 
sentiment and negative situation. In Proceedings of the 
2013 conference on empirical methods in natural 
language processing, pages 704-714, 2013. 

[32] J. Rosenberg, C. Borchers, E. Dyer, D. Anderson, and 
C. Fischer. Advancing new methods for understanding 
public sentiment about educational reforms: The case 
of Twitter and the Next Generation Science 
Standards. OSF Preprints, 2020. 

[33] J. Rosenberg, J. Reid, E. Dyer, M. Koehler, 
C. Fischer, and T. McKenna. Idle chatter or 
compelling conversation? The potential of the social 
media-based #NGSSchat network for supporting science 
education reform efforts. Journal of Research in 
Science Teaching, 57(9):1322-1355, 2019. 

[34] J. Silge and D. Robinson. tidytext: Text mining and 
analysis using tidy data principles in R. JOSS, 1(3), 
2016. 

[35] J. Silge and D. Robinson. Text mining with R: A tidy 
approach. O’Reilly Media, Inc., 2017. 

[36] Y. Tausczik and J. Pennebaker. The psychological 
meaning of words: LIWC and computerized text 
analysis methods. Journal of language and social 
psychology, 29(1):24-54, 2010. 

[37] M. Thelwall, K. Buckley, G. Paltoglou, D. Cai, and 
A. Kappas. Sentiment strength detection in short 
informal text. Journal of the American society for 
information science and technology, 61(12):2544-2558, 
2010. 

[38] Y. Wang and D. Fikis. Common Core State Standards 
on Twitter: Public sentiment and opinion leaders. 
Educational Policy, 33(4):650-683, 2019. 

[39] P. Yeh. More accurate tests for the statistical 
significance of result differences. arXiv preprint 
cs/0008005, 2000. 




Knowledge Tracing Models’ Predictive Performance when 
a Student Starts a Skill 


Jiayi Zhang, Rohini Das, Ryan S. Baker, Richard Scruggs 


University of Pennsylvania 
{joycez, rybaker, rscr}@upenn.edu, rohinidas604@gmail.com 


ABSTRACT 


Previous studies on the accuracy of knowledge tracing models 
have typically considered the performance of all student actions. 
However, this practice ignores the difference between students’ 
initial and later attempts on the same skill. To be effective for uses 
such as mastery learning, a knowledge tracing model should be 
able to infer student knowledge and performance on a skill after 
the student has practiced that skill a few times. However, a 
model’s initial performance prediction — on the first attempt at a 
new skill — has a different meaning. It indicates how successful a 
model is at inferring student performance on a skill from both 
their performance on other skills and from the difficulty and other 
properties of the first item the student encounters. As such, it may 
be relevant to differentiate prediction in these two contexts when 
evaluating a knowledge tracing model. In this paper, we describe 
model performance at a more granular level and examine the 
consistency of model performance across the number of student 
instances on a given skill. Results from our research show that 
much of the difference in performance between classic algorithms 
such as BKT (Bayesian Knowledge Tracing) and PFA 
(Performance Factors Analysis), as compared to a modern 
algorithm such as DKVMN (Dynamic Key-Value Memory 
Networks), comes down to the first attempts of a skill. Model 
performance is much more comparable by the time the student 
reaches their third attempt at a skill. Thus, while there are many 
benefits to using contemporary knowledge tracing algorithms, 
they may not be as different as previously thought in terms of 
mastery learning. 


Keywords 


Knowledge Tracing, Cold Start, Deep Knowledge Tracing, 
Bayesian Knowledge Tracing, Performance Factors Analysis, 
Dynamic Key-Value Memory Networks 


1. INTRODUCTION 


Knowledge Tracing (KT), attempting to measure student 
knowledge through performance during learning, is a critical 
component in modern intelligent tutoring systems and adaptive 
learning systems [18]. These models use students’ previous 
performance to predict their proficiency on latent knowledge and 
infer their likelihood of success in future attempts within the 
learning system. 


For well over a decade, Bayesian Knowledge Tracing (BKT; [5]) 
was the dominant algorithm in research on knowledge tracing — it 
remains the dominant algorithm in use in systems used at scale by 
students today. Later on, two waves of competing algorithms 
emerged — a first wave around 2010, including many 
psychometrically-influenced algorithms such as Performance 
Factor Analysis (PFA; [17]) and a second wave in the mid-to-late 
2010s based on neural networks, including Deep Knowledge 
Tracing (DKT; [19]) and Dynamic Key-Value Memory Networks 
(DKVMN; [26]). Work over the last decade has shown that variants 
of BKT and PFA that take individual differences and timing into 
account perform better [9, 15, 25]. The current wave of algorithms 
based on neural networks, such as DKT and DKVMN, have 
reported further improvements to model fit [12, 26]. 


The comparisons between these algorithms have generally focused 
on metrics comparing overall success at predicting on later items, 
within the learning system applied to held-out students. In these 
comparisons, multiple large data sets are typically used, but 
performance is considered evenly across the data set. However, 
there are some reasons to think this may be a concerning practice. 
For one thing, even though the data sets used are typically large, 
these papers generally do not report if samples are large for all skills. 
Coetzee [4] notes that BKT parameter estimation is more precise 
for larger data sets than smaller data sets. Furthermore, Gervet [10] 
concluded that algorithms based on logistic regression, such as PFA, 
tend to underfit large datasets, while deep learning based 
algorithms, like DKT, tend to overfit larger datasets. 


More concerningly, many data sets used in student modeling have 
skills which have only been encountered once or twice by many 
students, either due to stop-out [3] or rarely-tagged secondary skills. 
Slater and Baker [22] suggest that BKT models cannot be reliably 
fit unless there is a sufficiently large pool of students who have at 
least three opportunities to practice each skill. As such, large 
proportions of existing data sets may reflect a seeming special case. 
Indeed, accurate prediction on these items likely reflects something 
different than accurate prediction after a student has had more 
practice. When a student has not yet worked on a skill, predicting 
their performance at this point represents what is referred to as a 
“cold start problem” — needing to perform well before having 
sufficient data for the current student [24]. It is possible that some 
more recent algorithms may perform better in these situations than 
earlier algorithms, either by using information from the student’s 
performance on other skills or information on the difficulty or other 
properties of specific items. However, this better performance may 
reflect something different than the student’s knowledge of the 
current skill being studied. As such, it may be meaningful to 
separate out cold start situations (for a given student and skill) from 
situations where the model has sufficient data to estimate the 
current skill by itself, when comparing KT algorithms. 


In this paper, we study how the performance of three KT algorithms 
changes, depending on how much data the algorithms have on the 
current student’s performance on the current skill. We compare the 
classic algorithms BKT and PFA to a more recent neural network-based 
algorithm, DKVMN, using the ASSISTments 2009-2010 
Skill Builder data [7]. Within each model, the predictive 
performance, determined by AUC ROC (Area Under the Receiver- 
Operating Characteristic Curve) and RMSE (Root Mean Square 
Error) was analyzed at students’ first through eighth encounter on 
a skill, reflecting the changes in model performance as students 
practice a skill more. We conclude with a discussion of the 
implications of our findings for both the evaluation and use of 
knowledge tracing models. 


2. METHODS 


2.1 Data 

In this study, we utilized the ASSISTments Skill Builder 2009- 
2010 dataset [7], using the updated version which represents an 
item requiring multiple skills as a single data point [23]. This 
specific dataset was chosen because it has clearly defined skills and 
because this dataset has frequently been used to compare KT 
models in previous research [11, 13, 14, 23, 27]. 


In the data preprocessing stage, we removed items not linked to any 
skill. Each student attempt was annotated with how many 
opportunities to practice the relevant skill(s) the student had 
encountered so far — i.e., the first instance means the learner is 
encountering a skill for the first time, the eighth instance indicates 
that the learner is encountering the skill for the eighth time. The 
resultant data set consisted of 4,151 students who attempted 16,891 
unique problems on 101 skills, resulting in 274,590 responses. 
While all the skills were included in model training, only the four 
most common skills are discussed below (see Table 1). 
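
The opportunity annotation described above can be sketched as a running count per student and skill. The function name and the tuple layout are hypothetical, and the attempts are assumed to be in chronological order.

```python
from collections import defaultdict


def annotate_opportunities(attempts):
    """Attach the practice-opportunity number to each attempt.

    attempts is an iterable of (student_id, skill_id) tuples in
    chronological order. The first time a student meets a skill is
    instance 1, the second time is instance 2, and so on.
    """
    seen = defaultdict(int)
    annotated = []
    for student, skill in attempts:
        seen[(student, skill)] += 1
        annotated.append((student, skill, seen[(student, skill)]))
    return annotated


# Toy log with hypothetical identifiers (not the ASSISTments data).
log = [("s1", "add"), ("s1", "add"), ("s2", "add"), ("s1", "sub"), ("s1", "add")]
annotated_log = annotate_opportunities(log)
```

In the toy log, student `s1` reaches a third instance on skill `add`, while the interleaved attempt on `sub` and the attempt by `s2` each start their own count at one.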


While using the ASSISTments platform, students have to 
correctly answer n problems in a row to achieve mastery of a skill 
(where n is set by the teacher, but is usually three) and can only 
then move on to another skill. Given the design of the platform’s 
three-in-a-row mastery learning approach, there is a drop in 
sample size as the number of instances increases (a common 
pattern in adaptive learning systems). There is also attrition due to 
stop-out, where students stop working on a problem set without 
mastering it [3]. Table 1 shows that across all four skills, the 
number of students encountering a specific skill n times decreased 
with instance. Of the four skills, an average 20% and 45% 
attrition rate is observed on the third and eighth instances, 
respectively. 


Table 1: Number of students per instance in each skill

Skill Name                                         1     2     3     4     5     6     7     8
Addition and Subtraction Fractions              1353  1066   978   920   836   756   692   625
Addition and Subtraction Integers               1226  1021   790   693   640   579   510   460
Conversion of Fractions Decimals & Percents     1225  1145  1121  1034   982   928   852   781
Equation Solving Two or Fewer Steps              961   877   857   821   795   745   722   690
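
The attrition figures quoted above can be checked directly from the counts in Table 1. The sketch below is only an illustration: it assumes that attrition at an instance is measured relative to the number of students at the first instance, and that the average is an unweighted mean over the four skills. The function names are hypothetical.

```python
# Student counts per instance (1..8), copied from Table 1.
counts = {
    "Addition and Subtraction Fractions": [1353, 1066, 978, 920, 836, 756, 692, 625],
    "Addition and Subtraction Integers": [1226, 1021, 790, 693, 640, 579, 510, 460],
    "Conversion of Fractions Decimals & Percents": [1225, 1145, 1121, 1034, 982, 928, 852, 781],
    "Equation Solving Two or Fewer Steps": [961, 877, 857, 821, 795, 745, 722, 690],
}


def attrition(row, instance):
    """Share of first-instance students no longer present at `instance` (1-based)."""
    return 1.0 - row[instance - 1] / row[0]


def mean_attrition(table, instance):
    """Unweighted mean of the per-skill attrition at the given instance (assumption)."""
    rates = [attrition(row, instance) for row in table.values()]
    return sum(rates) / len(rates)


if __name__ == "__main__":
    for k in (3, 8):
        print(f"instance {k}: mean attrition = {mean_attrition(counts, k):.1%}")
```

Under these assumptions the means come out at roughly 21% and 45%, in line with the approximate rates reported in the text.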


2.2 Model Construction

We constructed the following three knowledge tracing models
with the preprocessed ASSISTments 2009 dataset: BKT, PFA and
DKVMN. Each model was evaluated with 5-fold student-level
cross-validation. The dataset was split into five folds at the
student level; four folds were used to train the model, and the
trained model predicted students’ performance in the fifth fold.
Each fold acted as the test set once. Predictions in
the test sets were combined and used to compute AUC and RMSE
for each opportunity to practice, within each skill. For
comparability, the original skills were used to calculate
opportunities to practice rather than the new skills derived by
DKVMN. The folds were kept the same across models, reducing
the likelihood of randomly favoring one algorithm over another.
The metrics were averaged across the four skills in each instance
for each model.
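
A student-level split of this kind can be sketched as follows. This is a minimal illustration, not the code used in the study: the row layout (dicts with a 'student' key), the function names, and the seed are assumptions. The key property is that every response of a given student lands in the same fold, and that the same fold assignment can be reused for every model.

```python
import random


def assign_student_folds(student_ids, n_folds=5, seed=0):
    """Map each distinct student to one fold so no student spans train and test."""
    students = sorted(set(student_ids))
    random.Random(seed).shuffle(students)
    return {s: i % n_folds for i, s in enumerate(students)}


def split_by_fold(rows, fold_of, test_fold):
    """Split response rows (dicts with a 'student' key) into train and test lists."""
    train = [r for r in rows if fold_of[r["student"]] != test_fold]
    test = [r for r in rows if fold_of[r["student"]] == test_fold]
    return train, test


if __name__ == "__main__":
    # Toy responses; real rows would come from the preprocessed dataset.
    rows = [{"student": s, "skill": "A", "correct": (s + k) % 2}
            for s in range(10) for k in range(3)]
    fold_of = assign_student_folds(r["student"] for r in rows)
    for fold in range(5):
        train, test = split_by_fold(rows, fold_of, fold)
        overlap = {r["student"] for r in train} & {r["student"] for r in test}
        assert not overlap  # no student appears on both sides of the split
        print(fold, len(train), len(test))
```

Because the mapping depends only on the student identifiers and the seed, passing the same `fold_of` to each model keeps the folds identical across algorithms.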


BKT and PFA predict students’ success at each attempt based on 
their previous performance on the skill. When predicting a 
student’s success on the first attempt of a new skill, without 
having any prior data, the initial prediction made by BKT and
PFA reflects the overall student performance across the entire
(training) data set on that skill, instead of the individual student’s
knowledge level on the skill. By contrast, the deep learning model
DKVMN utilizes all of a student’s historical data and exploits the
underlying relationships between concepts. This transferability of
prediction across skills can be expected to give the algorithm an
advantage in making the initial predictions on a newly
encountered skill. In fact, [14] studied the effect of interaction
among skills in DKT, a closely related deep learning model, and
compared it to BKT. By comparing different approaches to
leveraging skill data, they concluded that DKT’s better performance
may be largely due to its use of a student’s performance on one
skill to predict performance on another skill, whereas skills are 
strictly separated in BKT. PFA occupies a middle ground, as skills 
do not directly influence each other, but their combinations in the 
training set may influence the model parameters found during 
fitting. 


The two widely studied deep learning algorithms DKT and
DKVMN utilize neural networks to discover underlying
relationships among skills and items when predicting student
performance. Because of this, both algorithms have shown
significant improvements in model fit compared to traditional
algorithms. However, DKT maps the relationships at the item level,
while DKVMN fits a skill model from scratch by considering the
relationships among skills and items. Given that the purpose of the
study is to understand whether transferring information between
skills influences a model’s accuracy during the first few
opportunities, DKVMN is a closer comparison to BKT and PFA
within the class of deep learning-based KT algorithms.


2.2.1 Bayesian Knowledge Tracing 

Bayesian Knowledge Tracing (BKT; [5]) inputs performance into 
a simple Markov model that is also a Bayesian Network [20]. To fit 
BKT, we applied BKT-Brute Force [1] to the data set with a floor 
of 0.01 for all probabilities and a ceiling of 0.3 for guess and slip to 
avoid model degeneracy [2]. The algorithm produced estimates
for guessing, slipping, initial knowledge, and learning transition 
probabilities for each of the skills, which were then used to predict 
the probability of success for each student on each opportunity to 
practice each skill. 
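
The way the four fitted probabilities turn into predictions can be sketched with the standard BKT equations. The sketch below assumes the usual formulation (predict, condition on the observed response, then apply the learning transition); the parameter values used in any call are made up, and the function names are hypothetical rather than taken from BKT-Brute Force.

```python
def bkt_predict(p_know, guess, slip):
    """Probability of a correct response given the current mastery estimate."""
    return p_know * (1.0 - slip) + (1.0 - p_know) * guess


def bkt_update(p_know, correct, guess, slip, learn):
    """Posterior mastery after one observed response, then the learning transition."""
    if correct:
        evidence = p_know * (1.0 - slip)
        posterior = evidence / (evidence + (1.0 - p_know) * guess)
    else:
        evidence = p_know * slip
        posterior = evidence / (evidence + (1.0 - p_know) * (1.0 - guess))
    return posterior + (1.0 - posterior) * learn


def bkt_trace(responses, p_init, guess, slip, learn):
    """Return the predicted P(correct) made before each response in the sequence."""
    p_know, predictions = p_init, []
    for correct in responses:
        predictions.append(bkt_predict(p_know, guess, slip))
        p_know = bkt_update(p_know, correct, guess, slip, learn)
    return predictions
```

Note that the first element of `bkt_trace` depends only on the skill-level parameters, not on the student’s responses, which is why the first-opportunity prediction is identical for every student on a given skill.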


2.2.2 Performance Factors Analysis 

Performance Factors Analysis (PFA; [17]) is a model that predicts 
learner performance using a logistic function that models changes 
in performance through learners’ success and failures within a skill. 
In this study, following the formulas in [17], the basin hopping 
algorithm was used to fit the model to obtain the optimal parameters. 
A set of parameters for success, failure and skill difficulty was 
derived for each skill, which were then used to compute the 
probability P(m) that the student would perform correctly, for each
student at each opportunity to practice each skill. 
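
As an illustration of the logistic form, the sketch below computes P(m) from per-skill parameters and a student’s prior success and failure counts, following the standard PFA formulation in which m sums an intercept, a success term, and a failure term over the skills tagged on the item. The parameter names (beta, gamma, rho) and the dictionary layout are assumptions for illustration; the values a caller passes in are not fitted values from this study.

```python
import math


def pfa_probability(skills, successes, failures, params):
    """P(m) for one attempt under PFA.

    skills: names of the skills tagged on the item.
    successes / failures: this student's prior counts per skill.
    params: per-skill dict with 'beta' (skill intercept), 'gamma' (weight per
            prior success) and 'rho' (weight per prior failure).
    """
    m = 0.0
    for skill in skills:
        p = params[skill]
        m += (p["beta"]
              + p["gamma"] * successes.get(skill, 0)
              + p["rho"] * failures.get(skill, 0))
    return 1.0 / (1.0 + math.exp(-m))
```

With no prior successes or failures, the prediction reduces to the logistic of the skill intercepts alone, so the first-opportunity prediction is the same for every student on that skill.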


2.2.3 Dynamic Key-Value Memory Networks 
Built on neural networks, Dynamic Key-Value
Memory Networks (DKVMN; [26]) employ two matrices that
capture states and the relationships between skill and student
mastery to predict performance on items and estimate mastery on a
set of automatically derived skills. We utilized code from Zhang et
al. [26] to implement the DKVMN model and used the set of
parameters that produced the optimal outcome for the
ASSISTments 2009 dataset in the study. The model outputs a
probability of success for each student at each problem.


3. RESULTS 
3.1 AUC Results 


Table 2 summarizes the average AUC results for each of the eight
opportunities to practice each skill and the combined AUC for
opportunities three through eight in the BKT, PFA, and DKVMN
models. The overall AUC across the first eight
opportunities is also reported for the four skills. Note that the
overall AUC only includes the four targeted skills in the first eight
attempts and therefore should not be considered to be the overall
AUC of the algorithm across the entire data set.


For the first eight instances, a general upward trend is observed in
AUC for all three models. At the first instance, the AUC
value for BKT is 0.49, PFA is 0.52, and DKVMN is 0.65. At this
point, the AUC value for the DKVMN model is much greater than
that of the other two models, by 0.13 (PFA) to 0.16 (BKT). Compared
to BKT and PFA, DKVMN is better at making the initial prediction the
very first time a student sees a skill. In fact, at this point, both BKT
and PFA are performing close to chance.


In the following instances, the AUC values of BKT and PFA moved
closer to that of DKVMN. In fact, by the fourth
instance, the models’ AUC values were fairly similar, with a
range of 0.65-0.70. From the fourth opportunity to the eighth, the
AUC values increased by 0.02 to 0.06 across skills. Performance
stayed similar between algorithms at this point, but DKVMN still
tended to achieve slightly higher performance. Across the 3rd-8th
opportunities, DKVMN averaged an AUC 0.02-0.05 higher than the
other two algorithms (0.70 versus 0.68 for BKT and 0.65 for PFA).
These trends can be seen in Figures 1-3.


Table 2: Average AUC values in each instance

Model Type     1     2     3     4     5     6     7     8   3-8   1-8
BKT         0.49  0.63  0.68  0.68  0.68  0.68  0.66  0.70  0.68  0.66
PFA         0.52  0.59  0.63  0.65  0.65  0.65  0.66  0.71  0.65  0.63
DKVMN       0.65  0.68  0.70  0.70  0.70  0.69  0.69  0.72  0.70  0.69


Figure 1: AUC results for BKT model across instances


Figure 2: AUC results for PFA model across instances


Figure 3: AUC results for DKVMN model across instances

3.2 RMSE Results

Table 3 summarizes the average RMSE results for each opportunity 
to practice the skills and the combined RMSE for the 3rd-8th
opportunities and the 1st-8th opportunities in the BKT, PFA, and
DKVMN models. Again, the RMSE reported in the table only 
considers the targeted four skills in the first eight opportunities. 


The RMSE demonstrates a downward trend across the first eight 
opportunities in all three models. As RMSE measures the 
difference between actual and predicted values, lower RMSE 
values indicate more accurate predictions. In the first instance, the 
RMSE value for BKT is 0.49, PFA is 0.51, and DKVMN is 0.47. 
As with AUC, the RMSE value for DKVMN is better than that of BKT
and PFA, indicating that DKVMN is better able to predict
student knowledge at the first attempt (0.02 better than BKT and
0.04 better than PFA).


In the following instances, the RMSE values of BKT and PFA moved
closer to that of DKVMN. In fact, by the fourth
instance, the models’ RMSE values were fairly similar, with a
range of 0.43-0.46. From the fourth opportunity to the eighth, the
RMSE values in all three models remained roughly the same across
skills. Across the 3rd-8th opportunities, DKVMN’s average RMSE
was similar to BKT’s and 0.02 lower than PFA’s (0.44 versus 0.44 and
0.46). These trends can be seen in Figures 4-6.


Table 3: Average RMSE values for all models in each instance

Model Type     1     2     3     4     5     6     7     8   3-8   1-8
BKT         0.49  0.46  0.44  0.44  0.44  0.44  0.45  0.44  0.44  0.45
PFA         0.51  0.48  0.46  0.46  0.47  0.46  0.48  0.46  0.46  0.47
DKVMN       0.47  0.45  0.44  0.43  0.44  0.44  0.45  0.43  0.44  0.45

Figure 4: RMSE results for BKT model in each instance


Figure 5: RMSE results for PFA model in each instance


Figure 6: RMSE results for DKVMN model in each instance


4. CONCLUSION AND DISCUSSION 


In the last few years, there has been an explosion of interest in new
variants of knowledge tracing that achieve higher predictive
performance using neural networks. However, this work has
generally not yet explored where and when these algorithms
perform better, and what the implications are for using these models
in practice. More specifically, previous evaluations have averaged
predictive performance across students’ entire learning history,
ignoring the difference between the earliest work and later work on
a skill.


In this study, we examined the performance of three KT models, 
BKT, PFA and DKVMN, across students’ history of work on 
specific skills, and compared how the three models differ in 
predictive accuracy during the earliest and later opportunities to 
practice each skill. With all eight opportunities considered together,
DKVMN outperformed BKT and PFA in both AUC and RMSE.
However, DKVMN’s better performance appears to be largely due
to its initial prediction on the first attempt on a skill, in which
DKVMN’s AUC was 0.16 higher than BKT’s and 0.13 higher than
PFA’s, and its RMSE was 0.02-0.04 better. After the first attempt, BKT
and PFA’s predictive performance improved substantially, and
model performance became closer across the three algorithms after
the third attempt, though DKVMN remained slightly better.


The results suggest that much of the difference in performance
between these algorithms is due to DKVMN’s ability to make more
accurate initial predictions by using factors other than mastery of
the current skill, such as past performance on other skills and other
students’ performance on the same item. In other words, a
substantial amount of the difference between algorithms appears to
be due to factors other than estimating, from the student’s
performance on the current skill, their mastery of that skill. This
may be especially true in datasets where students stop out on
specific skills [3], or where the skill model is added to or modified
after the system is built. In these cases, many student/skill
combinations may only occur once or twice, and relatively
better performance on the first attempt will disproportionately
improve the overall AUC and RMSE values of models such as DKVMN.
This raises the question of what the application is for having
better knowledge prediction the first time a student sees a new
skill. This type of
improvement in prediction may be useful to systems that decide
which skill a student should work on next (e.g., [6, 28]) but may be
less useful in systems that have a predefined order of skills for the
student to work on (e.g., [5, 8]), where the student does not move on
until they have demonstrated mastery of the current skill.


Given the difference in predictive performance between situations, 
it may be appropriate to separate out cold start situations (for a 
given student and skill) from situations where the model has 
sufficient data to estimate the current skill by itself, when 
comparing KT algorithms. Specifically, we propose that the 
calculation of predictive metrics should separate out the predictions 
on the initial two opportunities to practice each skill from the rest. 
Adopting this approach will increase our ability to interpret the 
difference between algorithms and understand how much better a 
specific algorithm will be for specific use cases. 
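
The proposed separation requires little extra code. The sketch below is one possible implementation under stated assumptions: each prediction is available as a tuple of (opportunity number, observed correctness, predicted probability), the cold-start boundary is the first two opportunities, and AUC is computed by direct pairwise comparison, which is simple but quadratic in the number of responses. All names are hypothetical.

```python
import math


def rmse(y_true, y_pred):
    """Root mean square error between observed outcomes and predictions."""
    return math.sqrt(sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true))


def auc(y_true, y_pred):
    """AUC as the probability that a random correct response is ranked above a
    random incorrect one (ties count one half); None if one class is missing."""
    pos = [p for t, p in zip(y_true, y_pred) if t == 1]
    neg = [p for t, p in zip(y_true, y_pred) if t == 0]
    if not pos or not neg:
        return None
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def split_metrics(records, cold_start_max=2):
    """records: iterable of (opportunity, correct, predicted_probability).
    Returns AUC and RMSE for opportunities <= cold_start_max and for the rest."""
    groups = {"cold_start": ([], []), "later": ([], [])}
    for opportunity, correct, pred in records:
        key = "cold_start" if opportunity <= cold_start_max else "later"
        groups[key][0].append(correct)
        groups[key][1].append(pred)
    return {k: {"auc": auc(y, p), "rmse": rmse(y, p) if y else None}
            for k, (y, p) in groups.items()}
```

Reporting the two groups side by side for each algorithm makes it visible whether an overall advantage comes from cold-start predictions or from later opportunities.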


Two limitations to the current analyses can be addressed in future
work. First, our recommendations may not be meaningful for all
learning systems where contemporary KT is used. Specifically, some
systems may not have skill models at all, and may never intend to
make inferences at the level of interpretable skills. Although these
systems typically use an entirely different family of KT models (e.g.,
[16, 21]), our recommendations would not be relevant in these
cases. Second, we have only investigated these issues in the context
of a single system and a set of skills for which there is extensive
data, and for three algorithms; the generalizability of the findings
presented here should be further investigated, using data from other
learning systems where, for instance, the granularity of the skills
differs. However, only limited effort is needed to separate out
practice on early learning opportunities from later learning
opportunities when calculating model AUC/RMSE. Therefore, it
may be warranted to adopt this approach and see whether practical
differences are also found for other contexts and algorithms.


Overall, we find initial evidence that one key factor leading to
better performance for DKVMN compared to earlier algorithms is
its performance in situations before a student has had significant
opportunity to work on a skill. This result leads to
recommendations on how to better evaluate KT algorithms and
suggests that the benefits of this algorithm may be greater for some
applications (deciding which skill a student should work on next)
than others (deciding if a student has reached mastery in the current
skill they are working on). Based on the results of this study, future
research involving KT models may find it useful
to calculate performance separately for a student’s initial
performance and their later performance on a skill; this would
provide researchers with more information on how their models are
working, and where their greatest benefits and potential are.
ABSTRACT


Computer-based learning environments offer the potential 
for automatic adaptive assessments of student knowledge 
and personalized instructional policies. In prior work, we
introduced an individualized Bayesian model to dynamically
assess a student’s knowledge, based on observed response
times and response accuracy. In this paper, we leverage
that model as an instructional stopping policy to determine
when to stop the assessment. We evaluate several criteria
based on the change of performance measures as questions 
are presented. These include the mean assessment level and 
the Kullback-Leibler divergence. Student performances are 
simulated considering their sensitivity to the prior belief for 
mastery over different educational cases. Our results indi- 
cate which criteria offer an efficient assessment, a confident 
assessment, and which can effectively handle wheel-spinning 
students. 


Keywords 

Bayesian Adaptive Mastery Assessment; stopping policy; 
individualization; empirical analysis; performance model; 
mastery criteria 


1. INTRODUCTION 


In adaptive learning systems, mastery is measured as a stu- 
dent performs a skill and demonstrates knowledge by solving 
a sequence of questions that tap that skill. Learner models 
that rely on the mastery learning theory are widely used 
in various personalized adaptive learning systems to infer 
student mastery sequentially. 


In a mastery learning framework, ‘under-practicing’ and
‘over-practicing’ are two common pitfalls that cause students to
face a practicing or testing burden rather than focus on the
skill at their level [1, 2]. This might cause demotivation
and low engagement [3, 4, 2, 1, 5, 6]. In particular, students
trapped in a mastery assessment cycle are referred to
as wheel-spinning students [7, 3, 8, 9]. They are consistently
unable to reach the mastery-success criterion set for
the skill, which triggers the system to present even more
items. In our previous paper, we proposed Bayesian Adaptive
Mastery Assessment (BAMA), a framework we created
to assess a student individually on a single skill given an
explicit mean success criterion. From an educational
perspective, it can be used as a criterion-referenced assessment
to assure mastery [10, 6, 11, 12, 13]. We evaluated the utility
function of BAMA as a when-mastery-is-attained policy
and showed that it recovers the true mastery accurately and
efficiently, i.e., with few responses. However, this strategy is
not sufficient as it assumes that all students at some point
will reach that criterion [7, 2, 3, 4, 14, 9].


In this paper, we therefore evaluate the impact of the utility
function of BAMA as a stopping policy. We design implicit
stopping criteria and provide an empirical analysis
considering the variance of practice length across simulated
student performances. We demonstrate that the developed
policy delivers meaningful results and identifies any student
profile, including wheel-spinners.


2. RELATED WORK 


Student profiles aim to portray the individual performance
of each learner. Based on the response times of student
performances, the learning sciences distinguish struggling
fluent students from fluent students, as the latter provide correct
responses with short response times [15]. An individual who has not
yet acquired the skill and will not demonstrate successful
performance is commonly modeled as having a low probability
of a correct response [8, 2, 7, 9, 16]. These students
are termed wheel-spinning students [7, 3, 8, 9] and have
been linked with long response times [8].


An instructional policy, also known as a stopping policy,
determines the total length of the assessment when a
pre-specified stopping criterion accompanies the model. The
criteria are divided into two categories: (i) an explicit threshold
set on a statistic of the mastery estimator, known as a mastery
success criterion, and (ii) an implicit threshold set on the
size of change of a statistic of the mastery estimator. The
former framework is typically referred to as a
when-mastery-is-attained policy, and the latter as a when-to-stop
policy, as it stops independent of whether the student has mastered
the skill [7].


Substantial research efforts have focused on the impact of 
a learner model concerning the total number of questions 
it administers. Machine learning models were designed to 
detect wheel spinner performance [9]. Frameworks of in- 
structional policies [7, 14] and metrics [5] were proposed 
for an evaluation of well-known prediction models on the fi- 
nal proposed length. Other work specified a framework for 
a conceptual interpretation over the stopping criterion [3]. 
Typically, these models assume a homogeneous class of stu- 
dents and they consider solely response accuracy. Previous 
research has shown that individualized models lead to signif- 
icantly different policies [4] and highlighted the importance 
of response times in stopping policies [9, 12, 17, 18, 13, 10, 
19, 2]. 


3. MODEL AND STOPPING POLICY 


Below we briefly discuss the assessment model, the stopping 
criteria we consider, and the steps of our experiment. 


3.1 Bayesian Adaptive Mastery Assessment 
In the BAMA model, a student has a constant mastery level
Z on a single skill, which is the product of two independent
random variables, the response time T ~ Exponential(λ)
and the accuracy P ~ Bernoulli(θ). We denote by τ the
maximum response time. The score Z is close to 1 when
a student answers correctly and relatively fast with respect
to τ. The value of Z becomes zero when a student answers
incorrectly, or when the response time exceeds τ. That is
operationalized as follows: Z = P · (1 − T/τ).


To keep the formulation tractable, we adopt a Bayesian
approach to estimate the true unknown parameters θ and λ of
a student. We model θ by a Beta(α, β) distribution, and λ
by a Gamma(η, γ) distribution. This represents the prior
distribution over the unknown parameters (θ, λ), denoted as
p_0, as an initial belief over a student’s mastery. The model
updates the belief to a posterior distribution p over these
parameters using Bayes’ rule. As more responses become
available, the posterior distributions of the accuracy
(the Beta distribution) and the response time (the Gamma
distribution) become more centered and peaked around the
true values of θ and λ. However, these true values are not
known in practice and need to be estimated from the
observations received over the assessment.
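
The Bayesian updating step can be illustrated with the standard conjugate updates for these two likelihoods. This is a sketch under assumptions that go beyond the text: the Gamma prior is taken in its shape-rate parametrization, response times are treated as untruncated Exponential draws, and the function names are hypothetical. It is not the BAMA implementation itself.

```python
def update_accuracy(alpha, beta, correct):
    """Beta(alpha, beta) prior on theta plus one Bernoulli response gives a Beta posterior."""
    return (alpha + 1, beta) if correct else (alpha, beta + 1)


def update_speed(shape, rate, response_time):
    """Gamma(shape, rate) prior on lambda plus one Exponential(lambda) time gives a Gamma posterior."""
    return shape + 1, rate + response_time


def posterior_means(alpha, beta, shape, rate):
    """Posterior mean of theta and posterior mean of lambda."""
    return alpha / (alpha + beta), shape / rate


if __name__ == "__main__":
    alpha, beta = 1.0, 1.0   # hypothetical Beta prior on accuracy theta
    shape, rate = 1.0, 1.0   # hypothetical Gamma prior on speed lambda
    observations = [(True, 2.5), (True, 1.0), (False, 6.0)]  # (correct, response time)
    for correct, seconds in observations:
        alpha, beta = update_accuracy(alpha, beta, correct)
        shape, rate = update_speed(shape, rate, seconds)
    print(posterior_means(alpha, beta, shape, rate))
```

Each response adds one pseudo-observation to both posteriors, which is why they become more peaked as the assessment proceeds.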


3.2 Stopping Criteria 

Each policy is concerned with a different aspect of the
estimated Z-score and adopts a different stopping rule. We
employ the change of a point estimate and the change of the
distribution. These are computed from the change
observed between consecutive pairs of responses over the
sequence. For the analysis and the evaluation of a policy, the
following four properties are typically considered [20, 7]: 1)
number of administered items, 2) number of non-stopping
situations, 3) accuracy with regard to the true value, 4)
uncertainty of the experiment and of the model.


The derivative-based stopping rule considers the reduction
of changes observed between consecutive pairs of responses
as measured with a pre-specified sample statistic of the Z
distribution. To put this formally, let Δf_i = f_{i−1} − f_i for any
function f. Then, our policy proposes to stop after response
i when the following decision rule holds:


$$|\Delta h_{i-1}| < \epsilon \;\wedge\; |\Delta h_i| < \epsilon, \qquad (1)$$


where h_i denotes the value of a sample statistic of the
distribution Z after the i-th observation, such as the mean,
variance, or any other function. Evaluating the rule requires a
sequence of at least three responses, from which two
consecutive changes are computed. As with all implicit stopping rules,
the threshold value ε inevitably affects
the length of the assessment: the smaller ε is, the longer
the assessment becomes. This rule is a special case of the
probabilistic stopping rule proposed in [7], which does not directly
generalize to our model.
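
The decision rule can be sketched as a small function over the sequence of statistic values. The indexing is an assumption made for illustration: h[0] holds the statistic after the first response, so the earliest possible stop is after the third response, once two consecutive changes exist. The function name is hypothetical.

```python
def stopping_index(h, eps):
    """Return the 1-based response index at which the rule first holds, or None.

    h[k] is the value of the chosen statistic after response k + 1.
    The rule stops at response i when |h_{i-2} - h_{i-1}| < eps and
    |h_{i-1} - h_i| < eps, i.e. when two consecutive changes are both small.
    """
    for j in range(2, len(h)):
        if abs(h[j - 2] - h[j - 1]) < eps and abs(h[j - 1] - h[j]) < eps:
            return j + 1
    return None
```

For a sequence whose changes shrink over time, a smaller threshold yields a later stopping index, and a threshold that is never met within the available responses yields `None`, which corresponds to a non-stopping situation.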


In our first experiment, we leverage the derivative-based rule
by considering the change of the posterior mean from the
prior mean. Point-based estimates from sample statistics
are all informative metrics that can be employed. However,
other statistics may describe the information
in a distributional score in a way that better accommodates
a balanced assessment length. Therefore, a more elegant
solution would be to calculate a metric that considers the whole
distributional information obtained for Z at once.


We compute the second rule based on the reduction of
divergence between two consecutive distributions of responses,
the starting prior Z_{i−1} and the updated posterior Z_i, after
item i has been administered. We formulate this with the
Kullback-Leibler (KL) divergence D_KL, as follows:


$$D_{KL}(Z_{i-1} \,\|\, Z_i) = \int z_{i-1}(x) \log\frac{z_{i-1}(x)}{z_i(x)} \, dx. \qquad (2)$$


The quantity z_i(x) describes the density of the distribution
Z_i at response i evaluated at x.
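
A KL divergence between two consecutive densities can be approximated numerically on a grid. The sketch below is illustrative only: it uses a midpoint Riemann sum, and it uses two Beta densities as stand-ins for consecutive distributions, since the actual density of Z in BAMA is not reproduced here. The function names and the grid size are assumptions.

```python
import math


def kl_divergence(p, q, dx):
    """Riemann-sum approximation of the integral of p(x) * log(p(x) / q(x)) dx.

    p and q are density values on the same evenly spaced grid with step dx.
    Grid points where p is zero contribute nothing; q must be positive wherever p is.
    """
    total = 0.0
    for pi, qi in zip(p, q):
        if pi > 0.0:
            total += pi * math.log(pi / qi) * dx
    return total


def beta_pdf(x, a, b):
    """Density of a Beta(a, b) distribution at 0 < x < 1 (stand-in density)."""
    log_norm = math.lgamma(a + b) - math.lgamma(a) - math.lgamma(b)
    return math.exp(log_norm + (a - 1) * math.log(x) + (b - 1) * math.log(1 - x))


if __name__ == "__main__":
    n = 2000
    xs = [(k + 0.5) / n for k in range(n)]      # midpoints avoid x = 0 and x = 1
    before = [beta_pdf(x, 2, 2) for x in xs]    # hypothetical belief before an item
    after = [beta_pdf(x, 3, 2) for x in xs]     # hypothetical belief after a correct item
    print(kl_divergence(before, after, 1.0 / n))
```

The divergence is zero when the two densities coincide and grows as an observation shifts the belief, so a stopping rule can monitor how quickly it shrinks across consecutive items.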


3.3 Simulated performance profiles 

A student is characterized by the pair (6,) for their per- 
formance. For the exposition of our purpose, we take four 
equidistant intervals of Z defined as: mastered or fluent (Z € 
[0.75—0.95]), accurate or struggling fluent (Z € [0.5—0.74]), 
undetermined or average (Z € [0.2 — 0.49]), wheel-spinning 
(Z € [0—0.19]). Then, we arbitrarily draw a specific pair of 
(8, A) corresponding to the Z score from each interval. Par- 
ticularly, we illustrate the following levels: mastered with 
high accuracy and short response times (6 = 0.9,A = 1) > 
Z = 0.85 , accurate with high accuracy and long response 
times (9 = 0.9,A = 0.1) > Z = 0.50, undetermined with 
(@ = 0.5,A = 0.5) > Z = 0.46, and wheel-spinning with 
(9 =0.1,A =0.1) — Z = 0.08. 


4. RESULTS 


We evaluate our stopping criteria through simulated student
performances. In practice, this translates to n observations
of responses x_1, ..., x_n according to the student profile (θ, λ)
and prior p_0. We update the prior distribution as
observations arrive, i.e., p_i based on p_{i−1} and x_{i−1}. This allows
us to collect statistics on the Z-score for each administered
question. We repeat this 1,000 times to get accurate results
for the statistics. To ensure that the practice length is not
highly sensitive to the choice of the prior belief p_0, we
simultaneously consider two priors: an optimistic view, denoted
as p_o, assuming a student who has mastered the skill, and
a pessimistic view, denoted as p_p, assuming a student who
has not yet mastered the skill. Considering a fixed maximum
number of responses n and updating p_o and p_p simultaneously
additionally balances the efficiency and certainty
of the assessment.


Figure 1: The size of the change between consecutive estimated expected values of Z for a prior p_p = 0.3 and a prior p_o = 0.7.


Figure 2: The KL divergence between consecutive estimated distributional scores for a prior p_p = 0.3 and a prior p_o = 0.7.


We perform our experiments according to the above
procedure for each student profile (θ, λ) and prior p_0 for a
sequence of length n = 20, similarly to previous research [7,
3, 9]. We set symmetric values of priors as p_o = 0.7 and
p_p = 0.3. The value of the maximum permitted response
time is arbitrarily set to τ = 20, and the fastest answer to
λ = 1.


4.1 Change of the posterior predictive mean 

Figure 1 shows the derivative rule described in Equation (1)
implemented for the posterior mean μ̂. In particular, the
magnitude of change Δμ̂_i is depicted over consecutive
responses i across the student profiles (θ, λ). The response
interval at which Δμ̂_i no longer changes is indicated
by the converging lines.


Intuitively, one would expect the algorithm to
propose more questions to wheel-spinning and undetermined
students than to mastered students. However, that is
not the case when our starting belief, p_p = 0.3, is closer
to the true posterior. Instead, the mastered students are
asked to provide more responses. The situation is
reversed when we start with the optimistic prior p_o.


Second, the algorithm adjusts quickly to the student’s
practice despite the presence of a non-representative prior. To
illustrate this, take the mastered student and
the same number of items, e.g., the first three questions. When
we start with a representative prior for the student, in this
case p_o, the reduction of the change is half as large
as the reduction of the change observed when we
start with the non-representative prior, p_p.


4.2 Statistical divergence between consecutive distributional scores

Figure 2 shows the divergence of the estimated distribution
D_KL(i) described in Equation (2). Wheel-spinners have D_KL(i) > 0,
in contrast to the mastered students, who have D_KL(i) < 0.
This can be attributed to the prior under- or overestimating
the Z score.


Table 1: Analysis and evaluation of stopping policies per profile, criterion, prior and threshold.

Each cell gives the assessment length followed by (SE, σ, relative error %).

Stopping rule             Mastered                 Accurate                 Undetermined             Wheel-spin
Δμ̂ (p_o, stricter ε)      5: (0.0, 0.2, 2.84)      7: (0.0, 0.33, 8.67)     8: (0.01, 0.4, 16.76)    12: (0.0, 0.27, 135.04)
Δμ̂ (p_o, lenient ε)       4: (0.0, 0.19, 3.49)     5: (0.0, 0.32, 9.99)     4: (0.01, 0.36, 26.62)   7: (0.0, 0.3, 217.43)
Δμ̂ (p_p, stricter ε)      11: (0.0, 0.31, 11.04)   7: (0.0, 0.35, 5.43)     8: (0.01, 0.4, 6.93)     8: (0.0, 0.23, 81.34)
Δμ̂ (p_p, lenient ε)       9: (0.0, 0.32, 12.66)    4: (0.01, 0.35, 9.55)    3: (0.01, 0.36, 14.4)    5: (0.0, 0.25, 119.89)
ΔD_KL (p_o, stricter ε)   6: (0.0, 0.2, 1.92)      12: (0.0, 0.34, 5.76)    8: (0.01, 0.4, 16.76)    7: (0.0, 0.3, 217.43)
ΔD_KL (p_o, lenient ε)    4: (0.0, 0.19, 3.49)     8: (0.0, 0.33, 7.11)     5: (0.01, 0.38, 21.53)   6: (0.0, 0.31, 242.25)
ΔD_KL (p_p, stricter ε)   12: (0.0, 0.31, 10.39)   14: (0.0, 0.35, 4.1)     19: (0.0, 0.43, 2.56)    6: (0.0, 0.24, 95.95)
ΔD_KL (p_p, lenient ε)    7: (0.0, 0.32, 15.77)    8: (0.0, 0.35, 4.94)     4: (0.01, 0.38, 11.09)   6: (0.0, 0.24, 95.95)


The results for D_KL(i) are consistent with those for the posterior
mean μ̂. We observe a shorter length between two responses when
the prior is representative of the posterior.


4.3 Analysis and evaluation of the policies 
Table 1 reports the results of the implemented stopping
policies. For each student profile and stopping rule, as presented
by the columns and rows, we find the number of items at
which each rule proposes to stop and the variance of the
assessment length for different profiles. For each stopping
criterion, the row label gives the prior distribution p_0 and
the threshold ε.


For Δμ̂ and the optimistic prior, the simulated students need
to solve at most 5-12 questions, whereas for Δμ̂ and the
pessimistic prior, the simulated students need to solve at most
3-11 questions, depending on the chosen threshold.
Considering a single prior, the optimistic one performs better
across all students compared to the pessimistic one. Those
policies are depicted in bold letters.


For ΔD_KL and the optimistic prior, the simulated students
need to solve at most 4-12 questions, depending on the
threshold value. For ΔD_KL and the pessimistic prior, there
is a chance of a non-convergent policy. That holds for the
undetermined student, as the policy converges only at the
end. This is depicted with the italic letters in the table.


The assessment length is short when the prior is close to
reality. This is depicted for the lenient threshold, e.g., in
the case of p_o for a mastered student and p_p for an
undetermined student. Therefore, we satisfy both priors
simultaneously. In that case, the maximum number of questions
is 9 for Δμ̂. We get the same estimate of items with almost
the same uncertainty for both thresholds. Hence, we argue
that a shorter assessment length is preferred. It also shows
that the policy is less dependent on the value of ε.


The results of the lenient-threshold stopping policies of ΔD_KL
and Δμ̂ show that we achieve an efficient assessment
for both priors across all student performances. Satisfying
both priors yields an efficient length, considering that
in criterion-referenced assessments at least n = 4 responses
are required to estimate the mastery of a single skill.
Furthermore, we observe that using both priors results in more
efficient assessment of wheel-spinning students. In the
environment we have simulated, we see that one metric is
preferred over the other under a certain objective. To
achieve efficiency for mastered students and wheel-spinners, the KL
divergence can be used. When the objective is shifted towards
efficiency among the average profiles, the mean could be
a more appropriate metric. This does not generalize to other
settings.


5. CONCLUSIONS 


To conclude, we analyzed the performance of different stop- 
ping policy rules for the utility function of the BAMA frame- 
work. The stopping policy is constructed using both the 
pessimistic and the optimistic prior for the assessment with 
a maximum length of n = 20. This has several advantages: 
fluent students will be picked up by the optimistic prior, 
wheel-spinners by the pessimistic prior, and the other two 
profiles by either one of the prior distributions. Consistent 
behavior was found between the two criteria. Furthermore, 
the lenient threshold is favored in both criteria. The mean assessed
mastery level (i.e., $\Delta\hat{\mu}$) stopping criterion slightly
outperformed the divergence of assessed mastery level (i.e.,
$\Delta D_{KL}$). The evaluation of the stopping policies is based on
three properties: fewer items, no non-convergent cases, and a low
relative percentage approximation error with high certainty. The
simulated data has features
that we modelled explicitly. As future work, we plan to eval- 
uate the stopping policies in real-world scenarios with real 
data and provide a way to represent the average response 
time and the average response accuracy of the student per- 
formance. 
ABSTRACT


Careless responding and keeping students motivated for different 
tests have been common problems in many areas, especially in 
education. This study’s objective was to demonstrate a novel 
approach to detect careless responding using person-fit indices 
developed within the field of psychometrics combined with a 
random forest. The data used was obtained from various tests in 
the Math Nation virtual learning platform. The results of person-fit
indices, as previously used measures of careless responding, and the
results of a random forest classifier for capturing careless
responding were compared using Receiver Operating Characteristic
(ROC) analysis and the area under the curve (AUC). The results
showed that the random forest combined with person-fit indices
outperformed the person-fit indices used directly in detecting careless
responding. Some important applications of this method for
applied researchers are discussed in the conclusion section. 
Virtual learning environment


1. INTRODUCTION 


Paulhus (1991) defined response bias as “a systematic tendency 
to respond to a range of questionnaire items on some basis other 
than the specific item content.” Response biases are particularly 
important as they become a threat to the validity of conclusions 
and lead to measurement error (e.g., Meade & Craig, 2012). Three 
forms of biases frequently addressed in psychological and 
educational studies are acquiescence, careless responding, and 
social desirability bias (e.g., Cheung & Rensvold, 2000; Leite & 
Nazari, 2020; Wise, 2015). In formative assessments administered 
within virtual learning environments (VLE; Weller, 2007), it is 
hard to keep students motivated across different tests. Lack of 
motivation to complete formative assessments may result in 
careless responding, which in turn may produce biased estimates 
of student ability. This study focuses on careless responding, 
which occurs when students do not put much effort and thought 
into answering an assessment item (Voss & Vangsness, 2020). 
Wise and Kong (2005) have shown that careless responding can 
be associated with disengagement in an assessment. Other 
common names frequently used for this type of responding are
non-effortful responding, inattentive responding, rapid guessing 
behavior, and examinee motivation (e.g., Rios & Soland, 2020; 
van Barneveld, 2007; Wise & DeMars, 2010; Wise, 2015; Wise and 
Kong, 2005). Throughout this paper, we use the “careless 
responding” label. 


Researchers have proposed several different methods to identify 
careless responding. Attention check items including directed 
response items (e.g., “Please answer ‘disagree’ when responding 
to this item”), bogus items (e.g., “I do not understand a word of 
English,” Meade & Craig, 2012), maximum longstring index, 
even-odd consistency, and Mahalanobis distance have been 
utilized to capture carelessness (e.g., Voss & Vangsness, 2020; 
Niessen et al., 2016). Self-report surveys have also been
administered after the main test (e.g., Voss & Vangsness, 2020;
Niessen et al., 2016; Wise, 2015), and response time effort and
different types of time thresholds (Wise, 2017; Wise, 2015; Rios &
Soland, 2020) have been computed. Other researchers (e.g., Niessen et
al., 2016; Patton et al., 2019) have used person-fit indices such as 
the number of Guttman errors and the standardized 
loglikelihood to detect careless responding. Since there is a wide 
range of available person-fit indices in the literature and each 
index captures misfitting persons from a different perspective, 
we opted for investigating person-fit indices. 
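
As an illustration of what one such index measures, the number of Guttman errors can be sketched in a few lines of Python. This is a minimal sketch based on the usual definition (a pair of items where the easier item is answered incorrectly and the harder one correctly); it is not the PerFit implementation, and the function name and the use of sample proportions correct to order the items are assumptions made for illustration.

```python
def guttman_errors(responses, proportions_correct):
    """Count Guttman errors in one dichotomous (0/1) response vector.

    responses[i] is the student's score on item i, and
    proportions_correct[i] is the share of the sample answering item i
    correctly (higher means easier). Both names are illustrative.
    """
    # Order item positions from easiest to hardest.
    order = sorted(range(len(responses)),
                   key=lambda i: -proportions_correct[i])
    errors = 0
    wrong_on_easier = 0  # incorrect answers seen so far on easier items
    for i in order:
        if responses[i] == 1:
            # A correct answer pairs with every easier item answered wrong.
            errors += wrong_on_easier
        else:
            wrong_on_easier += 1
    return errors


difficulty_order = [0.9, 0.8, 0.6, 0.4, 0.2]
consistent = guttman_errors([1, 1, 1, 0, 0], difficulty_order)
aberrant = guttman_errors([0, 0, 1, 1, 1], difficulty_order)
```

A response vector that follows the difficulty ordering yields no errors, while a vector that misses the easy items and solves the hard ones yields many, which is the kind of misfit a careless responder may produce.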


Several nonparametric and parametric person-fit statistics (see
Table 1) that detect response biases can be employed to detect
carelessness (e.g., Karabatsos, 2003). Most of these indices were
developed to target dichotomous items, and some of them have 
been extended to test polytomous items. Under the item 
response theory (IRT) framework, parametric person-fit indices 
assess the model fit at the individual level to examine the 
meaningfulness of obtained test scores (Embretson & Reise, 
2013). In a way, the consistency of individuals’ item response 
vectors is examined based on an IRT model by these indices 
(Embretson & Reise, 2013), which is the main concern when 
attempting to flag particular individual responses that may have 
been careless. 


Although previous research on using person-fit indices to detect 
careless responding has provided promising results (Karabatsos, 
2003), the current study attempted to improve the performance 
of person-fit indices by using multiple indices simultaneously. 
This is similar to considering multiple predictors in the model, 
which is possible by machine learning classifiers. Therefore, the 
present study sought to enhance capturing careless responding 
by using random forest (Breiman, 2001; Fernandez-Delgado et
al., 2014). The study’s research question is: Does the use of
person-fit indices as features in a random forest to detect 
careless responding to items in formative assessments of a VLE 
outperform the use of person-fit indices by themselves? This 
research question was answered with an analysis of responses to
multiple 10-item formative assessments within the Math Nation 
VLE (Lastinger Center for Learning & University of Florida, 
2019). 


Table 1: Person-fit indices used from the PerFit package

Person-fit index                    | Function | Author
Nonparametric
Personal point-biserial correlation | r.pbis   | Donlon & Fischer (1968)
Caution statistic                   | C.Sato   | Sato (1975)
Modified caution statistic          | Cstar    | Harnisch & Linn (1981)
Number of Guttman errors            | G        | van der Flier (1977)
Normalized Guttman errors           | Gnormed  | van der Flier (1977)
Agreement statistic                 | A.KB     | Kane & Brennan (1980)
Disagreement statistic              | D.KB     | Kane & Brennan (1980)
Dependability statistic             | E.KB     | Kane & Brennan (1980)
U3 statistic                        | U3       | van der Flier (1980)
Standardized normal U3              | ZU3      | van der Flier (1982)
Norm conformity index               | NCI      | Tatsuoka & Tatsuoka (1982, 1983)
HT statistic                        | Ht       | Sijtsma (1986), Sijtsma and Meijer (1992)
Parametric
Standardized normal loglikelihood   | lz       | Drasgow et al. (1985)
Corrected lz                        | lzstar   | Snijders (2001)



1.1 Theoretical Framework 

A comparison of 36 person-fit indices was performed by 
Karabatsos (2003) to examine the strength of fit indices to detect 
five types of aberrant responding (cheating, careless responding, 
lucky guessing, creative responding, and random responding). 
Results showed that Ht (Sijtsma, 1986) and then U3 (Van Der 
Flier, 1982) person-fit indices produced the highest area under 
the curve (AUC) to detect all types of aberrant responding, and 
other indices such as Guttman errors (Meijer, 1994) and lz 
(Drasgow, Levine, & Williams, 1985) indicated acceptable AUC. 
Additionally, in a simulation study, Artner (2016) investigated 
five well-known indices to detect guessing, cheating, careless 
behavior, distorting, and fatigue in responses and found that Ht, 
Cstar, and U3 performed better than OUTFIT and INFIT. 
Recently, another comparative study (Beck, Albano & Smith,
2018) used response time and the aforementioned person-fit
statistics to detect inattentive responding. Again, Ht was found
to have the highest AUC. However, a major limitation of using
person-fit indices for detecting careless responding to formative
assessment items is that they only classify the entire response
vector for a quiz as careless or not, whereas it may be that only
part of the responses to an assessment was careless, such as the
last few items due to respondent fatigue. To overcome this
limitation, we propose the random forest approach that utilizes
person-fit indices as predictors of careless responding.


Over the last decade, as educational datasets have become larger,
alongside substantial increases in computation speed,
researchers have shown interest in using more complex machine
learning classifiers. The random forest was first introduced by
Breiman (2001) to overcome the problems of boosting (i.e., fitting
trees to reweighted data; e.g., Schapire et al., 1998) and
bagging (i.e., bootstrap aggregating: fitting trees to
bootstrap-resampled data; Breiman, 1996) by adding another layer of
randomness to bagging. The advantage of the random forest is that it
chooses the best predictor among a subset of randomly selected
predictors for splitting a node, while in a standard tree, the best
predictor is chosen among all variables (Liaw & Wiener, 2002).
By using this strategy, the random forest is robust against
overfitting, and it outperforms other classifiers, such as
discriminant analysis, support vector machines, and neural
networks in some situations (Breiman, 2001).
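
The node-splitting idea described above can be illustrated with a short Python sketch. It is a simplified illustration and not the randomForest package's implementation: the helper names, the Gini impurity criterion, and the `mtry` argument name (borrowed from common random forest terminology) are assumptions made for this example.

```python
import random


def gini(labels):
    """Gini impurity of a list of 0/1 labels."""
    if not labels:
        return 0.0
    p = sum(labels) / len(labels)
    return 2 * p * (1 - p)


def best_split(rows, labels, candidate_features):
    """Best (feature, threshold, impurity) among the candidate features."""
    best = None
    for f in candidate_features:
        for t in sorted({row[f] for row in rows}):
            left = [y for row, y in zip(rows, labels) if row[f] <= t]
            right = [y for row, y in zip(rows, labels) if row[f] > t]
            if not left or not right:
                continue  # a split must send cases to both children
            score = (len(left) * gini(left)
                     + len(right) * gini(right)) / len(labels)
            if best is None or score < best[2]:
                best = (f, t, score)
    return best


def random_forest_split(rows, labels, mtry, rng):
    """Search only a random subset of `mtry` predictors at this node."""
    candidates = rng.sample(range(len(rows[0])), mtry)
    return candidates, best_split(rows, labels, candidates)


rows = [(1, 5), (2, 3), (8, 4), (9, 6)]
labels = [0, 0, 1, 1]
standard_tree = best_split(rows, labels, [0, 1])   # all predictors
forest_node = random_forest_split(rows, labels, 1, random.Random(0))
```

A standard tree always finds the globally best predictor at a node, whereas the forest node is restricted to the sampled subset, which decorrelates the trees that are later aggregated.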


One major advantage of combining the random forest with 
person-fit indices to detect careless responding is that
classification of responses as careless or not can be done at the 
item response level rather than the person-level. Therefore, for 
the vector of responses a student provides for a formative 
assessment, only some responses may be classified as careless. In 
other words, carelessness can be viewed less as a static person 
feature on a test and more as a feature of the interaction between 
a particular person and a particular assessment item. Another 
advantage of the proposed method is that it can simultaneously 
use multiple person-fit indices, allowing optimal use of each 
index’s unique sensitivity to careless responding. 


2. METHODS 
2.1 Participants 


The sample for this study consisted of item responses from 14474 
students obtained from the Algebra 1 section of Math Nation 
during the time that face-to-face instruction in schools in the 
state of Florida was canceled due to the threat of COVID-19 
infection and replaced by online instruction. More specifically, 
the period that responses were collected was from March 18 to 
May 31, 2020. Math Nation focuses on preparing students for 
taking the high-stakes Algebra 1 End-of-Course exam required 
for high school graduation, usually administered in May by the 
Florida Department of Education. Because of COVID-19, the 
Algebra 1 End-of-Course exam was canceled, removing an 
important motivator for students to use Math Nation. However, 
Math Nation usage spiked during this period because teachers 
could use it as a resource to teach algebra while students were 
attending classes virtually. Therefore, this sample is well-suited 
for studying careless responding because students were making 
heavy use of a VLE and its assessment features, and yet they did 
not have the pressure of practicing for the high-stakes 
achievement test at the end of the school year, and they were 
engaging in schooling from home, which may expose them to 
many distractions. 


2.2 Measures 

There are a total of 10 sections in Math Nation, which
correspond to major concepts in Algebra, such as linear
functions and quadratic functions. Each section has between 6 
and 12 formative assessments with 3 items and one formative 
assessment with 10 items randomly drawn uniquely for each 
student from an item pool. The students can take these 
assessments as many times as they like. This study’s data 
included responses to 40 items of the item pool of the 10-item 
formative assessment of “Section 9: One-Variable Statistics” of 
Math Nation. We chose this section because it had the highest 
number of responses to items during the time period of interest. 
We chose to focus on the 10-item assessment rather than the 3- 
item ones because teachers frequently ask students to complete 
the 10-item assessment before moving forward to the next 
section of Math Nation. Once completed, students have the 
option to review their answers and watch solution videos for 
each item. 

The items used in this study had multiple formats (e.g., single
choice, multiple-choice, constructed response) but were scored
dichotomously (i.e., correct or incorrect). The items were
written to mimic the content and format of items on the 
statewide Algebra 1 End-of-Course exam and have been found in 
our research project to correlate strongly to scores on that exam. 
Difficulty and discrimination parameters for the items in the 
study were taken from the estimates reported in a previous 
study (Xue et al., 2020), which used the 2-parameter logistic 
(2PL) item response theory model (IRT; Birnbaum, 1968). 


2.3 Analysis 


In the current study, we compared the effectiveness of 14 
nonparametric and parametric person-fit statistics (see Table 1) 
as well as the performance of the random forest classifier to 
capture carelessly answered responses. These person-fit indices 
are available in the PerFit (Tendeiro, Meijer, & Niessen, 2016) 
package of R statistical software (R Core Team, 2018). To obtain 
nonparametric person-fit indices, no IRT model is required. 
However, for the parametric lz and lzstar, the item parameters
(i.e., difficulty and discrimination) for a 2PL IRT model were
obtained from Xue et al. (2020) and used to estimate the student
ability parameters. The 2PL model formula is


$$P(Y_{is} = 1 \mid \theta_s) = \frac{\exp[1.7\,a_i(\theta_s - b_i)]}{1 + \exp[1.7\,a_i(\theta_s - b_i)]},$$

where P indicates probability, $\theta$ is a latent trait of ability, the
subscript i indicates an item, Y is an item response, b is item
difficulty, a is item discrimination, the subscript s indicates a
student, and 1.7 is a scaling constant. This formula gives us the
probability of correct response (Y=1) for item i and person s
conditional on the individual’s ability.
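
For readers who prefer code, the 2PL response probability can be written as a small Python function. This is an illustrative sketch only; the function and argument names are hypothetical, and the example parameter values are made up rather than taken from the study.

```python
import math


def p_correct_2pl(theta, a, b, scaling=1.7):
    """Probability of a correct response under the 2PL model.

    theta: student ability, a: item discrimination, b: item difficulty.
    exp(z) / (1 + exp(z)) is computed in the equivalent form
    1 / (1 + exp(-z)).
    """
    z = scaling * a * (theta - b)
    return 1.0 / (1.0 + math.exp(-z))


# A student whose ability equals the item difficulty has a 50% chance.
p_at_difficulty = p_correct_2pl(theta=0.0, a=1.0, b=0.0)
# One unit of ability above the difficulty raises that chance.
p_above = p_correct_2pl(theta=1.0, a=1.0, b=0.0)
```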


All 14 person-fit indices were computed using the PerFit 
(Tendeiro, Meijer, & Niessen, 2016) package for all selected 
individuals in the sample and the result of the person-fit indices 
(person-fit scores) was saved to be used as part of the predictors 
for the random forest. Additional predictors were item difficulty 
and discrimination estimated parameters, the number of items 
answered by each student, ranging from 10 to 40 items, and the 
number of correct items answered (from zero to 10). Within 
“Section 9: One-Variable Statistics”, students could respond to all 
available 40 items by taking section tests multiple times. Only 
the responses for students who answered at least 10 items were 
retained for the analysis. The total number of answered items 
out of 40 were available in the dataset and used as a predictor of 
carelessness in the random forest. 


Random forest was implemented with the randomForest (Liaw & 
Wiener, 2002) package in R. To identify carelessness in responses 
for the model training, the time taken by each student to answer 
each item was recorded, and by comparing the graph of the 
frequency of correct responses versus incorrect responses across 
time for each item, empirical cutoffs were determined. In most 
cases, the point in time where the frequency of incorrect 
responses was decreasing, and the frequency of correct 
responses started to increase, was chosen as a cutoff for 
carelessness. For example, for item two, a cutoff of eight seconds 
was determined (see blue dashed line in Figure 1). Therefore, 
student responses to item two that were recorded in less than 
eight seconds were coded as careless. 
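
The labeling rule amounts to a per-item comparison of response time against the cutoff, which can be sketched in Python as follows. The record layout, the identifiers, and the function name are hypothetical; only the eight-second cutoff for item two comes from the text.

```python
def flag_careless(response_times, cutoffs):
    """Label each response as careless when it is faster than the cutoff.

    response_times: iterable of (student_id, item_id, seconds) tuples.
    cutoffs: dict mapping item_id to its empirical cutoff in seconds.
    Returns a dict mapping (student_id, item_id) to True (careless)
    or False (not careless).
    """
    return {
        (student, item): seconds < cutoffs[item]
        for student, item, seconds in response_times
    }


cutoffs = {2: 8.0}  # item two: responses under eight seconds are careless
records = [("s1", 2, 5.0), ("s2", 2, 8.0), ("s3", 2, 20.0)]
flags = flag_careless(records, cutoffs)
```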

Figure 1: Frequency graph of correct responses versus
incorrect responses across time for item 2. 


To evaluate predictions by person-fit indices and random forest, 
we obtained the Receiver Operating Characteristic (ROC) curve. 
The ROC curve compares sensitivity against the specificity of a
predictor for dichotomous data (Hanley & McNeil, 1982).
Sensitivity is the ability to identify true careless responding (true
positive rate), and specificity is the ability to correctly identify
responses that are not careless (true negative rate). Usually, ROC
curves plot “1-specificity”, which is the Type I error rate (false
positive rate; Hanley & McNeil, 1982). For an informative predictor,
the AUC of a ROC curve ranges from 0.5, the value for the
identification line (i.e., the diagonal on the ROC plot), to 1. It
represents the accuracy of the predictor or feature, and different
values can be interpreted as showing the strength of a test.
Generally, an AUC of 0.90-1 indicates an outstanding test, 0.80-0.90
is considered excellent, 0.70-0.80 is acceptable, and an AUC of about
0.5 suggests no discrimination (Hosmer, Lemeshow, &
Sturdivant, 2013).
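
As a concrete illustration, the AUC can be computed directly as the probability that a randomly chosen careless response receives a higher score than a randomly chosen non-careless one. The Python sketch below uses this pairwise formulation; it is not the routine used in the study, and the function names and the "below acceptable" label for values under 0.70 are assumptions added here.

```python
def auc(labels, scores):
    """AUC as the share of (positive, negative) pairs ranked correctly.

    labels: 1 for careless, 0 for not careless.
    scores: predictor values, where higher means more likely careless.
    Tied pairs count as half a correct ranking.
    """
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("both classes must be present")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


def interpret_auc(value):
    """Map an AUC to the verbal bands quoted in the text."""
    if value >= 0.90:
        return "outstanding"
    if value >= 0.80:
        return "excellent"
    if value >= 0.70:
        return "acceptable"
    return "below acceptable"


example = auc([1, 1, 0, 0], [0.9, 0.2, 0.3, 0.1])
band = interpret_auc(example)
```

The pairwise loop is quadratic in the number of responses, so for large datasets a rank-based computation would be preferable; the sketch favors readability.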


Creating ROC plots for person-fit indices requires the “true” 
careless responses (see above for flagging careless responses 
with time cutoffs) to compare them with estimated careless/non- 
careless responses by person-fit indices and calculate AUC. For 
person-fit indices, we defined a person as careless if they
responded to at least one item out of 10 carelessly based on the
specified cutoffs. For the random forest ROC, true labels for
carelessness were calculated per item response, and each student 
received 10 true labels of careless/non-careless for 10 answered 
items. 


3. PRELIMINARY RESULTS


Person-fit indices have been used as a measure to detect careless 
responding in previous studies (e.g., Niessen et al., 2016; Patton 
et al., 2019). In this study, all 14 available person-fit indices in the
PerFit package were calculated for all students in the data, and
the results are shown in Table 2. Most of these person-fit indices
yielded a very poor AUC of around 50%, which suggests no
discrimination between careless and non-careless students.
Therefore, only the indices with the three best AUC values were
retained for later comparisons: 1) Guttman errors with 54.7%, 2)
agreement statistic with 69.1%, and 3) dependability statistic with
63.5%. It is noteworthy that even the best performing index, the
agreement statistic with 69.1% AUC, is not discriminating enough
to be considered an acceptable classifier of careless/non-careless
students based on the cutoff of 70% AUC proposed by Hosmer,
Lemeshow, and Sturdivant (2013).

Table 2: AUC of 14 person-fit indices

Person-fit index                    | PerFit function | AUC (%)
Nonparametric
Personal point-biserial correlation | r.pbis          | 50.2
Caution statistic                   | C.Sato          | 50.3
Modified caution statistic          | Cstar           | 52.4
Number of Guttman errors            | G               | 54.7
Normalized Guttman errors           | Gnormed         | 49.9
Agreement statistic                 | A.KB            | 69.1
Disagreement statistic              | D.KB            | 51.6
Dependability statistic             | E.KB            | 63.5
U3 statistic                        | U3              | 52.4
Standardized normal U3              | ZU3             | 50.1
Norm conformity index               | NCI             | 50.1
HT statistic                        | Ht              | 50.4
Parametric
Standardized normal loglikelihood   | lz              | 50.1
Corrected lz                        | lzstar          | 50.1


The random forest has the advantage over person-fit indices in 
that it can use multiple person-fit indices as predictors but also 
include other predictors. The random forest included the three 
person-fit indices with the highest AUC and four additional 
predictors: estimated item difficulty and item discrimination 
parameters, number of correct responses, and number of items 
taken within the section. The AUC of the ROC curve showed that the
random forest with the set of specified predictors improved the
classification and achieved an AUC of 77.2%, which indicates an
acceptable test for distinguishing between careless and non-careless
responses (see Figure 2). The random forest also outperformed
the best person-fit index, the agreement statistic, by about 8
percentage points (77.2% versus 69.1%).

Figure 2: AUC of ROC plot for random forest classifier


4. DISCUSSION AND CONCLUSION


The study objective was to compare the detection of careless
responses between person-fit indices by themselves and random
forest, including fit indices and other predictors. From the
obtained results, we can conclude that random forest more
accurately predicts careless responding, and this research took a
methodological step forward in automatic identification of
careless responding. By introducing different predictors, multiple
dimensions are available, and a random forest can be used to
investigate different parts of the data. In addition, careless
responding can be conceptualized and examined as an item-person
interaction rather than a static person feature. If desired,
the random forest results can be aggregated to the person level
(e.g., the proportion of responses from each person that were
flagged as careless), allowing additional information at the
person level. At this level, students can be labeled as careless
responders or non-careless.
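
The person-level aggregation mentioned above can be sketched in Python as follows. The input layout, the function names, and in particular the 0.3 threshold used to label a student are hypothetical choices for illustration; the text does not prescribe a threshold.

```python
from collections import defaultdict


def proportion_flagged(predictions):
    """Share of each student's responses that were flagged as careless.

    predictions: iterable of (student_id, item_id, is_careless) tuples,
    one per item response, e.g. item-level classifier output.
    """
    flagged = defaultdict(int)
    total = defaultdict(int)
    for student, _item, is_careless in predictions:
        total[student] += 1
        flagged[student] += 1 if is_careless else 0
    return {s: flagged[s] / total[s] for s in total}


def label_students(proportions, threshold=0.3):
    """Label a student as careless when the flagged share exceeds
    an analyst-chosen threshold (assumed value, not from the study)."""
    return {s: p > threshold for s, p in proportions.items()}


preds = [
    ("A", 1, True), ("A", 2, False), ("A", 3, False), ("A", 4, True),
    ("B", 1, False),
]
props = proportion_flagged(preds)
person_labels = label_students(props)
```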


Person-fit indices as independent measures of carelessness may 
face some problems regarding ROC analysis. Sometimes, the true
positive rates swing against the false positive rates, cancel out, and
result in a very low AUC of around 50%, which indicates no
discrimination.
However, this issue does not occur in the random forest because 
of the way trees are constructed. Person-fit indices (and other 
predictors) can be removed from the random forest when they 
do not help improve classification accuracy. 


One limitation of this study is that we relied on time cutoffs of 
each item response to create the criterion for careless 
responding. Like all options for “true” carelessness in responses, 
one could argue that our time-based flags of “true” careless 
responses are fallible in their own ways. However, empirical 
thresholds or time cutoffs have been used many times in 
previous research to detect careless responding (e.g., Wise & 
Kong, 2005; Wise & DeMars, 2010; Wise, 2015; Wise, 2017; Rios 
& Soland, 2020). Alternative criteria could come from surveys of 
students after they complete each formative assessment. 


The result of this research will eventually be available as a
trained random forest model to be used by applied researchers
to detect careless responding in their data. They could enter the
raw data into an R package to be developed in the future and, for
the number of items in their test, obtain a prediction of
whether each answer is careless or non-careless. Then, within
the context of their research, they can decide how they would 
like to aggregate and interpret carelessness (i.e., at the person 
level or at the item level) and make decisions according to the 
available results. 
ABSTRACT


Study session dropout prediction allows educational systems to
identify when a student would stop a study session, which gives vital
information for prolonging learning activity. Student session dropout
can depend on many factors that are involved with the engage- 
ment when using the system. The student’s knowledge level and 
their track records within the system are closely related to the stu- 
dent’s willingness to continue with their study. Knowledge trac- 
ing as a task models the user’s knowledge level given study his- 
tory. The information from knowledge tracing can have significant 
impact on predicting the student’s willingness to continue, which 
is why it is natural to train two tasks jointly for better general- 
ization in dropout prediction task. While extensive research has 
been conducted individually on dropout prediction and knowledge 
tracing, the effect of jointly modeling two tasks has not been thor- 
oughly investigated. Hence, we show that multi-task training of 
the study session dropout prediction model along with knowledge 
tracing boosts the performance of study session dropout prediction,
especially on more challenging tasks and datasets. Specifically,
with Transformer-based models, multi-task training significantly
improves the Area Under the Receiver Operating Characteristic curve
(AUROC) by 3.62% in the further N-step dropout prediction task, which
is a study session dropout prediction task under a more practical
setting. Moreover, under label-scarce and class-imbalance settings,
our method shows improvements of AUROC up to 12.41% and 
11.22%, respectively. Our results imply that knowledge tracing is 
closely related to study session dropout prediction and can transfer 
positive knowledge in multi-task training, which provides a new 
way to better predict dropouts especially in difficult settings. 


Keywords 


Dropout prediction, Multi-task Training, Knowledge Tracing 


1. INTRODUCTION 


The advantages of e-learning have gathered the attention of both
educators and researchers. One of the lasting problems in e-learning
is maintaining the user’s attention during system use.
Study Session Dropout Prediction”. 2021. In: Proceedings
For instance, students in mobile learning environments are more
prone to distractions and exhibit difficulties in concentration [16, 
20, 5]. Thus, being able to properly identify when these issues oc- 
cur will allow an Intelligent Tutoring System (ITS) [1] to appropri- 
ately and preemptively intervene. This task is called Study Session 
Dropout Prediction and has been recently proposed by [21]. Pre- 
dicting such session dropout is a crucial task in Educational Data 
Mining (EDM) for understanding students’ behaviors and learning
environments, which can lead to an increased learning effect.


However, study session dropout prediction has not yet been exten- 
sively studied. Many recent research works have instead focused 
on predicting student dropout in environments like universities or 
Massive Open Online Courses (MOOC) [3, 13, 27, 33, 35].
Interestingly, [22, 9, 26] have also shown that one of the main
reasons students drop out from schools or classes is their academic
performance, which is highly relevant to their knowledge states. Given
such knowledge, we hypothesize that study session dropout can
also be attributed to the knowledge states of students.


Hence, in this paper, we jointly model study session dropout
prediction with knowledge tracing, a heavily studied task
[12, 23, 8] that aims to predict the student’s future performance
on knowledge components (e.g. questions or concepts) given the
student’s historical data. In this study, we address this issue through
a machine learning methodology known as multi-task learning [4].
Multi-task learning jointly trains multiple tasks together to
formulate a comprehensive understanding of the nature of the data.
Specifically, we implement a multi-task training model that is trained
with both session dropout prediction and knowledge tracing.
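
One common way to train two binary prediction tasks jointly is to minimize a weighted sum of their losses. The Python sketch below illustrates that idea under stated assumptions: the use of binary cross-entropy for both tasks, the `kt_weight` parameter, and all function names are hypothetical and are not taken from this paper's implementation.

```python
import math


def binary_cross_entropy(y_true, p_pred, eps=1e-12):
    """Mean binary cross-entropy between 0/1 targets and probabilities."""
    total = 0.0
    for y, p in zip(y_true, p_pred):
        p = min(max(p, eps), 1.0 - eps)  # avoid log(0)
        total -= y * math.log(p) + (1 - y) * math.log(1.0 - p)
    return total / len(y_true)


def multi_task_loss(dropout_true, dropout_pred,
                    correct_true, correct_pred, kt_weight=1.0):
    """Dropout loss plus a weighted knowledge tracing loss.

    kt_weight (assumed hyperparameter) controls how strongly the
    auxiliary knowledge tracing task shapes the shared model.
    """
    dropout_loss = binary_cross_entropy(dropout_true, dropout_pred)
    kt_loss = binary_cross_entropy(correct_true, correct_pred)
    return dropout_loss + kt_weight * kt_loss


# Setting kt_weight to 0 recovers single-task dropout training.
single_task = multi_task_loss([1, 0], [0.8, 0.3], [1, 1], [0.6, 0.9],
                              kt_weight=0.0)
joint = multi_task_loss([1, 0], [0.8, 0.3], [1, 1], [0.6, 0.9])
```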


The contributions of this paper are as such: 


• We provide a multi-task training framework to jointly model
study session dropout prediction and knowledge tracing.

• We show that our multi-task training framework boosts the
performance of the trained model on study session dropout
prediction. Also, we show that our method elevates the
performance in the further N-step dropout prediction task, where
the model has to predict dropouts not only in the immediate time
step, but also in future time steps.

• We perform extensive ablation studies to show that multi-task
training shows even higher performance on more difficult
experimental settings such as label scarcity and class imbalance.
We also show with ablation studies that the threshold of lag time
that we use to define session dropout also affects the performance
of multi-task training.


2. RELATED WORKS 
2.1 Dropout prediction 


There have been many attempts to predict dropouts in various
environments. Traditionally, dropout prediction has been used
to predict student dropouts [3, 27]. Following the proliferation of
the internet, dropout prediction became applicable to online services.
[2, 15] incorporated deep learning methods in the Spotify Sequential
Skip Prediction Challenge, where the task is to infer the songs that
will be skipped in the second half of a session given the first half.
Models such as denoising autoencoders and variants of the Long
Short-Term Memory (LSTM) network were utilized. [13, 33, 35] used deep
learning methods such as the Convolutional Neural Network (CNN) to
predict students’ dropouts in MOOCs. Study session dropout prediction
has been studied to discover students’ involvement in mobile
learning environments. [21] utilized the Transformer network
[31], which replaced the recurrent architecture of Recurrent Neural
Networks (RNN) with self-attention blocks, to predict the study
session dropout probability in a mobile learning environment. We used
the Deep Attentive Study Session Dropout prediction (DAS) model from
[21] with a multi-task training approach in our experiments.


2.2 Multi-Task training

Multi-task training or multi-objective training is a method to train a 
machine learning model with multiple objectives [37], which tries 
to enhance the performance on original task by sharing features 
with auxiliary tasks. It has been used across various domains in 
machine learning, such as Computer Vision and Natural Language 
Processing. For example, in [24], the authors used multi-task train- 
ing for face labeling by training a CNN to handle both likelihoods 
and pairwise label dependencies. Multi-Task Deep Neural Network 
(MT-DNN, [25]), is a BERT [11] based model with several task- 
specific layers for multi-task learning, which outperforms vanilla 
BERT’s performance on the GLUE benchmark [32]. 


There are also several applications of multi-task training in the
educational field. Huang et al. [18] present a Transformer-based model
that identifies whether a given voice of a teacher corresponds to a
question or not, and that solves the problem as a multi-class
classification problem to recognize question types. Geden et al. [14]
proposed an LSTM-based model to predict the correctness rate of all
questions instead of the average correctness rate for the related
questions. In [19], Huang et al. suggest a Deep Reinforcement Learning
based exercise recommendation system whose reward function is
designed to satisfy multiple objectives.


2.3 Knowledge Tracing 

Knowledge tracing is the task of modeling students’ knowledge level
given their learning activities. Knowledge tracing and dropout
prediction share the aspect that they both model students’ responses
given their learning histories. Bayesian Knowledge Tracing (BKT)
is a traditional method which treats a student’s learning activities
as binary variables representing whether the student understands
a certain concept or not [36]. Some works proposed to incorporate
deep learning methods in knowledge tracing. [29] feeds the users’
one-hot encoded learning activities into RNN-based model
architectures to output the correctness prediction probability. [7, 28]
use Transformer-based architectures for knowledge tracing. SAINT [7]
has a similar architecture to DAS, which uses both the encoder and
the decoder structure of the Transformer.


3. METHODS 
3.1 Study Session Dropout Prediction 


Formally, a student’s learning history is given as a sequence of
interactions

$$I = (I^{(1)}, I^{(2)}, \ldots, I^{(T)})$$

where each $I^{(j)} = (e^{(j)}, l^{(j)})$ includes meta-data of the
question $e^{(j)}$ that a student solves at the j-th step (e.g.
question id, category of the question, question text, ...) and the
meta-data of the student’s response $l^{(j)}$ (e.g. response
correctness, elapsed time, timeliness, ...) at the j-th step. Then the
study session dropout prediction is to estimate the probability

$$P(d^{(j)} = 1 \mid I^{(1)}, \ldots, I^{(j-1)}, e^{(j)})$$

that the session dropout occurs after solving the j-th question. Note
that a sequence can contain multiple sessions. As in [21], we define
one-hour inactivity as a session dropout, so that the dropout label
at the j-th step is given by

$$d^{(j)} = \begin{cases} 1 & \text{if } lt^{(j)} := st^{(j+1)} - st^{(j)} > 1 \text{ hour} \\ 0 & \text{otherwise} \end{cases}$$

where $st^{(j)}$ is the start time at the j-th step, i.e. the time at
which the user starts to solve the question, and $lt^{(j)}$ is the lag
time for the j-th interaction.
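
As a minimal sketch of the labeling rule above, the following Python function derives dropout labels from a list of start times. The function name `dropout_labels`, the use of seconds as the time unit, and the choice to leave the last interaction unlabeled (it has no successor) are assumptions made for illustration, not details given in the paper.

```python
from typing import List, Optional

ONE_HOUR = 3600  # dropout threshold in seconds (one-hour inactivity)


def dropout_labels(start_times: List[float],
                   threshold: float = ONE_HOUR) -> List[Optional[int]]:
    """Return the dropout label for each step, given start times in seconds.

    The label at step j is 1 if the lag time st[j+1] - st[j] exceeds the
    threshold, and 0 otherwise. The final step has no successor, so its
    label is left as None (an assumption of this sketch).
    """
    labels: List[Optional[int]] = []
    for j in range(len(start_times) - 1):
        lag = start_times[j + 1] - start_times[j]
        labels.append(1 if lag > threshold else 0)
    if start_times:
        labels.append(None)
    return labels
```

For example, start times of 0, 60, 4000 and 4100 seconds give lag times of 60, 3940 and 100 seconds, so only the second interaction is labeled as a dropout.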


3.2 Input Representation 


The representation of each interaction $I^{(j)} = (e^{(j)}, l^{(j)})$
is formulated similarly to the settings in [21]. The feature settings
of our model differ from those of the original DAS model in [21] in
two minor ways:


1. Instead of start time, we use the lag time feature. It is more
directly related to the dropout and leads to a substantial
gain in the model’s performance. Since the distribution of the
lag time is long-tailed, we use the logarithm of the lag time
instead of the lag time itself (see Figure 2 for the
distribution of the lag time). It is used as an input to the
decoder, not to the encoder.


2. We use a continuous embedding for elapsed time, instead of a
discrete embedding. More precisely, we first clip the actual
elapsed time at a maximum of 300 seconds, then normalize it
by dividing it by 300. After that, we get a latent embedding
vector for the elapsed time $et$ by $v_{et} = et \cdot W_{et}$,
where $W_{et}$ is a single trainable vector which has the same
dimension as the model.
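
The two feature transformations in the list above can be sketched as follows. This is an illustrative sketch only: the function names, the small offset `eps` that keeps the logarithm finite for a zero lag time, and the lower clipping bound of 0 are assumptions that the paper does not specify.

```python
import math
from typing import List

MAX_ELAPSED = 300.0  # clipping bound for elapsed time, in seconds


def log_lag_time(lag_seconds: float, eps: float = 1.0) -> float:
    """Logarithm of the lag time; eps avoids log(0) (an assumption)."""
    return math.log(lag_seconds + eps)


def elapsed_time_embedding(elapsed_seconds: float,
                           w_et: List[float]) -> List[float]:
    """Continuous embedding: the normalized elapsed time scales the vector w_et.

    The elapsed time is clipped to [0, 300] seconds and divided by 300,
    so the scalar multiplier lies in [0, 1].
    """
    et = min(max(elapsed_seconds, 0.0), MAX_ELAPSED) / MAX_ELAPSED
    return [et * w for w in w_et]
```

In a real model `w_et` would be a trainable parameter of the same size as the model dimension; here it is a plain list so that the arithmetic is easy to follow.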


3.3 Model


In this section, we describe our methodology to jointly train dropout
prediction and knowledge tracing. We use a shared model $f$ to
generate the shared feature representations for both tasks. An
arbitrary model $f$ takes the sequences of question embeddings
$e = [e^{(1)}, \ldots, e^{(j)}]$ and response embeddings
$l = [l^{(1)}, \ldots, l^{(j-1)}]$ to produce the shared feature
representation for dropout prediction and knowledge tracing. Then,
the feature representation is fed into the final separate prediction
layers to output predicted dropout probabilities and response
correctness:

$$\hat{y}_{DP} = \sigma(W_{DP}(f(e, l)) + b_{DP})$$
$$\hat{y}_{KT} = \sigma(W_{KT}(f(e, l)) + b_{KT})$$
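
To illustrate the two prediction layers, the sketch below applies two separate linear heads followed by a sigmoid to the same shared features. It assumes that the shared model has already produced one feature vector per time step; the function names and the plain-list representation are hypothetical and chosen only to keep the example self-contained.

```python
import math
from typing import List, Tuple


def sigmoid(x: float) -> float:
    return 1.0 / (1.0 + math.exp(-x))


def linear(weights: List[float], bias: float, features: List[float]) -> float:
    """Dot product of one weight vector with one feature vector, plus bias."""
    return sum(w * x for w, x in zip(weights, features)) + bias


def predict_both(shared: List[List[float]],
                 w_dp: List[float], b_dp: float,
                 w_kt: List[float], b_kt: float
                 ) -> Tuple[List[float], List[float]]:
    """Apply the dropout head and the correctness head to shared features.

    `shared` holds one feature vector per time step (the output of the
    shared model). Each head has its own weights but reads the same
    features.
    """
    y_dp = [sigmoid(linear(w_dp, b_dp, h)) for h in shared]
    y_kt = [sigmoid(linear(w_kt, b_kt, h)) for h in shared]
    return y_dp, y_kt
```

Because both heads read the same features, gradients from the correctness task also update the shared model, which is how the auxiliary task can influence dropout prediction.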
Further steps N       Model   AUROC                        AUPRC
(positive label)              vanilla   multi-objective    vanilla   multi-objective
N = 1  (4.06%)        LSTM    0.8704    0.8717             0.3208    0.3229
N = 1  (4.06%)        GRU     0.8652    0.8670             0.3083    0.3133
N = 1  (4.06%)        DAS     0.8807    0.8836             0.3469    0.3534
N = 5  (18.34%)       LSTM    0.7586    0.7599             0.4577    0.4598
N = 5  (18.34%)       GRU     0.7567    0.7570             0.4547    0.4556
N = 5  (18.34%)       DAS     0.7579    0.7734             0.4598    0.4815
N = 10 (32.52%)       LSTM    0.736     0.7367             0.5880    0.5886
N = 10 (32.52%)       GRU     0.7338    0.7349             0.5842    0.5855
N = 10 (32.52%)       DAS     0.7229    0.7491             0.5717    0.6061


Table 1: Test AUROCs and AUPRCs of DAS and RNN-based dropout prediction models. Further steps N of each task and its positive label proportions of
the dataset are indicated in the first column (N = 1 corresponds to the original session dropout prediction task). Best result for each model is indicated in bold.


[Figure 1 diagram: the question and response inputs are fed to the
Shared Model, whose output goes to a Dropout Prediction Layer and a
Correctness Prediction Layer.]

Figure 1: Overall architecture of our multi-task training scheme. Note that
s is the starting token.


The major difference between our method and previous dropout
prediction models is that we jointly train the model to predict
both student dropout and response correctness, by using a different
prediction layer for each task at the end of the shared model. Using
separate prediction layers, we produce both
$\hat{y}_{DP} = (\hat{y}_{DP}^{(1)}, \ldots, \hat{y}_{DP}^{(j)})$ and
$\hat{y}_{KT} = (\hat{y}_{KT}^{(1)}, \ldots, \hat{y}_{KT}^{(j)})$,
which are the predicted probabilities for study session dropout
($\hat{y}_{DP}$) and response correctness ($\hat{y}_{KT}$) for each
time step. The training scheme of our approach is described in
Figure 1. The major baseline that we use for our methodology is
DAS [21], which is a Transformer-based model that predicts study
session dropout. The details of the architecture of DAS are described
in Appendix A. We also experiment with RNN-based model
architectures, including LSTM and GRU [6, 17], which are provided as
baselines in [21] for comparison. For RNN-based models, we use an
encoder-only structure instead of an encoder-decoder structure.


3.4 Training objectives 

Typically, Binary Cross-Entropy (BCE) loss is used in 2-class
classification tasks, which include dropout prediction and
knowledge tracing. We use the BCE function to compute
$\mathcal{L}_{DP}$ and $\mathcal{L}_{KT}$, which are the losses for
dropout prediction and knowledge tracing. We train the model with the
loss

$$\mathcal{L} = \mathcal{L}_{DP} + \lambda_{KT} \mathcal{L}_{KT}$$

where $\lambda_{KT}$ is a balancing hyper-parameter. Our experiments
are performed with $\lambda_{KT} = 0.5$.
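
The combined objective can be sketched in a few lines of Python. The sketch assumes that each loss is the mean BCE over time steps and clamps probabilities to keep the logarithm finite; both choices, as well as the function names, are illustrative assumptions rather than details stated in the paper.

```python
import math
from typing import List


def bce(probs: List[float], targets: List[int], eps: float = 1e-12) -> float:
    """Mean binary cross-entropy between predicted probabilities and 0/1 targets."""
    total = 0.0
    for p, y in zip(probs, targets):
        p = min(max(p, eps), 1.0 - eps)  # clamp to keep log() finite
        total -= y * math.log(p) + (1 - y) * math.log(1.0 - p)
    return total / len(targets)


def multi_task_loss(y_dp_hat: List[float], dropout: List[int],
                    y_kt_hat: List[float], correct: List[int],
                    lambda_kt: float = 0.5) -> float:
    """Dropout loss plus the knowledge tracing loss weighted by lambda_kt."""
    return bce(y_dp_hat, dropout) + lambda_kt * bce(y_kt_hat, correct)
```

Setting `lambda_kt` to 0 recovers single-task dropout training, so the vanilla baselines correspond to the same code with the auxiliary term switched off.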


4. EXPERIMENTS 
4.1 Experiment setup 
Figure 2: Distribution of data points with respect to the lag time. The lag
time in the graph ranges from 600s (10 minutes) to 7200s (2 hours). The 
number of data points exponentially decays as their lag time increases. 


Mask rate   Positive label   AUROC                        AUPRC
            proportion       vanilla   multi-objective    vanilla   multi-objective
50%         3.80%            0.8381    0.8832             0.2599    0.3521
90%         3.80%            0.7752    0.8740             0.1602    0.3304
95%         3.80%            0.7264    0.7598             0.1202    0.1322
99%         3.81%            0.6826    0.6837             0.0984    0.0901
50%         1.90%            0.8418    0.8833             0.2775    0.3520
90%         0.38%            0.7677    0.8204             0.1504    0.2093
95%         0.19%            0.7175    0.7980             0.1163    0.1726
99%         0.04%            0.6782    0.7304             0.0923    0.1095


Table 2: Test AUROCs and AUPRCs of the DAS model with various masking
rates on both labels (first 4 rows) and only positive dropout labels (last
4 rows). The first two columns indicate the rate of random masking and the
proportion of positive labels in the training data for each mask rate. Best
result for each masking rate is indicated in bold.


We use the EdNet-KT1 dataset [8], the largest publicly available
student interaction dataset, collected by Santa, which is a mobile
application for preparing for the Test of English for International
Communication (TOEIC) exam. The proportion of the logs where session
dropout occurred was 4.06% under the definition of dropout as a
one-hour lag time. The distribution of dropout labels with respect to
the lag time used to define the dropout is described in Figure 2.


For RNN-based model architectures, we use an embedding and model
dimension of 256 and a feed-forward layer dimension of 1024 with
2 layers. For DAS, we use an embedding and model dimension of 512
and a feed-forward layer dimension of 2048 with 4 layers. During
training, we set the model’s input sequence length to 100, and all
models are trained with the Adam optimizer with Noam scheduling,
where the number of warmup steps is 40000. We set the initial
learning rate and the model’s dropout rate to 0.001 and 0.1,
respectively.


Lag time   Positive label   AUROC                        AUPRC
           proportion       vanilla   multi-objective    vanilla   multi-objective
600s       6.33%            0.8853    0.8876             0.4327    0.4387
1800s      4.61%            0.8786    0.8821             0.3582    0.3657
3600s      4.06%            0.8807    0.8836             0.3469    0.3534
5400s      3.79%            0.8768    0.8857             0.3309    0.3519
7200s      3.60%            0.8789    0.8876             0.3298    0.3508


Table 3: Test AUROCs and AUPRCs of the DAS model with various
standards of lag time on defining dropouts (in seconds). The first two columns
indicate the lag time used to define session dropout and the proportion of
positive labels in the total dataset for each definition. Best result for each
lag time is indicated in bold.


We evaluate our models with two metrics: Area Under the Receiver
Operating Characteristic curve (AUROC) and Area Under the
Precision-Recall Curve (AUPRC). AUROC is the most widely used metric
in the literature for evaluating dropout prediction models, because
labels in dropout prediction settings are usually imbalanced. AUPRC
is known to be even more informative than AUROC when the dataset’s
labels are imbalanced [10, 30].
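
For readers who want to see how the two metrics are computed, the sketch below implements both in plain Python. AUROC is computed as the fraction of (positive, negative) pairs that are ranked correctly, and AUPRC is estimated with average precision, which is one common estimator and not necessarily the one used in our experiments. The function names are hypothetical, and ties between scores are not treated specially in the average precision.

```python
from typing import List


def auroc(labels: List[int], scores: List[float]) -> float:
    """Probability that a random positive is scored above a random negative."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def average_precision(labels: List[int], scores: List[float]) -> float:
    """Mean of the precision values measured at the rank of each positive."""
    ranked = sorted(zip(scores, labels), key=lambda pair: -pair[0])
    hits = 0
    total = 0.0
    for rank, (_, y) in enumerate(ranked, start=1):
        if y == 1:
            hits += 1
            total += hits / rank
    return total / hits
```

A classifier that scores at random has an AUROC near 0.5 regardless of class balance, while its average precision is close to the proportion of positive labels, which is why AUPRC values are only comparable between tasks with similar label proportions.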


We evaluate the effect of multi-task training on two tasks. The
first is the standard study session dropout prediction described in
Section 3.1, where the task is to estimate the probability that the
session dropout occurs after the current time step. However, in an
actual service, it is more important to predict whether the user will
drop out within several future time steps, in order to respond to the
user’s engagement in advance. Thus, we also evaluate our method
on the further N-step dropout prediction task, which predicts whether
the user will drop out within the next N time steps. In further
N-step dropout prediction, the number of future time steps that the
model has to consider increases as N increases. We perform further
N-step dropout predictions with $N \in \{5, 10\}$. For all tasks,
we measure AUROC and AUPRC on LSTM, GRU, and DAS with
and without multi-task training to validate the effect of our method.
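
One way to derive further N-step labels from the immediate dropout labels is sketched below. The windowing convention, in which the label at step j is positive if a dropout occurs at any of the steps j to j + N - 1 and the window is truncated at the end of the sequence, is an assumption chosen so that N = 1 reduces to the original task; the function name is hypothetical.

```python
from typing import List


def n_step_labels(dropout: List[int], n: int) -> List[int]:
    """Label step j as 1 if any of the next n dropout labels, starting at j, is 1.

    The window is truncated at the end of the sequence (an assumption).
    """
    return [1 if any(dropout[j:j + n]) else 0 for j in range(len(dropout))]
```

Widening the window can only turn labels from 0 to 1, which is consistent with the proportion of positive labels growing with N in Table 1.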


4.2 Main results


The results of multi-task training on study session dropout
prediction are given in Table 1. Multi-task training with knowledge
tracing improves AUROC and AUPRC for dropout prediction across all
models. Table 1 also presents the effect of our method on further
N-step dropout prediction. For DAS, multi-task training increases
performance by larger margins in the further N-step tasks than in
the immediate dropout prediction task: the AUROC gain is 0.0155 for
N = 5 and 0.0262 for N = 10, compared with 0.0029 for N = 1. The
gains for the RNN-based models remain small in all tasks. For DAS,
the increase in AUROC is highest when N = 10, since the number of
future steps that the model has to consider increases with N,
leaving more room for multi-task training to help the model. Table 1
also includes the proportion of positive labels of the dataset in
each task to explain the difference in AUPRCs between the tasks.
Although further 10-step prediction shows lower AUROCs than the other
tasks, it shows higher AUPRCs since its dataset is less imbalanced.


4.3 Ablation study 


We performed ablation studies on the immediate study session dropout
prediction task for fair comparisons. The ablations cover label
scarcity, imbalanced datasets, and various standards for the dropout
definition, as follows.


4.3.1 Scarce Label for Dropout Prediction 

Multi-task training is known to show higher performance when the
labels of the target domain are scarce [34]. To verify this notion,
we evaluated multi-task training on datasets with different levels
of dropout prediction label scarcity. Specifically, we randomly
masked out both positive and negative dropout labels in different
proportions, ranging over {50%, 90%, 95%, 99%}. The results in
Table 2 show that multi-task training indeed shows higher
performance when dropout prediction labels are scarce. Since at
least some amount of labels is needed for the models to converge,
the experiment with a 99% mask rate fails to show meaningful results.


4.3.2 Imbalanced Dataset 

As we mentioned before, study session dropout prediction usually
suffers from an imbalanced dataset. In our case, the rate of the
positive label is only 4.06% of the total data. We conjecture that
our multi-task training approach is also helpful when the labels of
the dataset are extremely imbalanced. To show this, we randomly
masked out a certain proportion of positive dropout labels during
training, and evaluated the model on the same validation and test
sets as before. Note that this is different from Section 4.3.1, which
performs random masking on both positive and negative dropout
prediction labels. The proportion of random masking also ranges
over {50%, 90%, 95%, 99%}. The results are given in Table 2. They
show that multi-task training outperforms the vanilla model by a
larger margin on more imbalanced datasets.
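
The two masking schemes of Sections 4.3.1 and 4.3.2 can be sketched with a single helper. This is an illustrative sketch: the function name, the use of `None` to mark a masked label, and the assumption that masked positions are simply excluded from the dropout loss are not details given in the paper.

```python
import random
from typing import List, Optional


def mask_labels(labels: List[int], rate: float, positives_only: bool,
                seed: int = 0) -> List[Optional[int]]:
    """Randomly replace dropout labels by None with the given probability.

    With positives_only=False, every label can be masked (label scarcity).
    With positives_only=True, only positive labels can be masked, which
    makes the remaining training labels more imbalanced.
    """
    rng = random.Random(seed)
    masked: List[Optional[int]] = []
    for y in labels:
        eligible = (y == 1) or not positives_only
        masked.append(None if eligible and rng.random() < rate else y)
    return masked
```

Masking only positive labels lowers the positive label proportion roughly in proportion to the mask rate, which matches the proportions listed in the last four rows of Table 2.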


4.3.3 Definition on Dropout 

Although we define a one-hour (3600s) inactive lag time as a session
dropout, other definitions of a dropout may be utilized to better
analyze students’ learning activities. Thus, we examine how the
effect of multi-task training varies with the definition of a
session dropout. Figure 2 shows the distribution of the number of
dropout labels with respect to the inactivity duration (lag time).
We compare the results for lag time standards of a session dropout
in {600s, 1800s, 3600s, 5400s, 7200s}. The results are given in
Table 3. They show that multi-task training tends to yield larger
gains in tasks with higher inactivity duration standards for a
session dropout. This is because the tasks with higher lag time
standards have more imbalanced datasets. Since imbalanced datasets
tend to have lower AUPRC, tasks with higher lag time standards have
lower AUPRCs.


5. CONCLUSIONS 


In this paper, we proposed a multi-task training approach with
knowledge tracing to boost the performance of study session dropout
prediction. We hypothesized that the commonality between the
dropout prediction and knowledge tracing tasks would be beneficial
for predicting dropouts. We empirically validated with
Transformer-based and RNN-based models that multi-task training
enhances dropout prediction performance, especially in further
N-step dropout prediction, which is a more practical task in a real
service. Moreover, we performed extensive ablation studies to
demonstrate that multi-task training shows even better performance
in more difficult experimental settings. We leave multi-task
training with other tasks in the field of Artificial Intelligence in
Education (AIEd) as future work.
Figure 3: Overall architecture of DAS. Note that s is the starting token and
N is the number of layers.

A. ARCHITECTURE OF DAS


In this section, we review the overall architecture of Deep
Attentive Study Session Dropout Prediction (DAS) [21], which we use
as the baseline of our model. DAS is a Transformer-based model that
consists of an encoder and a decoder. The encoder includes N encoder
blocks, where each block has a multi-head self-attention layer and
a fully connected feed-forward layer. After each layer, residual
connection and layer normalization are applied. The decoder also
includes N decoder blocks, with each block including a multi-head
self-attention layer, a multi-head encoder-decoder attention layer,
and a fully connected feed-forward layer. The encoder-decoder
attention layer takes the output of the encoder as keys and values,
and the output of the self-attention layer as queries to perform the
attention mechanism. Each layer in the decoder block is also
followed by residual connection and layer normalization. The encoder
takes the sequence of question embeddings
$e = [e^{(1)}, \ldots, e^{(j)}]$ and produces the outputs
$h = [h^{(1)}, \ldots, h^{(j)}]$ that are fed into the decoder’s
encoder-decoder attention layers. The decoder takes the sequence of
response embeddings $l = [s, l^{(1)}, \ldots, l^{(j-1)}]$ and the
encoder’s outputs $h$, producing the hidden vectors which are fed
through the final linear layer to output the predicted dropout
probabilities
$\hat{y}_{DP} = [\hat{y}_{DP}^{(1)}, \ldots, \hat{y}_{DP}^{(j)}]$.
Note that $s$ is the starting token for the first position of the
sequence. The overall process of DAS can be described as:

$$h = \mathrm{Encoder}(e)$$
$$\hat{y}_{DP} = \sigma(W_{DP}\,\mathrm{Decoder}(s, l, h) + b_{DP})$$

where $s$ is the start token embedding. The overall architecture of
DAS is described in Figure 3.


We will now describe the components of each block in the encoder and
decoder. Each block mainly consists of a multi-head attention layer
and a fully connected feed-forward layer. The multi-head attention
network in each block takes queries, keys, and values of the sequence
as inputs. The queries, keys, and values of $\mathrm{head}_i$ are
computed by multiplying the weight matrices $W_i^Q$, $W_i^K$, $W_i^V$
with the inputs as follows:

$$Q_i = e_Q W_i^Q = [Q_i^{(1)}, \ldots, Q_i^{(j)}]$$
$$K_i = e_K W_i^K = [K_i^{(1)}, \ldots, K_i^{(j)}]$$
$$V_i = e_V W_i^V = [V_i^{(1)}, \ldots, V_i^{(j)}]$$

Then, multi-head attention with $h$ attention heads is computed as:

$$\mathrm{Multihead}(e_Q, e_K, e_V) = \mathrm{Concat}(\mathrm{head}_1, \ldots, \mathrm{head}_h) W^O$$

$$\text{where } \mathrm{head}_i = \mathrm{Softmax}\left(\frac{Q_i K_i^T}{\sqrt{d_k}}\right) V_i$$

$d_k$ is the dimension of $K_i$, which is incorporated for scaling.
$W^O$ is the matrix that combines the outputs from multiple attention
heads to produce the final output of the multi-head attention
mechanism. Note that multi-head self-attention uses the same inputs
to compute queries, keys and values, while multi-head encoder-decoder
attention uses outputs from the encoder as keys and values, which can
be expressed as $\mathrm{Multihead}(l, h, h)$. In order to prevent
cheating from the future time steps, subsequent masks are applied to
the attention layers. The fully connected feed-forward network
applies a linear transformation after adding non-linearity to the
outputs of the multi-head attention layer as follows:

$$\mathrm{FFN}(M) = \mathrm{ReLU}(M W_1 + b_1) W_2 + b_2$$
$$\text{where } M = \mathrm{Multihead}(e_Q, e_K, e_V)$$
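
As an illustration of a single attention head with a subsequent mask, the sketch below computes scaled dot-product attention in plain Python. It omits the projection matrices, the concatenation of heads, and the feed-forward network, and it assumes that a position may attend to itself and to all earlier positions; the function names are hypothetical.

```python
import math
from typing import List

Matrix = List[List[float]]


def softmax(xs: List[float]) -> List[float]:
    m = max(xs)  # subtract the maximum for numerical stability
    exps = [math.exp(x - m) for x in xs]
    s = sum(exps)
    return [e / s for e in exps]


def causal_attention(q: Matrix, k: Matrix, v: Matrix) -> Matrix:
    """Scaled dot-product attention where step t attends to steps <= t."""
    d_k = len(k[0])
    out: Matrix = []
    for t, q_t in enumerate(q):
        # Scores are only computed for positions up to t: the future is masked.
        scores = [sum(a * b for a, b in zip(q_t, k[s])) / math.sqrt(d_k)
                  for s in range(t + 1)]
        weights = softmax(scores)
        out.append([sum(w * v[s][c] for s, w in enumerate(weights))
                    for c in range(len(v[0]))])
    return out
```

With all-zero queries every visible position receives the same weight, so each output row is the running mean of the value vectors seen so far.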
ABSTRACT


The time that students spend on assignments, i.e. time-on- 
task, has been used frequently in prior research to under- 
stand student affect, study habits, and course performance, 
among others. The choice for how time-on-task is calculated, 
however, is typically based on available data. This data can 
be very coarse-grained, such as the timestamps from stu- 
dents’ assignment submissions. Using coarse-grained data to 
calculate time-on-task has limitations, such as not being able 
to determine whether students take breaks when working on 
an assignment. In this work, we analyze the differences be- 
tween two time-on-task metrics, one based on coarse-grained 
data—in this case, student submissions—and one based on 
fine-grained data—in this case, students’ keystrokes during 
an assignment. We compare these two metrics and exam- 
ine how well they correlate to find out whether time-on- 
task based on coarse-grained data can be an accurate metric 
for understanding the time spent by students on an assignment.
Our results show that the correlation between the two metrics,
which are supposed to measure the same underlying phenomenon
(time-on-task), is only weak to moderate. This suggests that
fine-grained data might be needed to accurately estimate
time-on-task.


Keywords 

time-on-task, fine-grained data, coarse-grained data, data 
granularity, keystroke data, programming process data, learn- 
ing analytics, educational data mining 


1. INTRODUCTION 


Time-on-task, the amount of time that a student spends
actively engaged in a task, is considered one of the most
important factors that contribute to learning and achievement
[14, 30, 32]. Measuring time-on-task focuses on identifying
active time that is spent on a task, instead of the
overall time that includes breaks and time spent on unrelated
activities. Time-on-task has been measured through
various means: student self-reports [27], stopwatches [8],
periodic observations [3], video recordings and eye movement
data [4], and learning management system log data [17].
While all of these can be considered as proxies for
time-on-task, accurately estimating time-on-task remains a
challenging problem that deserves further attention [14, 17].


In this work, we study (a) to what extent two different 
types of log data—timestamped keystroke data and times- 
tamped submission data from an introductory programming 
course—can be used to measure students’ time-on-task and 
(b) to what extent time-on-task estimates produced with 
this data represent the same phenomenon. Our work is mo- 
tivated by the need to distinguish between different types of 
data and the time-on-task estimates that can be produced 
with them. As numerous metrics have been used as proxies 
for time-on-task, if these metrics are not in line with each 
other, results from studies using them may not be compa- 
rable. That is, differences between observed results, or even 
contradictory results, could be explained to some extent by 
the difference in the chosen time-on-task metric. 


Some studies similar to ours include work by Kovanović
et al. [16] and Nguyen [23]. Kovanović et al. [16] built and
compared a range of time-on-task metrics for evaluating 
students’ performance, highlighting methodological issues. 
Nguyen [23], on the other hand, evaluated methods for iden- 
tifying off-task behavior, also correlating the resulting esti- 
mates with academic performance. While these studies have 
used click-stream or event data from learning management 
systems such as Moodle, the data in our study comes from 
an introductory programming course where work on pro- 
gramming assignments is logged keystroke by keystroke. 


This article is structured as follows. In Section 2, we discuss 
related time-on-task studies, starting with an overview of 
earlier studies on time-on-task and time-on-task estimates 
within learning programming, with a brief outline of studies 
that have analyzed different time-on-task estimates. We de- 
scribe our context, data, research questions, and metrics in 
Section 3, and outline the analyses and results in Section 4. 
We discuss our findings and outline future work in Section 5. 


2. RELATED WORK 


2.1 Measuring Time-on-Task 
Early work with time-on-task often involved on-site obser- 
vations (e.g. in classrooms) where coders manually recorded 
and/or timed behaviors based on a coding rubric of on-task
behaviors, also linking teacher’s behavior with students’ be- 
havior (e.g. [2,10,15]). Over time, technology advancement 
led to the now-prevalent practice of mining user interac- 
tion logs from educational software used in classrooms or 
technology-augmented learning activities. This has made 
the analysis of user logs a vital component of more recent 
learning and behavior studies, such as in predicting help- 
seeking behavior [5] or assessing performance [9]. Time-on- 
task studies have likewise turned to this direction. Some 
examples (proxies for time-on-task in parentheses) include: 
analyzing the relationship of gamified elements to time-on- 
task (number of edits) [18] and comparing the impact of 
different course interventions to time-on-task (online inter- 
actions with peers and accessing course materials) [25]. Of 
note, both in earlier and more recent time-on-task studies, 
are the different measures or proxies used for time-on-task, 
and the methods used for identifying or approximating off- 
task activity and breaks. These are key factors that we 
explore in our comparison of coarse- and fine-grained time- 
on-task metrics. 


In research focused on time-on-task in learning to program, 
a conventional approach has been to log user interactions 
within integrated development environments (see e.g. [13, 
21, 22,24, 29]). For example, Jadud [12] used the BlueJ IDE 
to capture code “snapshots” (copies of source code) when- 
ever students compiled their programs, including compiler- 
reported errors and related metadata. Rodrigo et al. used 
BlueJ logs in combination with student surveys and observa- 
tions to explore relationships between novice programmers’ 
achievement, debugging, and syntax errors [26]. 


Submission data has been used to estimate students’ total 
elapsed time, the total time between a student’s first submis- 
sion and last submission. Edwards et al. noted that the dif- 
ference in total elapsed time between high- and low-scoring 
students is only small [6]. Similarly, the time between com- 
pilation events has been studied previously; Jadud observed 
that students are likely to recompile quickly after encounter- 
ing a syntax error, but spend more time working on code af- 
ter a successful compilation [11]. Definitions of work sessions 
also differ between studies. For example, Fenwick et al. [7] 
considered a “work session” terminated when no events were 
logged for 60 minutes. While the previous examples demon- 
strate the use of time from snapshots for estimating time- 
on-task, other studies in programming have explored using 
event counts (similar to other fields) for building predictive 
models of student achievement. For example, Ahadi et al. [1] 
used assignment-specific log data that included the number 
of “steps” that students took to solve each assignment for 
predicting course outcomes. 


The time-on-task metrics in these (and other studies, e.g. [20, 
28, 31]), however, suffer from similar problems of failing to 
capture the nuances around actual working time, as even 
the “work sessions” may fail to account for when and how 
students are working offline. 


2.2 Analyzing Time-on-Task Estimates 

Variations in time-on-task measures across studies and research
instruments make it difficult to interpret and compare
findings, and bring into question whether or not the
different metrics are indeed measuring or evaluating similar
constructs. Some researchers have begun to explore this by
looking at the different ways that researchers estimate
time-on-task and analyzing how these estimation choices impact
conclusions drawn from these measures.


Kovanović et al. [16], for example, looked at different
time-on-task estimates from learning management system data
and examined the impacts of these across courses from dif- 
ferent subject domains. Their findings suggest that strate- 
gies for time-on-task estimation can have significant effects 
on learning analytics models of student performance. Using
data collected from an introductory programming course,
Leinonen et al. [19] examined how a family of
time-on-task-related metrics, such as self-reported study time,
log-based time spent on assignments, and event counts,
correlated with each other as well as with course exam
outcomes. They noted that while similar metrics such as edit
counts and event counts tended to have higher correlations,
exam scores were not strongly correlated with any of the
metrics, except for the number of completed assignments.


While Leinonen et al. [19] did not analyze the impact of
different break durations when estimating time spent on
assignments, different break durations have been studied by both
Kovanović et al. [16] and Nguyen [23]. Kovanović et al. and
Nguyen both used time-on-task estimates based on timestamp
differences between two subsequent events in learning
management systems and highlight the importance of a good
time-on-task estimation strategy.


Our work builds on this prior work by looking into data from 
an introductory programming course, where each keystroke 
associated with a course assignment was recorded and times- 
tamped. Using this fine-grained log data, we study the im- 
pact of different thresholds for measuring off-task behav- 
ior, contrasting the keystroke data with submission-based 
data more commonly used in studies focusing on academic 
achievement in learning programming. 


3. METHODOLOGY 
3.1 Context and Data 


The data for our study comes from a 7-week introductory 
programming course offered at a research first university in 
Europe. The workload of the course is 5 ECTS, which cor- 
responds to roughly 100 to 125 study hours. In the course, 
students learn the basics of procedural and object-oriented 
programming in Java. The course uses a many small assign- 
ments approach, where many of the course assignments are 
small, but combine to form larger programs. After working 
on small assignments, students are given larger assignments 
as well, where they practice the content and constructs that 
they have learned earlier. 


In total, the course had 147 programming assignments. The 
programming assignments are worked on in an integrated 
development environment (IDE), that logs keystroke data 
for plagiarism detection and research purposes. On each 
keystroke, the IDE collects the current timestamp and the 
modification to the source code of the assignment that the 
student is currently working on. Keystroke data is gathered 
only from course assignments. Additionally, information on 
when students submit their assignments is collected. 
Students were informed about the data gathering on the
course; our analyses included data from 137 students who 
consented to the use of their data for research purposes and 
who completed at least 10 assignments in the course. 


3.2 Research Questions and Metrics 
Our research questions are as follows: 


RQ1. How do fine- and coarse-grained time-on-task metrics 
differ in terms of measuring time-on-task? 


RQ2. Are there differences (a) between students and (b) be- 
tween assignments on how well coarse-grained time- 
on-task correlates with fine-grained time-on-task? 


In this study, we compare two different metrics for
time-on-task that we call coarse-grained time-on-task and
fine-grained time-on-task. The metrics are calculated for each
student for each exercise they attempted and submitted.


Coarse-grained time-on-task is calculated as the difference
between the timestamp of the first submission and the first
keystroke event for that assignment. We used the first
submission instead of the last submission since some students
resubmitted assignments that they had previously completed
“just in case” right before the deadline. However, the choice
of the first versus the last submission does not affect the
results considerably: in 95% of the cases, students only had a
single submission for each assignment.


Fine-grained time-on-task was calculated by computing the 
differences between keystroke timestamps in the data until 
the first submission of the assignment while ignoring any 
differences that were greater than a break threshold that is 
used to approximate off-task behavior or “outliers”. Differ- 
ent values for the break threshold are explored and reported. 


The key difference between the two metrics is that the fine- 
grained time-on-task takes into account the breaks that stu- 
dents take while working on assignments, whereas the coarse- 
grained time-on-task does not. If the break threshold is ar- 
bitrarily large, no breaks are removed when computing the 
fine-grained time-on-task, and the two metrics are identical. 
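
The two metrics can be sketched in Python as follows. The sketch assumes timestamps in seconds, sorted in ascending order, and it treats the first submission as the final event of the sequence so that the two metrics coincide for an arbitrarily large break threshold; the function names and this treatment of the submission event are illustrative assumptions.

```python
from typing import List


def coarse_time_on_task(keystrokes: List[float],
                        first_submission: float) -> float:
    """Time from the first keystroke to the first submission."""
    return first_submission - keystrokes[0]


def fine_time_on_task(keystrokes: List[float], first_submission: float,
                      break_threshold: float) -> float:
    """Sum of gaps between consecutive events, ignoring gaps above the threshold.

    Only keystrokes up to the first submission are used, and the submission
    itself is appended as the last event (an assumption of this sketch).
    """
    events = [t for t in keystrokes if t <= first_submission]
    events.append(first_submission)
    total = 0.0
    for prev, cur in zip(events, events[1:]):
        gap = cur - prev
        if gap <= break_threshold:
            total += gap
    return total
```

For keystrokes at 0, 10, 20, 1000 and 1010 seconds and a submission at 1020 seconds, the coarse-grained metric is 1020 seconds, whereas the fine-grained metric with a 600-second threshold is 40 seconds because the 980-second gap is treated as a break.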


4. ANALYSES AND RESULTS 


4.1 Differences Between Time-on-Task Metrics 


To answer RQ1, we first analyzed different break thresh- 
old values to examine how different thresholds affect the 
number of distinct study sessions in the data. We define a 
distinct study session as any sequence of snapshots for an 
assignment between breaks in the data, where what is con- 
sidered as a break depends on the break threshold. We then 
examined how the choice of break threshold affects the cor- 
relation between the coarse- and fine-grained time-on-task 
metrics across the whole data set. The strength of the cor- 
relation between the metrics can signal whether the metrics 
are measuring the same phenomenon, i.e. time-on-task. 


Figure 1 (Appendix) shows how the break threshold for the
fine-grained time-on-task affects the number of distinct study
sessions for thresholds between 30 seconds and 1200 seconds
(i.e. 20 minutes). We see that a very low threshold (e.g.
anything under 100 seconds) results in a very high number of
study sessions compared to a higher break threshold (e.g.
anything over 600 seconds, i.e. 10 minutes). The figure only
shows the number of sessions up to a break threshold of 20
minutes since at that point, the decrease in the number of
sessions is very small. What this essentially illustrates is
that if a student takes a short break of under 200 seconds or
so, they are quite likely to return to the task, but if the
break is longer (e.g. over 10 minutes), they are not likely to
return to the task soon. Based on this, in our data, a break
threshold of around 600 seconds would seem reasonable, as at
that point the rate of decrease plateaus.
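
Counting study sessions for a given break threshold amounts to counting the gaps that exceed the threshold and adding one. The sketch below illustrates this under the assumption that the events of one student on one assignment are available as timestamps in seconds; the function name and the sample timestamps are hypothetical.

```python
from typing import Sequence


def count_sessions(timestamps: Sequence[float], break_threshold: float) -> int:
    """Number of study sessions: one more than the number of gaps above the threshold."""
    events = sorted(timestamps)
    if not events:
        return 0
    breaks = sum(
        1
        for earlier, later in zip(events, events[1:])
        if later - earlier > break_threshold
    )
    return breaks + 1


# Hypothetical event timestamps (seconds) for one student on one assignment.
timestamps = [0, 20, 60, 400, 420, 2000]
sessions_by_threshold = {
    threshold: count_sessions(timestamps, threshold)
    for threshold in (30, 600, 1200, 2000)
}
```

As in Figure 1, the session count can only stay the same or fall as the threshold grows, because fewer gaps qualify as breaks.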


Figure 2 (Appendix) shows Pearson's correlation coefficient
between the coarse- and fine-grained time-on-task metrics for
break thresholds between 30 seconds and 1200 seconds (20
minutes). We first note that for all the thresholds visualized
in Figure 2, the correlation is weak, varying between 0.33 and
0.37. The figure shows that the correlation increases slightly
as the break threshold gets bigger, but similar to the number
of study sessions, the rate of increase seems to plateau at
around the 600 second (10 minute) mark. The correlation does
continue increasing beyond what is visualized in the figure and
eventually, at around 13 days, it reaches 1, where the fine-
and coarse-grained time-on-task metrics are equal. This means
that the longest break within a single assignment was around
13 days.
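
The threshold sweep can be illustrated as follows. This is a sketch under stated assumptions rather than the analysis code of the study: each attempt is represented by its list of gaps (in seconds) between consecutive events, the coarse-grained value is taken as the sum of all gaps, and the data and function names are hypothetical. It shows why the correlation must reach 1 once the threshold is at least as large as the longest gap in the data.

```python
from math import sqrt
from typing import Dict, List, Sequence


def pearson(xs: Sequence[float], ys: Sequence[float]) -> float:
    """Pearson's correlation coefficient (assumes both inputs have nonzero variance)."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / sqrt(sxx * syy)


def correlation_by_threshold(
    gaps_per_attempt: List[List[float]], thresholds: Sequence[float]
) -> Dict[float, float]:
    """Correlation between coarse and fine time-on-task for each break threshold."""
    coarse = [sum(gaps) for gaps in gaps_per_attempt]
    result = {}
    for threshold in thresholds:
        fine = [sum(g for g in gaps if g <= threshold) for gaps in gaps_per_attempt]
        result[threshold] = pearson(coarse, fine)
    return result


# Hypothetical gaps (seconds) for four attempts.
gaps_per_attempt = [[10, 20, 900], [5, 5, 5], [30, 2000, 40], [15, 700, 15, 15]]
longest_gap = max(max(gaps) for gaps in gaps_per_attempt)
sweep = correlation_by_threshold(gaps_per_attempt, [30, 600, 1200, longest_gap])
```

Once the threshold equals the longest gap, no gap is removed, the two metrics are identical for every attempt, and the correlation is 1.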


4.2 Student and Assignment-Specific Correlations Between Time-on-Task Metrics

To answer RQ2a, we first calculated both time-on-task met- 
rics for each student for each assignment they submitted. We 
then calculated the correlation between the metrics for each 
student separately, which leaves us with a single correlation 
per student. We examine the distribution of these correla- 
tions to understand if there are differences between students 
on how much the fine- and coarse-grained time-on-task met- 
rics correlate. To answer RQ2b, we calculated the correla- 
tion between the coarse- and fine-grained metrics for each 
assignment separately, leaving us with a single correlation 
per assignment. Similar to RQ2a, we study the distribution 
of these correlations to see if there are assignment-specific 
differences in how well the two metrics correlate. 
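
The grouping step for RQ2a and RQ2b can be sketched as below. The row layout (student id, assignment id, coarse value, fine value), the function names, the minimum of three data points per group, and the sample rows are assumptions made for illustration only; the source does not describe how small or constant groups were handled.

```python
from collections import defaultdict
from math import sqrt
from typing import Dict, Hashable, List, Optional, Sequence, Tuple

Row = Tuple[str, str, float, float]  # (student, assignment, coarse, fine)


def pearson(xs: Sequence[float], ys: Sequence[float]) -> Optional[float]:
    """Pearson's r, or None when either variable has zero variance."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    if sxx == 0 or syy == 0:
        return None
    return sxy / sqrt(sxx * syy)


def correlations_by(
    rows: List[Row], key_index: int, min_points: int = 3
) -> Dict[Hashable, float]:
    """One correlation per group; key_index 0 groups by student, 1 by assignment."""
    groups = defaultdict(list)
    for row in rows:
        groups[row[key_index]].append((row[2], row[3]))
    result = {}
    for key, pairs in groups.items():
        if len(pairs) < min_points:  # assumption: skip groups that are too small
            continue
        r = pearson([c for c, _ in pairs], [f for _, f in pairs])
        if r is not None:
            result[key] = r
    return result


# Hypothetical data: s1 works without long breaks, s2 takes longer breaks on longer tasks.
rows = [
    ("s1", "a1", 100, 90), ("s1", "a2", 200, 180), ("s1", "a3", 300, 270),
    ("s2", "a1", 100, 50), ("s2", "a2", 200, 40), ("s2", "a3", 300, 30),
]
per_student = correlations_by(rows, key_index=0)
per_assignment = correlations_by(rows, key_index=1)
```

The resulting per-student and per-assignment values are what the distributions in Figures 3 and 4 summarize; with only two students in this toy data set, every assignment group falls below the minimum size and is skipped.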


For analyzing student and assignment-specific differences in 
how well the coarse- and fine-grained time-on-task metrics 
correlate, we used a break threshold of 600 seconds (i.e. 10 
minutes) for the fine-grained time-on-task metric. We chose 
600 seconds as the results for RQ1 showed that in our data, 
600 seconds seems like a reasonable value to consider a stu- 
dent being on a break (Section 4.1). 


Figure 3, in Appendix, shows the distribution of the cor- 
relations between the coarse- and fine-grained time-on-task 
metrics for individual students. The mean correlation is 0.47 
with a standard deviation of 0.24 and the 95% confidence in- 
terval is 0.43 to 0.51. We notice from the figure that there 
are differences between students in how well the coarse- and 
fine-grained time-on-task metrics match each other. On av- 
erage, the correlation seems moderate, with most students 
having a correlation between 0.2 and 0.6. 


Figure 4, in Appendix, shows the distribution of the
correlations between the coarse- and fine-grained time-on-task
metrics for individual assignments. The mean correlation is
0.33 with a standard deviation of 0.21 and the 95% confidence
interval is 0.30 to 0.36. We notice from the figure that
similar to students, there are also differences between
assignments. Compared to the between-students analysis (RQ2a),
the assignment distribution is slightly more centered around
the mean. Similar to the between-student analysis, the
correlations for the assignments are also, on average,
moderate.


5. DISCUSSION 


5.1 Coarse- vs Fine-Grained Time-on-Task 
We observed that our coarse-grained time-on-task metric 
poorly approximated our fine-grained time-on-task metric. 
The coarse-grained metric imitates metrics from earlier work 
where time-on-task has been calculated based on, for exam- 
ple, students’ first and last submissions for an assignment [6], 
while the fine-grained time-on-task metric is somewhat sim- 
ilar to earlier works that utilized LMS trace data [16], al- 
though considerably more fine-grained. 


We propose that the fine-grained metric explored in this
work is a better metric for measuring time-on-task than a
metric that relies on coarse-grained data but removes outliers
to keep time-on-task values meaningful. Prior work has
suggested, for example, that large values are simply ignored
[16]. However, if we rely on removing outliers, we are bound to
include inaccurate data that the outlier detection did not
catch. For example, if two students both have a time-on-task
estimate of two hours with a coarse-grained time-on-task
metric, it is possible that one of them worked for ten minutes,
while the other worked for the full 120 minutes. In this case,
the actual time-on-task is drastically different, but the
coarse-grained time-on-task estimate would be the same for
both.


One downside of the fine-grained time-on-task metric is that
it requires a break threshold to calculate time-on-task.
Deciding on a good break threshold is not straightforward, and
is most likely context-dependent. This work is not the first
to note this issue: for example, both Nguyen [23] and
Kovanović et al. [16] examined different cut-offs for outlier
detection, which is similar to our work in examining different
break thresholds.


5.2 Student- and Assignment-Specific Correlations

We identified student- and assignment-specific differences in 
how well coarse- and fine-grained time-on-task metrics cor- 
relate. This makes sense since the main difference between 
the metrics is that the fine-grained metric takes the breaks 
students take into account; thus, if a student does not take 
many breaks while working on assignments, the difference 
between the two time-on-task metrics will not be significant 
compared to a student who takes long breaks within single 
assignments. Here, factors such as possible previous pro- 
gramming experience and study fatigue may come into play 
and should be analyzed in future work. 


Similarly, we found that there are differences between
assignments in how much the two metrics correlate. Since the
course has many small assignments, but also some bigger,
more complex assignments, it makes sense that, for example,
students might take more breaks during the bigger assignments
compared to the smaller ones, which would have an effect on
the correlation between the two metrics.


5.3 Conclusion and Future Work

In this work, we studied how two different time-on-task met- 
rics built from programming log data correlate with each 
other. One of the metrics utilizes fine-grained keystroke data 
and takes the breaks students take during assignments into 
account by not including the breaks in its time-on-task esti- 
mate. The other time-on-task metric is more coarse-grained 
and includes any breaks students take during assignments 
in its time-on-task estimate. 


Our results show that the correlation between the two met- 
rics is at best moderate, which suggests that the choice of 
time-on-task metric can significantly impact the results of 
studies based on time-on-task analysis. This brings into 
question whether previous results that have used different 
metrics for measuring time-on-task are comparable with one 
another. Additionally, our results show that, at least in our 
context, there are also student- and assignment-specific dif- 
ferences in how much the two metrics correlate. 


We acknowledge that we do not have a ground truth for time 
on task, i.e., both our metrics are only proxies. As part of 
our future work, we are looking into augmenting keystroke 
data from the programming environment with log data from 
other learning environments and self-reported time-on-task 
estimates. Similarly, in this work, we examined different 
break thresholds over all the data when identifying a break 
threshold; in future work, we will be looking at to what 
extent optimal break thresholds vary between students. We 
also acknowledge that we did not analyze how time-on-task 
relates to course outcomes, which has often been included in 
time-on-task studies (e.g. [16, 19, 23]). In the future, we will 
also be looking into how the studied metrics and different 
break thresholds relate to course performance. 


Figure 1: Number of distinct study sessions with different
thresholds for breaks for the fine-grained time-on-task
metric. The x-axis is the threshold for considering the
student to be on a break in seconds. The y-axis is the number
of study sessions in the data. Data is shown for thresholds
between 30 seconds and 1200 seconds (20 minutes).

Figure 2: The correlation between the coarse- and fine-grained
time-on-task metrics with different thresholds for breaks for
the fine-grained time-on-task metric. The x-axis is the
threshold for considering the student to be on a break in
seconds. The y-axis is the Pearson correlation coefficient
between the coarse- and fine-grained time-on-task metrics.
Data is shown for thresholds between 30 seconds and 1200
seconds (20 minutes).

Figure 3: The distribution of student-specific correlations
between the coarse- and fine-grained time-on-task metrics,
where fine-grained time-on-task was calculated with a 600
second break threshold. The x-axis is the Pearson correlation
coefficient and the y-axis is the density.

Figure 4: The distribution of assignment-specific correlations
between the coarse- and fine-grained time-on-task metrics,
where fine-grained time-on-task was calculated with a 600
second break threshold. The x-axis is the Pearson correlation
coefficient and the y-axis is the density.


ABSTRACT 


Learning analytics (LA) is the collection, processing, and
visualization of big data to optimize learning. This article aims to
interpret the impact of analyzing learning data for tertiary
education. The article describes a semester-long mixed-methods
study of 63 students enrolled in a Greek technical university
laboratory course, retrieving data from the learning management
system (LMS). We applied minimal LA guidance in the experimental
group and no LA guidance in the control group. The research
questions are as follows: Can a student-facing learning analytics
approach at minimal level guidance improve students' LMS access
and learning performance levels? Are the students' LMS access,
discussion forums, and submitted assignments critical predictors
for students' course grades? What are students' opinions about
learning analytics as a tool for data-driven decision-making
strategy? The study followed the do-analyze-change-reflect LA
model. The data collected included students' time spent on LMS,
exercises, and discussion posts, while the dependent variable was
the course grade. Results indicate that applying LA increased the
students' LMS access and satisfaction but not their final grade.
Future research could apply higher effort interventions and
stronger teacher guidance to provide insights into student
performance, engagement, and satisfaction.


Keywords 


Student-facing learning analytics, Performance, Satisfaction, 
Teacher guidance, Post-secondary education. 


1. INTRODUCTION 


Learning analytics is a multidisciplinary field between computer 
science and education that fosters the learning process based on 
big data monitoring [10]. In [29], the authors defined LA as the 
measurement, analysis and reporting of data about learners and 
their contexts, for purposes of optimizing learning and the 
environments in which it occurs. Furthermore, the LA tasks are a 
set of handy tools to collect and analyze the data accumulated in a 
smart classroom for data-based decision-making [1].
Consequently, without analytics, instructors cannot provide 
guidance at appropriate times when students encounter difficulties 
[11]. In parallel, institutions have embedded LA techniques to 
enhance retention rates, use resources effectively, and increase 
students' engagement, satisfaction, and motivation [26]. 


The authors conducted this mixed-methods study with the 
research objective of mapping student-facing learning analytics 
(LA) in real tertiary educational settings. The article is organized
as follows: (1) we conduct a short literature review, (2) we explain 
how the research questions were formulated, (3) we illustrate the 
design and results of the experiment, and (4) we present the 
discussion and conclusions reached. 


2. RELATED WORK 

Student-facing LA is a subfield of LA and focuses on the 
reporting phase, such as LA dashboards, educational
recommender and feedback systems [5, 6]. It is challenging to 
show students a dashboard or automated emailing systems and 
conduct surveys to extract usage insights [20]. According to [20], 
a well-established student-facing LA system consists of four 
learning design phases (do-analyze-change-reflect). To provide a 
theoretical framework and extract the research questions, we 
conducted a short literature review about student-facing LA. The 
studies can be classified in terms of (1) improvement of 
performance, (2) prediction of student course grade, (3) 
improvement of LMS access, and (4) student opinions and 
satisfaction of LA. 


A series of studies [23, 30] have explored the idea of
student-facing LA improving levels of performance. Students' final
marks could determine the assessment of their academic achievement
[19, 33]. In contrast, academic performance and attainment are not
necessarily related to student access behavior [17]. Nevertheless,
we argue that only a few studies examine, under LA interventions,
the correlations between LMS use, the number of submitted
assignments, and forum posts as metrics for performance.
Furthermore, more research is needed to examine whether low-effort
LA interventions could positively affect students' performance.
After all, explaining students' learning performance is a
continual research question.


LA predictive modeling is a core practice of scholars focusing on 
student success [22]. In [18], a data mining process constructs 
variables that reflect the theoretical evidence and measure a 
prediction model's accuracy. In addition, [31] presented a 
prediction model for failure-prone students using neural networks 
techniques. These studies emphasize that student-performance 
prediction is a dominant research domain. Despite the above 
studies, we argue that building a predictive model for students’ 
performance based on critical predictors such as LMS 
participation and submitted assignments is an interesting research 
question. 


Engagement can substantially impact students' performance [4, 
14]. In [6], the authors have explored the idea of student-facing 
LA, improving levels of engagement. They have indicated that 
academic engagement is a multi-dimensional construct and refers 
to students' level of involvement [8, 15]. However, we argue that 
not many studies examine the effect of student-facing LA 
interventions on students’ level of engagement. 


Targeted studies exist on students' opinions of LA [20, 28]. The
study in [32] empirically explored the effects of a mobile LA tool
on student satisfaction. Nevertheless, we consider that students'
opinions before and after LA interventions need further research
and could yield valuable insights concerning students'
satisfaction and expectations. The survey results will either
confirm or contradict the existing ones.


Drawing upon the findings of the above studies, our purpose is to
investigate the open issues. It would be meaningful to know
whether subjects in the feedback conditions gain learning benefits
such as performance and satisfaction. In parallel, integrating the
LA concepts into tertiary classroom practice has been slow [12].
This article replicates similar research and aims to interpret
the impact of analyzing learning data for higher education
institutions (HEI).


2.1 Research Questions 
Within this context, the current experimentation study poses the 
following research questions: 


1. Can a student-facing learning analytics approach at minimal 
level guidance improve students' LMS access and learning 
performance levels? 


2. Are the students' LMS access, discussion forums, and submitted 
assignments, critical predictors for students' course grades? 


3. What are students' opinions about learning analytics as a tool 
for data-driven decision-making strategy? 


3. METHOD 


3.1 Participants and Context 

This study took place in the authentic context of a sixth-semester,
13-week undergraduate laboratory course, "digital signal
processing" (DSP), at a Greek HEI computer science department
between February 2018 and June 2018. The reason for selecting this
particular blended course was the high dropout and failure rate in
the past exams. This study focused on 31 students as an
experimental group receiving the LA intervention ("treatment")
with minimal teacher guidance, tested for comparison purposes.
Participants were 26% female. The control group had 32 students
who received no particular LA intervention. The instructor had
two-hour lectures and face-to-face meetings/office hours on
Mondays every week with the students.


An overview of the LA tool that students used follows. The Open
eClass platform is an open-source LMS developed by the non-profit
civil company "Greek Universities Network" (GUNET)
(https://www.gunet.gr/en/). The platform's main features are:
management of electronic courses and educational content; student
management; and information, communication, collaboration,
evaluation, and feedback tools. The structure of the course was as
follows:


Week 1. Module 1: The nature of DSP was explored. To ensure 
transparency and institution-wide adoption [34], we informed the 
department principal in detail about the experiment, after which 
she enthusiastically gave her consent. It was then defined what 
types of data should be tracked and that the feedback (dashboard 
and messages) would be intended for students. 


Week 2. Module 2: Fundamental signals. The first coding exercise 
was performed in addition to weekly discussion threads and office 
hours. We gave a detailed description of which student-facing LA 
will be used and how students will utilize them. 


Week 3. Module 3: Digital signal sampling. For usability testing, 
the students described their initial experience of using LA. The 
students were surprised, as many claimed that it was impossible to 
support concepts such as monitoring, analyzing, and feedback. 


Week 4. The first quiz assignment and second coding exercise 
took place. The instructor contributed to the discussion forum to 
give a sense of learning community. We provided verbal 
encouragement for students to access their statistics and figures 
via the LA tool to reflect and meditate. 


Week 5. Module 4: Fourier transformation principles. The second
quiz assignment and third coding exercise took place. We
discussed the self-reflection and meditation process.


Week 6. Active intervention and feedback with personalized
messages containing the grades of the students' assignments,
recommendations, and comparisons of their performance with
aggregated data (e.g., participation in discussions and submission
of assignments). The encouraging wording of the messages was
designed to benefit pedagogically and not harm the student. For
instance, "do you need some support?" or "you could participate
more in the discussion forum." We provided personalized
feedback with visualizations for tracking students' learning
progress.


Week 8 and 9. Module 5: Digital filters. Provide in-class feedback 
(figure 1), recommendations, and scheduling for personalized 
scaffolding. Verbal suggestions informed students about what to 
do based on analytics. 


Week 10 and 11. The third quiz and an exercise took place. We 
provided in-class information about absences, participation, and 
homework. Students received personalized messages with 
visualizations of their learning progress for mirroring, self- 
reflection, and motivation. 


Week 12 and 13. A revision session and a collaborative quiz were
conducted in addition to weekly monitoring and analysis. We used
a think-aloud protocol to understand how students reclaimed
feedback. A final questionnaire took place.


Week 14. The final examination was conducted.


[Figure 1 image: number of visits and duration of LMS use per week]

Figure 1. Personalized feedback with visualizations for
mirroring.


3.2 Research Design 

A mixed quantitative and qualitative method case study was 
utilized to provide an instantiation of the LA framework with a 
description of the methodology for others to use a similar process. 
To answer the first and second research questions, we conducted a 
causal research design. 


3.3 Measures and Instruments
The targets of our intervention are students' LMS access,
performance, and learning satisfaction. The objects of analysis
are discussion forum use, the number of exercise submissions, and
LMS access; all variables are numeric.


Performance: Performance is measured by a simple dependent
variable, the final course grade, which has convenient properties
for causal and statistical analysis. The grading system of the
final exam is as follows: Scale: 0.00 to 10.00 / Pass: 5.00
(excellent: 8.50 to 10.00 / very good: 6.50 to 8.49 / good: 5.00
to 6.49). The independent scale variables that we considered, and
their definitions, were: "discussions", which counts the number
of posts per student and thus involvement in the LMS discussion
forum; "exercises", which counts the accumulated submitted
assignments; and "hoursonlms", which counts, in hours, students'
LMS access. Each weekly assignment comprised true/false,
multiple-choice, and open-ended questions as well as coding
exercises.


LMS access: Student engagement is a construct that is complicated
to measure but vital for students' success, and it encompasses
more than participation, motivation, and self-regulation [21].
We therefore use the time students spend on the LMS as an
indicator of student engagement.


Satisfaction: The instruments that we used to collect student 
opinion data are two student opinion questionnaires. An 
individual questionnaire was administered at the beginning of the 
course and another at the end. The questionnaire data incorporated 
participants' reflections on the activity and helped us to collect 
qualitative data about their opinions as an evaluation. 


3.4 Data Collection and Analysis 

The level of significance was set at 0.05. Graphically, we
examined the same assumptions and checked that there were no
outliers. Some visualizations (i.e., dot plot, histogram, and
boxplot for density, skewness, and variability) were produced.
Finally, data processing and analysis were carried out in the
SPSS 25.0 statistical application.


4. RESULTS 

We first applied normality (Shapiro-Wilk) and homogeneity of
variance (Levene) tests to the available data. The results
(p > 0.05) were not statistically significant, suggesting that
the sample data come from normal distributions and from
populations with the same variance, and are therefore appropriate
for parametric test analysis.


4.1 Research Question 1 


Hypothesis 1: The performance, as measured by the course grade 
of the experimental group, is not statistically significantly 
different from that of the control group. 


The mean score of the experimental group (M = 6.08, SD = 2.62)
was slightly higher than that of the control group (M = 5.49, SD =
1.60). However, the independent samples t-test comparing course
grades between the groups revealed no statistically significant
difference (t = 1.077, p = 0.287) (Table 2). Overall, we failed to
reject the null hypothesis.


Table 2. The t-test results of the experimental and control 
groups for performance 


Group          N    Mean    SD     t       p
Experimental   31   6.08    2.62   1.077   0.287
Control        32   5.49    1.60


Hypothesis 2: The experimental group's LMS access (in hours) is
not statistically significantly different from that of the control
group.


Table 3 shows that the mean of the overall LMS access for the
experimental group was higher than that of the control group. The
independent samples t-test comparing LMS use between the
control and experimental groups revealed statistically significant
differences (t = 4.610, p < 0.001). Overall, the null hypothesis is
rejected.


Table 3. The t-test results for LMS access


Group          N    Mean    SD     t       p
Experimental   31   10.03   7.79   4.610   0.000
Control        32   3.41    1.84
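
The t statistics in Tables 2 and 3 can be approximated from the reported group summaries alone. The following is a minimal sketch, not the SPSS procedure used in the study: it computes both the pooled-variance (Student) and the unequal-variance (Welch) statistic, because the text does not state which variant was reported. The function names are illustrative, and small deviations from the published values are expected because the means and standard deviations are rounded.

```python
from math import sqrt


def pooled_t(m1: float, s1: float, n1: int, m2: float, s2: float, n2: int) -> float:
    """Student's t statistic from summary statistics, assuming equal variances."""
    pooled_var = ((n1 - 1) * s1**2 + (n2 - 1) * s2**2) / (n1 + n2 - 2)
    return (m1 - m2) / sqrt(pooled_var * (1 / n1 + 1 / n2))


def welch_t(m1: float, s1: float, n1: int, m2: float, s2: float, n2: int) -> float:
    """Welch's t statistic from summary statistics, without assuming equal variances."""
    return (m1 - m2) / sqrt(s1**2 / n1 + s2**2 / n2)


# Summary statistics as reported in Table 2 (course grade) and Table 3 (LMS hours).
grade = (6.08, 2.62, 31, 5.49, 1.60, 32)
lms_hours = (10.03, 7.79, 31, 3.41, 1.84, 32)

grade_pooled, grade_welch = pooled_t(*grade), welch_t(*grade)
lms_pooled, lms_welch = pooled_t(*lms_hours), welch_t(*lms_hours)
```

With these rounded inputs, both variants land within about 0.07 of the reported values (t = 1.077 for grades and t = 4.610 for LMS access).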


4.2 Research Question 2 


Focusing on the experimental group, we examined the Pearson 
correlations (Table 4), extracting that the submitted exercises 
("exercises") are highly positively correlated with the final course 
grade ("finalgrade"). Also, time spent on LMS ("hoursonlms") is 
weakly positively correlated with the final course grade. However, 
there is a tendency but no statistically significant correlation 
between forum posts ("discussions") and the final course grade. 


Afterward, a simple regression analysis was conducted with
"exercises" as the predictor of the final grade. The ANOVA check
rejected the hypothesis of no regression (F = 18.156, p < 0.001).
To evaluate this regression model, the Pearson correlation
coefficient (Table 4) (R = 0.620, p < 0.001) reflects the
predictor importance; thus, we extracted a good predictor. The
model accuracy (quality) is 61.1%, and the coefficient of
determination (R-squared = 0.385 < 0.5) is considered a low
effect size. Finally, the model's equation is y = 1.002*x + 3.954
(y: final grade, x: exercises).


Then, a simple regression analysis was conducted with
"hoursonlms" as the predictor of the final grade. The Pearson
correlation coefficient (Table 4) (R = 0.392, p = 0.015) reflects
the predictor importance. Thus, we extracted a weak predictor
with a coefficient of determination (R-squared = 0.154 < 0.3)
that is considered a weak effect size. Furthermore, we observe a
high correlation (R = 0.749, p < 0.001) between the above two
predictors. As a result, we decided that there is no need to
conduct a multiple regression analysis.
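
For a simple regression with one predictor, the coefficient of determination is the square of the Pearson correlation, which links the R and R-squared values reported above. The sketch below applies this relation and the reported model equation; the function names are illustrative, and the prediction is not clamped to the 0.00 to 10.00 grading scale.

```python
def predict_final_grade(exercises: float) -> float:
    """Reported model equation: y = 1.002 * x + 3.954 (x: submitted exercises)."""
    return 1.002 * exercises + 3.954


def r_squared_from_r(r: float) -> float:
    """In simple linear regression, R-squared equals the squared Pearson correlation."""
    return r * r


r2_exercises = r_squared_from_r(0.620)   # reported R-squared: 0.385
r2_hoursonlms = r_squared_from_r(0.392)  # reported R-squared: 0.154
grade_for_three_exercises = predict_final_grade(3)
```

The squared correlations agree with the reported R-squared values up to rounding of R.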


Table 4. Pearson correlations, with Sig. (two-tailed) in parentheses*


             finalgrade   hoursonlms      exercises       discussions
finalgrade   1.000        0.392 (0.015)   0.620 (0.000)   0.284 (0.061)
hoursonlms                1.000           0.749 (0.000)   0.525 (0.001)
exercises                                 1.000           0.314 (0.043)


*Significant difference at the 0.05 level 


4.3 Research Question 3 

We applied two questionnaires to address the third research 
question (What are students’ opinions about learning analytics as a 
tool for data-driven decision-making strategy?). Twenty-three 
students submitted the first questionnaire (appendix), and the 
purpose was to determine as a baseline their prior knowledge of 
the LA field. Fifteen students submitted the final questionnaire. 
The purpose was to determine their thoughts about the LA 
experience and their overall satisfaction and acceptance. Students 
gave responses in the free comments field. Based on the mining of 
students' opinions and perceptions, the LA experience increased 
students' learning satisfaction. To summarize, students argue that: 
LA feedback is helpful for their learning progress; they expect 
that LA is applied in most courses; student-facing LA tools via 
smartphone would have an added-value impact; peer-comparison 
progress dashboards increase their engagement. 


5. DISCUSSION AND CONCLUSIONS 


5.1 Research Question 1 


Table 3 shows that LMS access was significantly higher for the
treatment group, who used LA. This result is consistent with that
mentioned by [3] and [25]. Using LA prompted students to submit
more quizzes and exercises (cognitive activities). In addition,
students increased their sense of belonging to an online
community. However, the findings suggest that high LMS access
does not necessarily affect performance [36].


The experimental group had slightly higher scores than the control
group (Table 2). This result is consistent with the one mentioned
by [27], who stated that "students valued the information, but,
despite high engagement with the information, students' study
behavior and learning outcome remained rather unaffected." In
contrast, many studies [3, 13, 16, 24, 25] have stated that students
tend to perform better when they accept LA interventions. An
explanation is that our LA approach resulted in delayed and
low-effort interventions, which affected the students' overall
performance. The standard deviation (SD) values in Table 2 show
high dispersion, especially in the experimental group, so we argue
that the LA impact affected the students unevenly. We conclude
that there is no performance improvement without the instructor's
strong guidance and targeted interventions.


5.2 Research Question 2 

Some of our findings are consistent with the results of other 
related studies. Based on Table 4, we observe moderate statistical 
correlations between time spent on LMS and the final grade and 
between the number of assignments’ submissions and the final 
course grade. This result is aligned with that mentioned by [2] and 
[15]. Our prediction model for academic performance confirms 
the results of related studies; thus, we need models with higher 
accuracy and effect size [7, 9]. 


5.3 Research Question 3 


We conclude that the students' satisfaction was high, in agreement
with findings in [25]. Students' positive response to the usefulness
of student-facing LA is in agreement with the literature [6, 30]. In
addition, the students' responses in the reflection phase confirm
the discussion of [20] that students should analyze their behavior
using their self-regulated methods. In accordance with [35], the
above findings strengthen understanding students' opinions of LA
qualitatively rather than as technical methods. Furthermore,
interpreting students' comments, we argue that students liked this
new learning approach following personalized reports. Students
would like LA personalized interventions with a smartphone
application and comparisons of their learning progress with their
classmates. In conclusion, the students' sample quotes reveal
emerging themes: awareness of others in the class, motivation,
increased satisfaction and self-regulation, and technical proposals.


5.4 Limitations 

We acknowledge that there are certain limitations to this small- 
scale study that prevent its findings from being generalized. First, 
the small sample size and the context of the dataset limit the 
findings. The data covers one semester on a very domain-specific 
course at one Greek university, and institutional factors influence 
the results. Furthermore, the LMS captures a subset of all the 
events in a learning experience, while other student characteristics 
may influence student outcomes. It would be useful to search for 
other factors or latent variables that might differ between the two 
groups in order to improve the results. Second, engagement was 
measured in terms of quantity rather than quality. More factors 
that influence student engagement quality should be studied, such 
as teacher participation and student effort. Third, the 
questionnaire's answers indicate that students in the experimental 
group are satisfied with the LA tool; however, we do not know 
how LA impacts students’ decision-making strategy. 


5.5 Future work 

We intend to replicate the study with another treatment 
group that receives strong (high-effort) teacher guidance, to compare 
its impact with that of the minimal (low-effort) group. We will 
evaluate the impact of three levels of LA interventions: mirroring, 
metacognitive activities, and explicit guidance. Furthermore, we 
intend to focus on replicating the experiment in other course 
settings with larger populations, different profiles, and the use of a 
mobile-based user-centered LA application. It would be 
constructive to build and test a predictive model with higher 
accuracy and stronger effect size applying sophisticated machine 
learning or deep learning algorithms. 
Appendix 


First questionnaire 


Question | Yes | No | No answer 
Do you know what analytics is? | 10 | 13 | - 
Do you know what learning analytics is? | 6 | 17 | - 
Do you believe that the collection and processing of your learning data and behavior will be helpful to your learning experience? | 22 | 1 | - 
Would you be interested in being informed about your learning progress concerning your classmates? | 13 | 6 | 4 
Would it be helpful to have feedback (e.g., personalized monitoring and individualized learning material) on your learning progress? | 22 | 1 | 


Final questionnaire 


Question | Yes | No | No answer 
Were the personalized notifications about your engagement, absences, and performance useful for the course? | 14 | 1 | - 
Would you prefer more detailed information? | 4 | 8 | 3 
Would you prefer to receive notifications and messages through a smartphone? | 10 | 2 | 3 
Is the comparison of your learning progress with that of your classmates useful? | 11 | 2 | 2 


Please provide free comments: 
"It is the first time for me that a teacher has sent personalized messages to all students about their learning progress. I have nothing more to suggest. It 
would be great to convince the other teachers to do the same". 


"The whole procedure with the exercises, the open discussions, and generally the lecturers' teaching methods helped me very much to self-regulate. I 
enjoyed both class time and homework". 

"It was the first time that we had received such refined, analytical, and informed monitoring about our progress and performance on a course." 

"I would like access to an LA android-based application." 

"I liked the quizzes the most, and I would prefer to be informed via a smartphone app." 

"There was sufficient and motivational guidance from the instructor about online exercises." 

"Instructor's comments about the exercises on LMS were constructive, analytical, and motivational." 

"I would like more teacher guidance about the exercises, the learning material, and the overall learning procedure." 
ABSTRACT 


With the goal of making vast collections of open educational 
resources (YouTube, Khan Academy, etc.) more useful to 
learners, we explored how automatically extractable text 
representations of math tutorial videos can help to cate- 
gorize the videos, search through them for specific content, 
and predict the individual learning gains of students who 
watch them. In particular, (1) we devised novel text rep- 
resentations, based on the output of an automatic speech 
recognition system, that consider the frequency of different 
tokens (symbols, equations, etc.) as well as their proximity 
to each other in the transcript. Unsupervised learning 
experiments, conducted on 208 videos that explain 18 math 
problems about logarithms, show that the clustering accuracy 
of our proposed methods reaches 85%, surpassing that 
of standard TF-IDF features (78% using log normalization). 
(2) In a video search setting, the proposed text features can 
significantly reduce the number of videos (up to 88% reduc- 
tion on our dataset) and amount of video time (up to 82%) 
that users need to spend looking for desired content in large 
video collections. Finally, (3) in an experiment on Mechani- 
cal Turk with n = 541 participants who watched a randomly 
assigned tutorial video between a pretest & posttest, the text 
features and their multiplicative interactions with students’ 
prior knowledge provide a statistically significant benefit to 
predicting individual learning gains. 


Keywords: Open educational resources (OER), Crowdsourc- 
ing, Information Retrieval 


1. INTRODUCTION 


Consider a large repository (Khan Academy, edX, etc.) of 
open educational resources (OERs) such as tutorial videos, 
and a scenario in which the ultimate goal is to help learners 
to learn by recommending relevant and high-quality con- 
tent that matches the students’ needs. Knowing what the 
learner needs and providing the right content that suits them 
is crucial. We could estimate automatically the most bene- 
ficial content by analyzing their performance on prior exam- 
Figure 1: Example videos in our study. Right: Google’s 
Speech-to-Text extracts the text “solve for x ok our problem 
is log base 3 of x minus 1 equals 4...”. 


inations. However, a current challenge with contemporary 
OER repositories is that the content within each resource 
is typically poorly annotated, with tags that are too general, 
e.g., “algebra” or “linear equations” rather than 
“Simplify $\log_{10} 1000$”. Given the high labor and time involved 
in manual annotation, it is desirable to devise methods of 
automatically analyzing OER content and devising represen- 
tations that can facilitate efficient search and categorization. 


While optical character recognition and handwriting recognition 
are both mature fields, they are typically evaluated 
in much more constrained settings than math tutorials, in 
which math is mixed with natural language, and extraneous 
lines and other graphics can exist (see Figure 1). In 
full-fledged tutorial videos, segmenting the math content from 
such clutter can be very 
challenging. Our research focuses instead on analyzing the 
speech transcript of the video (while ignoring other potential 
audio characteristics such as background noise, pitch, etc.). 
When a particular expression or equation is presented in a 
video, there is a high chance that the speaker will also say 
that expression/equation out-loud to the learners (Figure 1). 
Rather than manually transcribing the text from the video, 
we consider only fully automatic approaches based on automatic 
speech recognition (ASR; we used the Google Speech-to-Text 
API in our work; more details about the pilot test 
are in Appendix B). Hence, the text representations we explore 
must contend with imperfect transcripts. We then assess 
the utility of the proposed representations for three tasks: 
(1) cluster the videos automatically into the specific math 
problems that they explain; (2) search through a library 
of videos for one that explains a particular math problem; 
and (3) predict the individual learning gains of students who 
watch the videos in a pretest/treatment/posttest paradigm. 
In these ways, we hope to make available to students the 
right content that is already available, but not easily find- 
able, among large-scale OER repositories. 


We conduct our investigation on a collection [14] of math 
tutorial videos about logarithms, and another dataset from 
YouTube on basic algebra. Our goal is not just to make 
coarse distinctions between videos about “algebra” versus 
“geometry”, but rather fine-grained distinctions about spe- 
cific math problems. Mirroring our goals from the previous 
paragraph, our research questions are the following: RQ1: 
How accurately can the devised text features cluster videos 
into fine-grained categories about the specific problem they 
are solving, and which aspects of these representations are 
most important? RQ2: By how much can we reduce the 
search time to find a relevant video? RQ3: Are the text fea- 
tures predictive of the individual learning gains of students 
who watch these videos in a pretest/posttest setting? 


2. RELATED WORK 


Text Representations: There are several prominent text rep- 
resentations used for language modeling: (1) Term frequency 
and Inverse document frequency (TF-IDF) [12]: TF-IDF 
features typically do not require training and are thus suit- 
able for unsupervised settings. (2) Word embedding mod- 
els [8, 7] based on neural networks trained using supervised 
learning. (3) Sentence-level models such as BERT [8] that 
capture higher-level semantics than word embeddings. 


Video Categorization & Clustering: For categorizing video 
content automatically, much of the prior work has focused on 
fields other than math tutorials, such as films and sports videos [2], 
[4], [11]. Most prior methods on video categorization focus 
on visual aspects such as frame transitions, object detection, 
and segmentation. Some of them use the audio (e.g., [2]), 
such as the audio frequency and amplitude statistics. We 
are unaware of any previous research that clustered video 
content at the low-level tags of individual math problems. 


Video Retrieval in OERs: There has been increasing interest 
in the task of video retrieval of OERs. Many works 
have pursued combined feature representations with both 
textual and visual information [13, 15, 3]. Hürst [3] found 
that the lecture slides are more useful than the corrected 
transcriptions. In our work, while we focus solely on text 
representations, the features we devise could be easily com- 
bined with visual features. 


Estimating the Effectiveness of OERs: For the task of es- 
timating the effectiveness (e.g., associated learning gains) 
of viewing tutorial videos, researchers have pursued various 
approaches, including estimating their effectiveness through 
correlated measures such as engagement while watching the 
video [10, 6, 1]. For estimating the effectiveness of OERs 
in general, one can also use a combined experimental and 
reinforcement learning-based approach such as bandit algo- 
rithms [9]. While Rafferty et al. [9] suggested the potential 
use of context (for example, features of the OERs as well 
as of the students’ prior knowledge) for predicting learning 
gains, they did not actually pursue that approach. 


3. TEXT REPRESENTATIONS 


In this paper we explore unsupervised representations of the 
transcripts of math tutorial videos. When designing the rep- 
resentations, we considered the following characteristics: (1) 
Similar content should involve similar tokens. A math video 
whose transcript consists of just “two plus three”, for exam- 
ple, is unlikely to be similar to a video whose transcript is 
“four times x”. (2) The most important tokens tend to recur 
within a video transcript. Conversely, tokens that are 
uttered only once are often less important or may even be 
transcription errors. (3) The relative order of nearby tokens is 
important for deciphering the math content. For example, 
“four over two” and “two over four” are different fractions, 
but the difference is reflected only in the relative order of 
tokens, not in their frequencies. For characteristics (1) and 
(2) above, we created several variations of “1D” text repre- 
sentations that capture which tokens occur more frequently 
in each video. With the additional characteristic (3), we 
also explored “2D” text representations that can capture the 
relative order within a fixed radius from token i w.r.t. token 
j for each (i,j) pair. We note that extracting the precise 
mathematical expression from the transcript is inherently 
ambiguous. For example, the two distinct expressions $2^{x+2}$ 
and $2^x + 2$ would likely both be spoken as “two to the x 
plus two”. Fortunately, our objective is not to capture the 
math content perfectly, but to capture enough of it to en- 
able effective clustering, search, and prediction of learning 
gains. Below we describe different kinds of unsupervised 
text representations that vary in terms of token type, order 
dependency, and summarization method. 


3.1 Token Types 
3.1.1 Individual Token 


As our simplest representation, we call each word (separated 
by spaces) a token, and then we count the number of 
math-related tokens, defined as: (1) numbers (digit-only), 
(2) operations (e.g., +, −, ×), or (3) variables (a letter of the alphabet). 
For the operations, we map synonyms to the same token, 
e.g., ‘plus’ to ‘+’, ‘to the [power]’ to ‘^’. Additionally, we 
add the words corresponding to each digit 0 to 9 (i.e., ‘zero’, 
..., ‘nine’) as math-related tokens. For variables, we used 
a restricted alphabet consisting of {b, c, n, m, w, x, y, z} (we 
omitted ‘a’ since it is also a common English word). 
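
The token definition above can be sketched in a few lines of Python. This is only an illustrative sketch, not the authors' implementation: the function name `math_tokens` is hypothetical, and apart from ‘plus’ and ‘to the’ (which come from the text) the entries in the synonym table are assumptions.

```python
import re

# Spoken-operator synonyms. Only 'plus' -> '+' and 'to the' -> '^' are taken
# from the text; the remaining entries are illustrative assumptions.
OPERATOR_SYNONYMS = {"plus": "+", "minus": "-", "times": "*",
                     "over": "/", "equals": "="}
OPERATOR_SYMBOLS = {"+", "-", "*", "/", "=", "^"}
DIGIT_WORDS = {"zero", "one", "two", "three", "four",
               "five", "six", "seven", "eight", "nine"}
VARIABLES = set("bcnmwxyz")  # restricted alphabet from the text ('a' omitted)


def math_tokens(transcript: str) -> list[str]:
    """Return the math-related tokens of a transcript, in spoken order."""
    # Map the two-word phrase to a single symbol before splitting on spaces.
    text = re.sub(r"\bto the\b", " ^ ", transcript.lower())
    tokens = []
    for word in text.split():
        word = OPERATOR_SYNONYMS.get(word, word)
        if (word.isdigit() or word in DIGIT_WORDS
                or word in OPERATOR_SYMBOLS or word in VARIABLES):
            tokens.append(word)
    return tokens
```

Applied to the transcript quoted in the caption of Figure 1, the sketch keeps `x`, `3`, `x`, `-`, `1`, `=`, `4` and drops the surrounding natural-language words. Note that the phrase rule also fires on non-mathematical uses of "to the", which is one source of the noise that the later summarization step is meant to dampen.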


3.1.2 Expression Token 

To infer which math problem a video explains, it might be useful to 
extract entire expressions. For example, “2 plus 3” could be 
considered as one token “2+3” rather than ‘2’, ‘+’, and ‘3’. Specifically: 
(1) We mark all tokens in the transcript as either 
math-related or non-math-related. Tokens that are labeled 
as math-related are literals (LIT) and operators (OP) such 
as plus (+), /, etc. (2) For each contiguous sequence 
of math-related tokens, we read the tokens one-by-one and 
concatenate them into one expression according to the rule: 
starting with LIT followed by OP, LIT, ... (alternately). 
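
One way to read the LIT/OP concatenation rule is as a small state machine over the token stream. The sketch below is an assumption-laden illustration: the names `expression_tokens` and `flush` are hypothetical, a literal is assumed to be a digit string or a variable from the restricted alphabet, and a dangling trailing operator is simply dropped (the text does not say how such cases are handled).

```python
OPERATORS = {"+", "-", "*", "/", "=", "^"}
VARIABLES = set("bcnmwxyz")


def is_literal(tok: str) -> bool:
    # Assumption: literals are digit strings or restricted-alphabet variables.
    return tok.isdigit() or tok in VARIABLES


def flush(current, expressions):
    """Close the expression under construction and store it, if non-empty."""
    # Drop a dangling operator so that every expression ends with a literal.
    if current and current[-1] in OPERATORS:
        current = current[:-1]
    if current:
        expressions.append("".join(current))


def expression_tokens(tokens):
    """Merge alternating LIT, OP, LIT, ... runs into single expression tokens."""
    expressions = []
    current = []            # pieces of the expression being built
    expect_literal = True   # an expression must start with a literal
    for tok in tokens:
        if expect_literal and is_literal(tok):
            current.append(tok)
            expect_literal = False
        elif not expect_literal and tok in OPERATORS:
            current.append(tok)
            expect_literal = True
        else:
            # The token breaks the alternation (or is not math-related).
            flush(current, expressions)
            current = [tok] if is_literal(tok) else []
            expect_literal = not current
    flush(current, expressions)
    return expressions
```

For the token sequence `2 + 3 and x = 4`, the sketch yields the two expression tokens `2+3` and `x=4`; two adjacent literals such as `2 3` stay separate because no operator joins them.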


3.2 Token Count Vector 


Given the sequence of tokens in each video, we then com- 
pute either a 1D vector or 2D matrix of frequency statistics 
(which are finally summarized as described in Section 3.3). 
In the subsections below we let T be the set of all tokens 
that appear in any of the videos. 




1D (No Order Dependencies): The count vector of each 
video contains |T| components, each of which records how 
frequently the corresponding token occurs in the video. 


2D (First-Order Dependencies): With the goal of encoding 
the relative order of tokens, we computed a 2D matrix M, 
of size |T| × |T|, such that $M_{ij}$ is the number of times that 
token i appears before token j in the transcript. In this 
approach, we introduced a “radius” parameter k to limit the 
distance of token pairs (i, j) that need to be considered. For 
example, if k = 4, all token pairs (i, j) such that the distance 
between i and j is < 4 will be counted; all others are ignored. 
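
The two count structures can be illustrated with a short Python sketch. The function names are hypothetical, and the sketch treats the radius as inclusive (pairs at distance 1 through k are counted), which is an assumption made so that k = 1 still counts adjacent pairs.

```python
from collections import Counter


def count_vector(tokens, vocabulary):
    """1D representation: frequency of each vocabulary token in one video."""
    counts = Counter(tokens)
    return [counts[t] for t in vocabulary]


def count_matrix(tokens, vocabulary, k):
    """2D representation: M[i][j] is how often token i precedes token j
    within a distance of at most k positions in the transcript."""
    index = {t: n for n, t in enumerate(vocabulary)}
    size = len(vocabulary)
    M = [[0] * size for _ in range(size)]
    for pos, first in enumerate(tokens):
        # Look ahead at most k positions (inclusive bound: an assumption).
        for second in tokens[pos + 1: pos + 1 + k]:
            M[index[first]][index[second]] += 1
    return M
```

With the vocabulary `["2", "4", "/"]`, the transcripts "4 / 2" and "2 / 4" have identical count vectors `[1, 1, 1]` but different count matrices, which is exactly the distinction that characteristic (3) asks the 2D representation to capture.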


3.3 Token Summarization Methods 

Given the token count vector computed in Section 3.2, we 
then summarize each count x using a summarization function 
f. We considered the following functions: (1) Raw Frequencies: 
We let f(x) = x. (2) Binarized Frequencies: Binarizing 
the counts x might be less susceptible to noise; hence, 
we tried setting f(x) = 1 if x ≥ 1 and f(x) = 0 if x = 0. (3) 
Weighted Frequencies: It might be beneficial to downweight 
tokens that appear only once, because they might be noise from 
the extraction process; important tokens should be mentioned 
multiple times in a video, whereas tokens found only once (we 
call them $t_{=1}$) are either insignificant or incorrectly extracted. 
Instead of removing $t_{=1}$, we introduced the parameter r to 
downweight them. In this case, instead of using the raw 
frequencies, we fixed the weight of $t_{>1}$ (tokens that appear more than 
once) as 1 and downweighted $t_{=1}$ by r. We thus let 
f(x) = 1/r if x = 1, f(x) = 0 if x = 0, and f(x) = 1 if x > 1. 
Note that r = 1 is equivalent to Binarized Frequencies. 
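
The three summarization functions are simple enough to state directly in code. The following is a minimal sketch; the function name `summarize` and its `method` argument are hypothetical conveniences, not names from the paper.

```python
def summarize(count, method="weighted", r=2.0):
    """Apply a summarization function f to a single token count."""
    if method == "raw":
        return float(count)                 # f(x) = x
    if method == "binarized":
        return 1.0 if count >= 1 else 0.0   # f(x) = 1 if x >= 1, else 0
    if method == "weighted":
        if count == 0:
            return 0.0                      # f(0) = 0
        # Tokens seen exactly once are downweighted to 1/r;
        # tokens seen more than once keep weight 1.
        return 1.0 / r if count == 1 else 1.0
    raise ValueError(f"unknown method: {method}")
```

With r = 2, counts of 0, 1, and 3 map to 0, 0.5, and 1. Setting r = 1 reproduces the binarized function, and passing `float("inf")` for r removes once-only tokens entirely.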


4. DATASET 


We applied the text representations to two sets of tutorial 
videos: (1) Logarithms and (2) Algebra, see Appendix A. 


5. CLUSTERING 


Given the different feature types, we test whether they serve 
as an effective basis for clustering the videos. In this section, 
as ground-truth cluster labels, we took the math problem 
(there were K = 18 unique problems in total) that each 
video explained as its label. Note, however, that we could 
also cluster the videos by the category of problems that they 
explain (see Section 4); we do so in Section 7. 


Methods: For each of the different text representations, we 
applied K-means clustering to group the videos into K = 18 
clusters, followed by the Hungarian algorithm [5] to optimally 
match the estimated cluster indices to the ground-truth 
indices. Since K-means converges to different local minima 
depending on the random initialization, we executed the 
algorithm 512 times and then reported the average accuracy 
for the clustering with the lowest sum of squared distances. 
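
Clustering accuracy depends on how cluster indices are matched to ground-truth labels. The sketch below illustrates the matching step only; as a labeled stand-in for the Hungarian algorithm it searches all permutations, which gives the same optimal matching but is feasible only for a handful of clusters. For K = 18 one would use a real assignment solver (for example `scipy.optimize.linear_sum_assignment`). The function name is hypothetical.

```python
from itertools import permutations


def clustering_accuracy(true_labels, cluster_ids, num_clusters):
    """Accuracy under the best one-to-one mapping of clusters to labels.

    Brute-force stand-in for the Hungarian algorithm: exact, but its cost
    grows factorially with num_clusters, so use it only for small examples.
    """
    # contingency[c][t] = number of items in cluster c with true label t
    contingency = [[0] * num_clusters for _ in range(num_clusters)]
    for t, c in zip(true_labels, cluster_ids):
        contingency[c][t] += 1
    best = 0
    for mapping in permutations(range(num_clusters)):
        matched = sum(contingency[c][mapping[c]] for c in range(num_clusters))
        best = max(best, matched)
    return best / len(true_labels)
```

Because K-means assigns arbitrary indices to clusters, a clustering that is perfect up to relabeling scores 1.0 under this measure.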


Results: Table 1 shows the clustering accuracy results. All 
three methods yield accuracies that are much greater than 
the random baseline, which achieved only 18.27% accuracy. 


Weighted Frequencies: We tried multiple values of r (r = 1 
is equivalent to the Binarized Token Counts). We also added 
r = 0.5 and r = 0.25; these values run counter to our intuition 
because they weight $t_{=1}$ more heavily, and we included them as 
a sanity check that the accuracy should get worse. Table 1 shows that the 
weighted frequencies increase the accuracy significantly, 
by 10% on average. r = 2 performs the best among r = 
2, 4, 8. As r gets larger, we see a slight decrease in accuracy. 


1D vs. 2D: Table 1 (right) shows clustering accuracy with the 
2D approach. For the radius k = 2 on the Expression token 
(and using weighted frequencies with r = 2), the accuracy 
increases by around 2% compared to 1D. However, we can 
see substantial variance in the accuracy across the different k, and 
hence the advantage may not be statistically reliable. 


Comparison to TF-IDF: Our token summarization methods 
can be seen as variations of TF-IDF, where only the TF 
term f(x) is used; in other words, we used a constant 1 for 
the IDF term. (We experimented with several IDF functions 
but found that they all worked worse than just 1.) 
The weighted frequency scheme we tried can be seen as a 
coarse (piecewise-constant) approximation to the (smooth) 
log function commonly used as the TF function in TF-IDF. 
Using TF-IDF (with log for TF and 1 for IDF), the clustering 
achieved 78.67% for the Expression Token (down about 5% 
from our weighted frequency method). For the Individual 
Tokens, it performed similarly in accuracy to the weighted 
frequency methods. 
(See the “log” column in Table 1.) In summary, the results 
provide some evidence that our text representations may 
yield a worthwhile accuracy advantage over TF-IDF. 


6. SEARCHING 


Here we explore whether the proposed text representations 
could be used to create a simple search engine that reduces 
the amount of video time users would need to watch. Using 
the text representations, we can build a simple search 
engine as follows: (1) From each video i in a collection S, 
we transcribe its speech into text (using Google ASR) and 
then extract its text representation $v_i$. Then, (2) for any 
search query (e.g., “Simplify: log, 16”), we likewise extract 
its text representation q using any of the methods presented 
in Section 3. Finally, (3) we rank all the videos in S by the 
cosine similarity between $v_i$ and q. 
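
The ranking step can be sketched as follows, assuming each video and the query have already been turned into equal-length numeric vectors by one of the representations in Section 3. The function names and the dictionary-based input format are illustrative assumptions.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def rank_videos(query_vec, video_vecs):
    """Return video ids sorted by decreasing cosine similarity to the query.

    video_vecs maps a (hypothetical) video id to its text representation.
    """
    return sorted(video_vecs,
                  key=lambda vid: cosine_similarity(video_vecs[vid], query_vec),
                  reverse=True)
```

Cosine similarity ignores vector length, so a long video that repeats the query's tokens many times is not automatically ranked above a short one with the same token profile.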


Experiments: Here we consider a general setting in which 
multiple math problems may be explained in a single video. 
A search engine that can pinpoint which segment of a video 
explains the solution could save the user significant time 
compared to watching the whole video. For this setting, 
there is a trade-off between granularity and accuracy: the 
search engine may be more accurate if the segment length 
is longer, but the user can save more time if the segment 
returned to them by the search engine is shorter. Hence, 
we introduced a segment length parameter, L. We divided 
each video into multiple segments of length L. Each segment 
has its own (sub-)transcript and its own problem that it 
explains. Hence, we treat each segment as its own “video”. 
Our goal is to find any segment in the video that explains 
the problem in the user’s query q. As a baseline, we used a 
simulation (averaged over 20 runs) to estimate the sum of 
the segment lengths (in seconds) that a user would have to 
watch before finding a relevant segment. 
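
The baseline simulation could look roughly like the sketch below. It is a guess at the procedure under stated assumptions, not the authors' code: segments are assumed to be watched in uniformly random order, each segment is watched in full (including the relevant one), and at least one segment is relevant. The function name and input format are hypothetical.

```python
import random


def simulated_watch_time(segments, target, runs=20, seed=0):
    """Average seconds watched, in random order, until a relevant segment is seen.

    segments: list of (length_in_seconds, problem_id) pairs.
    target:   problem_id the user is searching for.
    """
    rng = random.Random(seed)
    total = 0.0
    for _ in range(runs):
        order = list(segments)
        rng.shuffle(order)          # the user has no ranking to guide them
        for length, problem in order:
            total += length         # the whole segment is watched
            if problem == target:
                break               # first relevant segment found
    return total / runs
```

The watch time under a ranked search can be computed with the same inner loop applied to the ranked order, and the relative decrease between the two quantities gives a time-saving percentage of the kind reported in this section.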


            | 1D                                                            || 2D 
            |       | Weighted: r                                   |       || Radius k (for Weighted: r = 2) 
Token Types | Raw   | 0.25  | 0.5   | 1     | 2     | 4     | ∞     | log   || 1     | 2     | 4     | 8     | 16    | ∞ 
Individual  | 54.33 | 25.48 | 41.35 | 67.31 | 72.11 | 68.75 | 64.90 | 73.56 || 64.90 | 63.46 | 59.13 | 57.69 | 62.02 | 49.04 
Expression  | 50.48 | 20.67 | 44.71 | 70.19 | 83.65 | 83.17 | 80.29 | 78.67 || 83.65 | 85.58 | 75.96 | 71.64 | 73.56 | 51.44 

Table 1: Clustering accuracy (%) on the logarithm videos for 1D text representations with different token types and summarization 
methods, and the clustering accuracy for 2D representations and different token types (all with weighted summarization: r = 2). 


Results on the Algebra dataset: We analyzed the 234 videos 
of the Algebra dataset that contain multiple problems; in 
total, these videos explain 300 algebra problems. We varied 
the segment length L over the set {15s, 30s, 1m, 2m, 4m} 
(see Figure C.1 in Appendix). The results show that the 
best text representation was the 2D Binarized Individual Token 
(k = 8). In particular, the 2D representations showed 
an advantage (compare the pairs of {blue, pink}’s solid and 
dashed lines). We found that radius k = 8 for the 2D Representation 
performs best across each method. For the Interval 
Length, the percent decrease in watch time is highest at 
L ∈ {30s, 1m} (i.e., the most helpful, see Figure C.1 in Appendix). 
As L continues to grow, the results go down, and 
at L = 15s, the performance drops. This exemplifies the 
trade-off between segment length and available information. 


Results on the Logarithm dataset: In this dataset, each video 
contains one log problem. For each of 18 logarithm problems, 
we search for any of the videos that solve that particular 
problem. Compared with the random baseline, 
the results show the same trend as for the Algebra dataset: 
The 2D Representation gives the best results. We found, 
for instance, that the Binarized Individual Token yields results 
of 89.96% and 93.19% for 1D and 2D (k = 8), respectively. 
The same holds true for the Weighted Expression Token (r = 2), 
with results of 91.25% and 93.20%. For the 1D approach, 
the best representation was TF-IDF (with log for TF and 
identity for IDF); the reduction was slightly lower (92.85%). 


7. LEARNING GAIN PREDICTION 


In this section, we investigate whether the text representation 
can be used to predict the learning gain of students 
who watch the videos as an educational intervention. The 
high-level idea is that the effectiveness of each tutorial video 
can be estimated by the interaction of the content within 
the video and the student’s prior knowledge. In contrast to 
some prior work that predicted the average learning gains 
of a video over many students, here we tackle the arguably 
harder problem of predicting the individual learning gains of 
each student, measured as the difference in test scores on 
the curriculum before and after watching the video. 


The Logarithms dataset (Section 4) contains pretest/posttest 
scores of students who received a tutorial video as an intervention. 
Hence, we use each participant’s pretest score and 
the text representation of the video they watched as predictors 
to estimate their learning gains (posttest minus pretest 
score). Rather than use the text representation as a feature 
vector itself, we instead use the category label assigned 
to the problem (Section 5) by the clustering algorithm as a 
0-1 indicator variable with an associated model coefficient; 
hence, our models can find interactions between a student’s 
prior knowledge and the topic in the video they received. 


7.1 Prediction Models 


We considered both linear models with mixed effects, as 
well as deep non-linear models based on neural networks, 
but we found that the latter overfit too easily and gave 
unstable results; hence, we present only the linear models. 
Let $p_{ij}$, j = 1, 2, 3, be student i’s prior knowledge (pretest 
score) within the 3 problem categories (j) on logarithms. 
Let $c_{ij}$, j = 1, 2, 3, be 0-1 indicator variables that reflect 
whether student i’s assigned video belongs to each category 
j. (Note that each video is assigned to exactly one of the 
three categories.) We can compute $c_{ij}$ using either (a) Manually 
Labeled Categories (MLC) from human annotators, or 
(b) Automatically Labeled Categories (ALC) from the text 
representations and clustering algorithm (Section 5). 


Prediction Model: We constructed a model that considers 
multiplicative interactions between the student’s prior 
knowledge $p_{ij}$ in each problem category and the cluster label 
$c_{ij}$ of the student’s assigned video: 

$y_i = \sum_{j=1}^{3} \left( w_j p_{ij} + v_j c_{ij} + u_j (p_{ij} \times c_{ij}) \right) + \epsilon_i$. 

Importantly, this model contains multiplicative interaction 
terms $p_{ij} \times c_{ij}$. 


Results: We found that the interaction $p_{ij} \times c_{ij}$ using MLC 
has a statistically significant effect on the learning gain 
($F_{11,582} = 5.839$, p = 5.11e-09), and so does this interaction 
using ALC ($F_{11,582} = 6.425$, p = 4.125e-10). The RMSE 
is 0.464, which is slightly better (about 3.1% relative decrease) 
compared to prediction model 1. Specifically, we 
found that, for example, $u_3$ is negative and statistically significant 
(p = 0.0005) in the ALC model. The sign of 
$u_3$ means that, if $p_{i3} \times c_{i3}$ is low, then the learning gain is 
high (and vice versa). In turn, $p_{i3} \times c_{i3}$ is low either because 
(1) $p_{i3}$ is low and $c_{i3} = 1$, i.e., an individual knows 
little about topic 3 and receives a tutorial about topic 3, 
yielding high learning gain; or (2) $p_{i3}$ is high and $c_{i3} = 0$, 
i.e., an individual already understands topic 3 and receives 
a tutorial on another (more helpful) topic, yielding high learning gain. 
Both the MLC and ALC interactions were statistically significant, 
suggesting that the text representations can group videos in ways 
that predict individual learning gains. 


8. CONCLUSION AND FUTURE WORK 


We have devised novel text representations to represent the 
content of math tutorial videos. On a dataset of hundreds of 
math videos and hundreds of students who watched them, 
we showed that the representation can be used to (1) accurately 
(around 85%) cluster the videos into the math problems 
they solve (RQ1); (2) search for specific video content 
in a large repository of videos, thereby saving the user considerable 
(up to 88%) search time (RQ2); and (3) predict 
individual learning gains, in conjunction with features of the 
students’ prior knowledge, with statistical significance (RQ3). 
APPENDIX 


A. DATASET 


Here we describe each dataset we use in the experiments in 
more detail. 


Logarithms: This is the dataset collected by Whitehill & 
Seltzer [14], which contains a repository of 208 math 
tutorial videos about logarithms. Most videos are between 
1 and 3 minutes long. In total the collection spans 18 logarithm 
problems, with 9 to 17 videos per problem. Relevant only 
to Section 7, the dataset also contains the pretest and 
posttest scores of 541 participants from Amazon Mechanical 
Turk who watched the videos. There are 226 males, 
207 females, and 108 of undefined gender, with an average age of 
33.71 ± 9.84. Specifically, each participant was asked to 
answer 19 logarithm pretest problems, which were classified 
into 3 main categories: (1) the logarithmic term without 
variables e.g. logy 1, (2) the logarithmic term with variables 
e.g. log,, 4, and (3) the logarithmic equation e.g. solve for 
x where x log, 16 = 3 (category 1, 2 and 3 contain 102, 61, 
and 45 videos, respectively). Then, they were assigned to 
one random video among 208 logarithm tutorial videos, and 
were asked to complete a posttest (same level of difficulty 
as the pretest but slightly different problems). 


Algebra: For the search task, we collected another dataset, 
containing 234 algebra math tutorials on YouTube. Of the 234 
videos, 213 contain one math problem and 21 contain 
multiple math problems (a total of 87 math 
equations), for a total of 300 expressions in the entire dataset. We 
manually annotated which equation (e.g. 277 — 24-12 = 0, 
x +7 = 10) each video explains. For videos with multiple 
math problems, we marked the start and end times of each. 


B. SPEECH-TO-TEXT TRANSCRIPTION 


All the feature types we explore are based on obtaining an 
approximate transcript of the video from an ASR system. In 
particular, we use the Google Speech-to-Text API. As a pilot test 
of its accuracy on the OERs in our dataset, we manually 
annotated 10 videos (a total of 3044 words in the 
ground-truth transcripts). Google’s API achieved a word error rate 
(WER) of 5%, which intuitively seemed sufficient. 
An example of extracted speech 
is shown in Figure 1 (caption). After obtaining the tran- 
script for each video in our collection, we then tokenized it 
and summarized the token frequencies. 
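
For readers unfamiliar with the metric, WER is conventionally the word-level edit distance (substitutions, insertions, and deletions) between the ASR output and the reference transcript, divided by the number of reference words. The sketch below shows that standard definition; it is not taken from the paper, and the function name is hypothetical.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level Levenshtein distance divided by the reference length.

    Assumes a non-empty reference transcript.
    """
    ref, hyp = reference.split(), hypothesis.split()
    # prev[j] = edit distance between the processed reference prefix
    # and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, 1):
        cur = [i]
        for j, h in enumerate(hyp, 1):
            cur.append(min(prev[j] + 1,              # deletion
                           cur[j - 1] + 1,           # insertion
                           prev[j - 1] + (r != h)))  # substitution or match
        prev = cur
    return prev[-1] / len(ref)
```

A single misrecognized word in a three-word reference, such as "four" for "for", gives a WER of one third; math transcripts are sensitive to exactly this kind of homophone error.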


C. ADDITIONAL FIGURES 
Figure C.1: The decrease in time needed to find specific 
math content in a set of math tutorial videos. Each line 
shows a different text representation over different segment 
lengths. 
ABSTRACT 


Self-evaluation is a key self-regulatory process that can already 
be mastered by young children. In order to assess 
the self-evaluation skills of children, we introduced a 
prompt, shown at random after 1 out of 15 exercises, into a 
literacy web application for primary school students, asking 
them to rate the perceived difficulty [Too easy, Good, Too difficult] 
of the exercise they just solved. Comparing students’ 
actual performance with their responses to this prompt can 
provide information about their ability to self-evaluate, and 
thus detect students who could improve their self-evaluation 
skills. We collected more than 1,000,000 responses from 
300,000 students and used these data as well as performance 
data on each question of each exercise to predict a student’s 
response to the next prompt, thereby estimating how likely 
they are to have a self-evaluation deficit. The results show 
(a) that a student’s past responses to self-evaluation statements 
impact the quality of future predictions, (b) that the 
impact of past responses, versus their current performance, is 
greater when the student has a low capacity for self-evaluation, 
and (c) that including older student data (answers from several 
sessions ago) helps improve the accuracy of the prediction. 
These results pave the way (1) for adaptive polling, 
by identifying when the model is unreliable and giving students 
the statement then instead of at random, and (2) for adaptive 
feedback, by identifying the students most likely to show a 
deficit in order to provide remediation. 


Keywords 
Self-Regulated Learning, Primary school, Self-evaluation, 
Prediction, Remediation, Adaptive polling 


1. INTRODUCTION 


Improving children’s self-regulated learning (SRL) skills is 
a key component of their academic performance, as self- 
regulated students generally know better “how to learn”, 
which can have a positive impact in all disciplines [22]. A key
SRL process is self-evaluation [17], which is a skill already
developed in children as young as 5 years old [19]. It is 
therefore a particularly interesting SRL aspect to target 
when working with young children. The most reliable way 
to assess SRL deficits is through direct questions to the 
students [1], but constant prompting can lead to an overall 
degraded perception of the learning environment [3] and it is 
therefore critical to keep prompting to a minimum. Hence
we are interested here in predicting students’ answers
to assessments of perceived difficulty, which are used to assess
students’ tendencies to overevaluate or underevaluate. The
second aspect we investigate is which features
are the most relevant for this prediction.


2. RELATED WORK 


The EDM community recognized early on the value
of studying SRL through data mining [21], and previous
work has focused in particular on detecting
SRL behaviors from traces [5], discussion forums [9, 8] or
proxy behaviors such as gaming the system [4], and on explaining
such behaviors with sequence mining [20, 2], mixture models
(for procrastination, a proxy of SRL) [13] or coherence
analysis [18]. Other works have focused on analyzing the
differences in use of SRL strategies [6], providing feedback
to encourage them [7] or predicting their use [15]. However,
as far as the authors are aware, no previous work has specifically
attempted to predict how a student would answer
a question aiming to measure an SRL deficit (self-evaluation
or any other), and none of the aforementioned work focused
on young children (5-7 years old). It is worth noting that
although young children’s abilities to use SRL strategies
may be more limited than those of teenagers, they seem to have
comparable monitoring skills [16]. Indeed, recent work on a
dashboard supporting SRL in a mathematics software program
for 9-10 year olds (only slightly older than our targeted
students) showed a significant improvement in SRL skills for
students in the dashboard group compared to those without
the dashboard [12].


3. SELF-EVALUATION ASSESSMENT 
3.1 Context 


Lalilo is one of the many web applications used by teachers
in the classroom to help them implement a differentiated
pedagogy. At the beginning of 2021, it was used by 40,000
English- and French-speaking kindergarten and elementary
classes every week to strengthen literacy through series of
exercises adapted to the students’ level, while providing the
teacher with a dashboard to evaluate the students’ activities
and progress. It is therefore a relevant testing ground for
evaluating and then trying to correct some students’ SRL
deficits. A typical session lasts 20 minutes on average, with
the student performing around 15 short exercises of 3 to
7 questions each. Student activities (e.g. logging in, time
spent on a question/exercise, mistakes) are traced. Here we
focus on students’ answers to an exercise, so we
call trace only the answers to the set of questions of the
same type within an exercise.


3.2 Data collection 

To assess some aspects of students’ SRL skills, we introduced
a random prompt shown once every fifteen exercises when a
student finishes an exercise (i.e. on average once per typical
learning session). This prompt includes a perceived difficulty
statement asking the student “How difficult was this exercise
for you?” with 3 possible answers (“Too hard”, “Just-right”,
“Too easy”). Comparing the answer to the perceived difficulty
statement with the real performance aims at measuring the
self-evaluation ability of the students, i.e. their ability to
correctly estimate the difficulty of the questions they just
answered. Before introducing the assessments, we checked
qualitatively in a classroom using Lalilo that the statements were
understood by 1st grade students (details not presented here).
We collected traces from Kindergarten, 1st grade and 2nd
grade classes based in France, Canada and the USA learning
in French (FR) or English (EN) between January 18 and
February 24, 2021 on the Lalilo platform. We kept only the
traces for which students had answered a prompt; from here
on, we call trace the answers to the exercise together with the
associated answer to the prompt.


4. METHODS 
4.1 Dataset 


Given the history of a student’s answers to the perceived difficulty
prompt and their performance on the current exercise,
we want to predict which answer the student is most likely to give
to the next prompt (and thus extrapolate
their self-evaluation skills). If the student’s performance
(defined in the Feature engineering subsection)
was “excellent” (resp. “poor”) and they answered “Too
hard” (resp. “Too easy”), they were considered as having an
underevaluation (resp. overevaluation) deficit. We filtered
out students who had fewer than 8 traces with answers
to prompts so that our model would not overfit on results
of students with very few answers, as such students
were overrepresented in our initial dataset. The
final dataset had 424,173 traces with an answer to the perceived
difficulty statement from 34,083 students having on average
12 traces with self-evaluation answers (SD = 5.93).
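
The deficit rule and the minimum-trace filter described above can be sketched as follows. The dictionary layout of a trace (a `student_id` key) and the function names are assumptions made for this illustration.

```python
from collections import defaultdict

MIN_TRACES = 8  # students with fewer answered prompts are filtered out


def deficit_label(performance: str, answer: str) -> str:
    """Label one trace from its performance level and prompt answer."""
    if performance == "excellent" and answer == "Too hard":
        return "underevaluation"
    if performance == "poor" and answer == "Too easy":
        return "overevaluation"
    return "none"


def filter_students(traces, min_traces=MIN_TRACES):
    """Keep only the traces of students with at least `min_traces` traces."""
    by_student = defaultdict(list)
    for trace in traces:
        by_student[trace["student_id"]].append(trace)  # assumed key name
    return [trace
            for student_traces in by_student.values()
            if len(student_traces) >= min_traces
            for trace in student_traces]
```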


4.2 Feature engineering 

We engineered several new features that could have
predictive power for our task.

Basic performance feature. In addition to the trace and
student IDs, used for filtering but not for prediction, we
extracted for each trace the answer correctness list, a boolean
vector of length 3 to 7 (the number of questions per exercise).
Enriched performance features. From the answer correctness
list, we extracted 5 additional features: the good answer
count (i.e. the number of 1s in the vector; the higher the
value, the better the student may feel they have succeeded),
the total answer count (i.e. the length of the vector), the
success rate (i.e. the ratio between the good answer count
and the total answer count) and the second half success rate
(i.e. the success rate on the last half of a trace; a student
with a self-evaluation deficit may suffer from a recency bias,
influencing positively [resp. negatively] their perception of
their performance when answering correctly [resp. incorrectly]
the last questions of the exercise).
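
As an illustration, the enriched performance features could be derived from the answer correctness list as in the sketch below. The feature names are hypothetical, and the handling of an odd number of questions (the middle question is counted in the second half) is an assumption, since the text does not specify it.

```python
def enriched_features(correctness):
    """Derive performance features from a list of booleans (one per question)."""
    total = len(correctness)
    if total == 0:
        raise ValueError("an exercise must contain at least one question")
    good = sum(correctness)
    # Assumption: with an odd number of questions, the middle question
    # belongs to the second half.
    second_half = correctness[total // 2:]
    return {
        "good_answer_count": good,
        "total_answer_count": total,
        "success_rate": good / total,
        "second_half_success_rate": sum(second_half) / len(second_half),
    }
```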

Exercise features. We hypothesize that self-evaluation deficits 
are not uniform across one’s knowledge. In particular, a stu- 
dent’s deficit may be stronger in some types of exercises 
or on exercises about a given topic. To assess this impact, 
we added 5 features that relate to the exercise finished just 
before the two difficulty statements: exercise template (for 
example a multiple choice question or a word composition 
exercise), lesson index (there are around 1,000 lessons in 
Lalilo - although they are not entirely linear, the higher the 
value of this feature, the more advanced the content is; when 
working on English (resp. French) data, the lesson index FR 
(resp. EN) is empty), lesson type (lessons are organized in 
a tree structure - lesson type represents the first level cate- 
gory), lesson subject (lesson subject represents the second 
level category in the tree), language (English or French). 
Previous feedback modalities. To help students in
their performance assessment, they are randomly given
audio feedback (such as “In the last exercise, you found 3
correct answers out of 5 questions”) and/or a visual gauge of as
many green ticks and red crosses as they had good and bad
answers in the previous exercise. Even when these summary
exercise-level performance indicators are absent, the
students always receive immediate question-level true/false
feedback. We have previously shown the positive impact of
these two indicators in correcting some self-regulation issues,
and we therefore hypothesize that they need to be taken
into account when predicting how the student will assess the
difficulty. Hence we added two binary features, gauge and
audio, which indicate whether these forms of performance feedback
were given before the two difficulty statements. Feedback is
also provided after answering the prompt when students
display a self-evaluation deficit (as defined in 3.2), encouraging
them to be more confident or warning them to be more
careful; we therefore encode this as a third binary feature,
remediation, indicating whether the student received such
feedback the last time the difficulty was assessed.
Self-evaluation deficit lag features. Self-evaluation deficits are
expected to be a recurring phenomenon in students’ answers,
i.e. a student who has under/overevaluated themselves a few
times is likely to under/overevaluate themselves again in the
future. Hence we added 3 lag features for the last 3 perceived
difficulty assessments. Moreover, since it is possible that the
last 3 assessments did not allow the student to exhibit a
deficit (e.g. a student cannot appear to be overevaluating if
their performance is at 100% on the last 3 exercises where
they were asked to assess the difficulty), we also added 3 lag
features for the last 3 perceived difficulty assessments where
the student’s performance was equal to the performance on
the current exercise. Performance is a categorical value which
is “poor” if the success rate is below 34%, “excellent”
if the success rate is 100% and “medium” otherwise.
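
The performance categories and the two families of lag features can be sketched as follows. The history format (a chronological list of performance level and answer pairs) and the use of `None` for missing lags are assumptions made for the example.

```python
def performance_level(success_rate: float) -> str:
    """Map a success rate in [0, 1] to the three categories of the text."""
    if success_rate == 1.0:
        return "excellent"
    if success_rate < 0.34:
        return "poor"
    return "medium"


def lag_features(history, current_level, n_lags=3):
    """Build lag features from past (performance_level, answer) pairs.

    `history` is in chronological order and excludes the current trace.
    """
    all_answers = [answer for _, answer in history]
    same_level = [answer for level, answer in history if level == current_level]

    def last_n(answers):
        recent = answers[-n_lags:][::-1]  # most recent answer first
        return recent + [None] * (n_lags - len(recent))  # pad missing lags

    return {
        "lags": last_n(all_answers),
        "lags_same_performance": last_n(same_level),
    }
```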
Overall previous self-evaluation deficit. If students are stable
over time in their assessment, we expect that taking into
account the whole history would have a positive impact on
the prediction. We therefore introduced 5 additional
features: self-evaluation answer rank (the number of times the
student has been asked for a self-assessment), number of “Too
easy” (resp. “Too hard”) answers (the number of times the
student answered “Too easy” (resp. “Too hard”)
to previous assessments), and “Too easy” (resp. “Too hard”)
answers ratio (the ratio between the two previous features).
Additionally, similarly to what was done for the lag features,
we also considered the 5 equivalent features computed exclusively on
previous assessments given after an exercise with a similar
performance level.
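
A possible computation of these five history features is sketched below. Two points are assumptions of the example rather than statements from the text: the answer rank is taken as the number of past assessments, and each ratio is read as the answer count divided by that rank.

```python
def overall_features(past_answers):
    """Summarize a student's whole history of perceived difficulty answers."""
    rank = len(past_answers)  # assumed: number of past assessments
    n_easy = past_answers.count("Too easy")
    n_hard = past_answers.count("Too hard")
    return {
        "answer_rank": rank,
        "too_easy_count": n_easy,
        "too_hard_count": n_hard,
        # Ratios are left empty (None) when there is no history yet.
        "too_easy_ratio": n_easy / rank if rank else None,
        "too_hard_ratio": n_hard / rank if rank else None,
    }
```

The performance-matched variant would apply the same function to the subset of past answers given after exercises with the same performance level.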


4.3 Algorithms 

We used CatBoost [14] and LightGBM [10] to perform the
predictions, two gradient boosting algorithms based on decision
trees whose main assets are (a) their ability to natively
deal with categorical features and (b) their explainability,
which makes it possible to study feature importance in each prediction with
SHapley Additive exPlanations (also called SHAP values)
[11]. They have also recently won Kaggle competitions on a
variety of datasets. We used MultiClass as the loss function,
set the number of iterations to 200 and kept the other
hyperparameters at their default values. As their results were
very similar for the global prediction task (cf. Table 1), we
used only the CatBoost model for the other tasks.


4.4 Analyzing features importance 

We first measured the improvements brought by feature
engineering to the global prediction scores using 5-fold
cross-validation, stratifying by student so that no student has traces
in both training and testing folds. For all feature importance
measures described below, we created a training
and a testing set, also stratified by student, and measured
the feature importance on the testing set. We studied the
importance of various features in our model using the SHAP
package [11]. We also compared the importance of features
across the three classes so as to highlight the features that
have the most impact on each class specifically.
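
The student-wise split can be sketched with the standard library alone, as below. The function names are hypothetical and the round-robin assignment of shuffled students to folds is an assumption; an equivalent grouped splitter from a machine learning library would serve the same purpose.

```python
import random


def student_folds(student_ids, n_folds=5, seed=0):
    """Return one fold index per trace, keeping each student in a single fold."""
    unique_students = sorted(set(student_ids))
    random.Random(seed).shuffle(unique_students)
    fold_of = {s: i % n_folds for i, s in enumerate(unique_students)}
    return [fold_of[s] for s in student_ids]


def train_test_indices(folds, test_fold):
    """Split trace indices into training and testing sets for one fold."""
    train = [i for i, f in enumerate(folds) if f != test_fold]
    test = [i for i, f in enumerate(folds) if f == test_fold]
    return train, test
```

Because folds are assigned per student rather than per trace, a student's history can never leak from the training set into the testing set.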

If students showed a given deficit regularly - we defined a 
threshold at 50% of “underevaluation” (resp. “overevalua- 
tion”) over the traces with 100% (resp. below 34%) success 
rate - we tagged the student as having an “underevaluation 
(resp. overevaluation) deficit”. We then trained our algo- 
rithms on a dataset with students tagged as having a deficit 
and on a dataset with students tagged as having no deficit. 
Our hypothesis was that feature importance would vary be- 
tween these two models: the predictions were expected to 
depend more on past answers than current performance for 
students tagged with deficits compared to the other students. 
Finally, we measured the evolution of the performance of the
model depending on the self-evaluation answer rank that is
predicted. To do so, we analyzed the evolution of Cohen’s
Kappa coefficient, measuring the quality of our prediction
of the perceived difficulty, on a cohort of students of the
testing set having answered a given number of prompts. We
computed this coefficient for each self-evaluation answer rank
and expected the quality of the prediction to improve as
students would be better and better characterized by their
features.
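
The student tagging rule and the per-rank Kappa analysis described in this subsection can be sketched as follows. The record layouts and function names are assumptions for illustration, as is the choice to treat the 50% threshold as inclusive.

```python
from collections import Counter, defaultdict


def tag_student(traces, threshold=0.5):
    """Tag a student from their (performance_level, answer) traces."""
    tags = set()
    rules = (("excellent", "Too hard", "underevaluation"),
             ("poor", "Too easy", "overevaluation"))
    for level, deficit_answer, tag in rules:
        answers = [a for lvl, a in traces if lvl == level]
        # Assumption: the threshold is inclusive (>= 50% of relevant traces).
        if answers and answers.count(deficit_answer) / len(answers) >= threshold:
            tags.add(tag)
    return tags


def cohen_kappa(y_true, y_pred):
    """Cohen's Kappa: agreement between labels, corrected for chance."""
    n = len(y_true)
    if n == 0:
        raise ValueError("no observations")
    observed = sum(t == p for t, p in zip(y_true, y_pred)) / n
    true_counts, pred_counts = Counter(y_true), Counter(y_pred)
    expected = sum(true_counts[c] * pred_counts[c] for c in true_counts) / (n * n)
    if expected == 1.0:
        return float("nan")  # undefined when only one class ever occurs
    return (observed - expected) / (1 - expected)


def kappa_by_rank(records):
    """records: iterable of (answer_rank, true_answer, predicted_answer)."""
    groups = defaultdict(lambda: ([], []))
    for rank, true_answer, predicted in records:
        groups[rank][0].append(true_answer)
        groups[rank][1].append(predicted)
    return {rank: cohen_kappa(t, p) for rank, (t, p) in sorted(groups.items())}
```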


5. RESULTS AND DISCUSSION 


5.1 Global features importance analysis 

First, both CatBoost and LightGBM predict the students’ perceived
difficulty with a reasonable overall performance
(cf. right part of Table 1). Second, we see that
all feature additions improve the model except the previous
feedback ones. Specifically, adding features describing what a
student did in the past improves the predictions significantly.


Table 2 ranks the top 10 features with their mean absolute
SHAP values. Interestingly, the success rate of the student
and the success rate on the second half of the correctness
list are only the 5th and 8th ranked features. This means that
information about the student’s previous answers to
self-evaluation prompts influences the prediction more than
their current performance, although they are being asked
“How difficult was this exercise for you?”. Specifically, the
last three answers to the perceived difficulty question bear a
significant weight in our model’s predictions, as do the
global “Too easy” and “Too hard” ratios. As expected, the
“Too easy” ratio has a huge importance for the “Too easy”
class, as has the “Too hard” ratio for the “Too hard” class,
and both ratios are highly ranked for the “Just-right” class.
Indeed, we did not input the “Just-right” ratio as the model
can learn it from the combination of the “Too hard” and “Too
easy” ratios. We can note that the success rate feature is
mainly important for the “Too easy” and “Too hard” classes,
which is logical as an excellent (resp. poor) performance is
not likely to lead to a “Too hard” (resp. “Too easy”) answer.
We can also see in Figure 1 that, as the “Too hard” ratio is
equal to 0, it drives the prediction score of the “Too hard”
class downwards while it drives the prediction score of the
“Just-right” class upwards. Furthermore, the 3 last answers
to the perceived difficulty statement of this student were
“Just-right”, “Too easy”, “Too easy” and their success rate on
this trace was 100%; the predictions of the “Too easy” class
are therefore also driven upward by these features.


5.2 Features rank from self-evaluation deficits 
Figure 2 shows the feature importance rankings for students
detected as having or not having self-evaluation deficits. Students
with deficits consistently choose how to answer the prompt
based more on past answers (in particular the “Too easy/hard”
ratios), as opposed to students with no deficits, who rely more
on the success rate on this exercise, as one should. These
results are in line with our hypotheses.


5.3 Predicting power based on answer rank 
Figure 3 shows the kappa evolution depending on the number
of past self-evaluation assessments. The kappa for the first
answer is around 0.13, then quickly climbs to around 0.4 for the
next four traces, and finally slowly increases until plateauing
around 0.6. The kappa of 0.13 for the first answer is consistent
with Table 1: at the beginning, Student features are empty
and the model can only rely on Trace features. With Trace
features only, the model reached a Kappa of 0.1084, which
is close to the kappa value of 0.13 in Figure 3. We can
then deduce that the model becomes increasingly able to predict
answers to the perceived difficulty statement as more answers accumulate.
Table 1: Prediction metrics after 200 iterations of the CatBoost and LightGBM algorithms depending on the features used. We
measure mean and standard deviation (in parentheses) on 5-fold cross-validation.

          | Trace features                       | Student features               |
Algorithm | Enriched perf. | Exercise | Feedback | Lag | Overall prev. self-eval. | Accuracy        | Kappa
CatBoost  | Yes            |          |          |     |                          | 0.4662 (0.0006) | 0.0866 (0.0030)
CatBoost  | Yes            | Yes      |          |     |                          | 0.4737 (0.0018) | 0.1084 (0.0022)
CatBoost  | Yes            | Yes      | Yes      |     |                          | 0.4736 (0.0018) | 0.1081 (0.0026)
CatBoost  | Yes            | Yes      | Yes      | Yes |                          | 0.6676 (0.0034) | 0.4522 (0.0053)
CatBoost  | Yes            | Yes      | Yes      | Yes | Yes                      | 0.6701 (0.0028) | 0.4575 (0.0042)
LightGBM  | Yes            | Yes      | Yes      | Yes | Yes                      | 0.6706 (0.0034) | 0.4565 (0.0032)

Figure 1: Feature impact in the prediction of each class for a randomly chosen trace in the testing pool.
Top: “Too easy” class, middle: “Too hard” class, bottom: “Just-right” class.

Figure 2: Top 10 features impact for class prediction of
students detected as having (top) or not (bottom) a
self-evaluation deficit

Figure 3: Kappa value and total number of traces of each class
in the testing group, based on the self-evaluation answer rank

Table 2: Average feature importance rank by class, sorted by
total SHAP value (top 5 in bold)


Features                    Too easy   Just-right   Too hard
“Too easy” ratio            1          1            14
1 to last perc. answer      2          2            3
“Too hard” ratio            6          4            1
2 to last perc. answer      3          3            9
success rate                4          9            2
3 to last perc. answer      5          5            10
1 to last perc. answer P    7          6            12
second half success rate    8          19           4
3 to last perc. answer P    9          v4           11
exercise type               10         15           5


6. CONCLUSION AND FUTURE WORKS 


Using a large volume of trace data from primary school
students, we leveraged students’ past data to significantly
improve the prediction of the answers to future self-evaluation
prompts. The results also indicate that the more data we
have about a student, the better our predictions are. Using
feature engineering, we ranked features by the additional
predictive power they provide, and found results consistent
with SRL theories (in particular that the prediction of answers
for well-regulated students depends mostly on their success
rate). This paves the way for adaptive polling (as opposed to
the current random one), prompting only students likely to
display a self-evaluation deficit, allowing us to better target
remediation. The main limitation of the current work is the
specificity of the context: it would be particularly interesting
to study the main features used in another context with a
different type of students. We are also targeting only one of many
existing SRL deficits, and expanding research to the prediction of
other deficits, to encourage the training of multiple SRL skills,
seems important as well. Future work also includes further
feature engineering to identify features that may have more
impact than the current ones.


ABSTRACT


We describe a data mining pipeline to convert data from 
educational systems into knowledge component (KC) models. In 
contrast to other approaches, our approach employs and compares 
multiple model search methodologies (e.g., sparse factor analysis, 
covariance clustering) within a single pipeline. In this preliminary 
work, we describe our approach's results on two datasets when
using two model search methodologies for inferring item or KC
relations (i.e., implied transfer). The first method uses item
covariances which are clustered to determine related KCs, and the 
second method uses sparse factor analysis to derive the relationship 
matrix for clustering. We evaluate these methods on data from 
experimentally controlled practice of statistics items as well as data 
from the Andes physics system. We explain our plans to upgrade 
our pipeline to include additional methods of finding item 
relationships and creating domain models. We discuss advantages 
of improving the domain model that go beyond model fit, including 
the fact that models with clustered item KCs result in performance 
predictions transferring between KCs, enabling the learning system
to be more adaptive and better able to track student knowledge. 


Keywords 


Knowledge component model, domain model, learning transfer 


1. INTRODUCTION 


This paper describes preliminary progress to create a multimethod 
pipeline to determine the knowledge model (or domain model) that 
allows the most accurate prediction of performance in an adaptive 
learning system using a quantitative model of practice. A broad use 
of quantitative models of practice is to predict performance and 
make pedagogical decisions [1; 2]. To do this effectively, models 
typically assign sets of problems or items specific skill tags (often 
called knowledge components, or KCs). Having such an 
identification allows a system to monitor which skills have been 
learned and which need more practice. The matrices representing 
these item assignments to skills are called Q-matrices [4]. Because 
the act of tracing student learning is so important for pedagogy, the 
assignment of items to KCs is crucially important for systems to 
make pedagogical decisions. Without such an assignment, a system 
would conceivably need to schedule all items for practice to ensure 
mastery, so the assignment or “domain model” must be accurate for 
a system to perform well. Improvements in the domain model may
result in better pedagogical decisions in a system. This paper
describes a more general approach to improve these critical domain 
models, a tradition that has included much prior work [3; 5; 14; 15; 
19; 20; 22]. 


In addition to improving domain models, we highlight how these 
methods may alter how many quantitative models work by enabling 
models where multiple knowledge components can influence a 
single practice trial. While such models are not new [13], 
specifying them with experts is time consuming and error prone. 
Despite this difficulty, domain models that include the potential for 
multiple KCs affecting a single performance also typically capture 
transfer when a shared KC is used in multiple items. In addition to 
making models more accurate, this transfer has large potential 
impacts on pedagogy in a complex adaptive instructional system 
since transfer in an adaptive system means that a KC's performance 
may bias the selection of other items that share KCs. This transfer 
will occur because the shared KC will affect the item predictions, 
making items sharing a KC more or less likely to be practiced. 


2. ANALYSIS METHOD 


We have developed an automatic domain model improvement 
algorithm with a highly configurable analysis pipeline. 


2.1 Step 1 


First is the preprocessing stage. In this stage, a matrix-based
method produces a feature vector of information
representing each item. There are two ways this method might
process the data from an educational system, either all at once or
sequentially in the order the student saw the items. The first case
includes methods such as SPARFA-Lite, which assumes
one observation for each skill for each student [14]. Our example
in this paper uses the SPARFA-Lite model and a simpler model
based on covariance clustering [20]. For our examples, Step 1
meant averaging KC performance for each student to get a student
performance by KC. More advanced methods such as tensor
analysis can proceed with sequential data for each student.
However, this is future work not presented here.
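
The averaging used in Step 1 can be illustrated with the sketch below, which builds a student-by-KC matrix of mean correctness. The input format (student, KC, correctness triples) and the use of `None` for unobserved cells are assumptions made for the example.

```python
from collections import defaultdict


def student_by_kc(observations):
    """Average correctness per (student, KC) pair.

    observations: iterable of (student, kc, correct) with correct in {0, 1}.
    Returns the sorted student ids, the sorted KC ids, and a matrix with
    one row per student and one column per KC.
    """
    sums = defaultdict(lambda: [0, 0])  # (student, kc) -> [successes, trials]
    for student, kc, correct in observations:
        cell = sums[(student, kc)]
        cell[0] += correct
        cell[1] += 1
    students = sorted({s for s, _ in sums})
    kcs = sorted({k for _, k in sums})
    matrix = [[sums[(s, k)][0] / sums[(s, k)][1] if (s, k) in sums else None
               for k in kcs]
              for s in students]
    return students, kcs, matrix
```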


2.2 Step 2 Infer Feature Matrix 


In this step, the method is applied to the data to get some matrix. 
Currently, the pipeline has two possibilities at this stage, but we 
plan to include multiple methods in future work as we look to our 
long-term goal of building a shareable tool for the EDM 
community. 


2.2.1 Covariance Clustering 

Developed by Pavlik, Cen, Wu, and Koedinger [20], covariance 
clustering is a method to describe how each item or existing KC in 
a domain model is related to all other items or KCs (using a measure 
of conditional log odds to represent covariance). This method 
computes a vector for each item representing the conditional
probability table for success and failure for the items/KCs relative
to all other items/KCs. The pairwise relationships between each 
vector are similar to the relationships inferred in POKS (Partial 
Order Knowledge Spaces, [7; 8]), a method related to Falmagne’s 
work [10; 11]. An advantage of covariance clustering is that it 
characterizes each pairwise relationship between items/KCs in 
terms of the relationship with all other items/KCs. Pavlik et al. [20] 
used clustering to establish how to group items by using this 
KC/item relational vector as the set of features. 
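
To make the idea of a relational vector concrete, the sketch below computes, for each pair of items, a difference of conditional log odds from binary outcomes. This is only one plausible reading of "conditional log odds"; the exact statistic used in [20] may differ, and the smoothing constant and function names are assumptions.

```python
import math


def conditional_log_odds(a, b, smoothing=0.5):
    """Log odds of success on A given success on B, minus given failure on B.

    a, b: equal-length lists of 0/1 outcomes for the same students.
    `smoothing` is added to every cell to avoid division by zero (assumed).
    """
    counts = {(x, y): smoothing for x in (0, 1) for y in (0, 1)}
    for x, y in zip(a, b):
        counts[(x, y)] += 1
    odds_given_success = counts[(1, 1)] / counts[(0, 1)]
    odds_given_failure = counts[(1, 0)] / counts[(0, 0)]
    return math.log(odds_given_success) - math.log(odds_given_failure)


def relation_vectors(outcomes):
    """outcomes: {item: [0/1 per student]}; returns one vector per item."""
    items = sorted(outcomes)
    return {a: [conditional_log_odds(outcomes[a], outcomes[b]) for b in items]
            for a in items}
```

Each resulting vector describes one item by its relationship to every other item, which is the kind of feature set that can then be clustered.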


2.2.2 SPARFA-Lite 

Developed at Rice University by Lan, Studer, and Baraniuk [14],
SPARFA is a factor analysis method to extract factors from
binary-valued data. It provides an association matrix similar to a dAFM
Q-matrix, with graded associations of concepts with items. The “Lite”
version simplifies the method by reducing the parameters and
allowing automatic inference of the optimal number of concepts.
This method works differently than dAFM, but it provides similar
results, allowing for direct comparisons. Also, the ability to infer
the optimal number of concepts may be a useful constraint when
applying other algorithms.


2.3 Step 3 Cluster Principal Components of Features

In this step, the information matrix is clustered using some method 
to group items into clusters. Our current implementation first uses 
RSVD (Randomized Singular Value Decomposition) to simplify 
the information matrix. We see from the pattern in the results 
section how the quantity of RSVD components influences the 
clustering result. We are currently using K-means clustering for 
clustering, so our search is across both RSVD number of 
components (N) and K for the number of K means clusters. 


For this step, we have also done considerable experimentation with 
the cmeans fuzzy clustering method, which provides a 0 to 1 index 
of how strongly each KC is associated with each cluster. Typically, 
we have used this by specifying a threshold (which can be 
optimized with search) over which an item belongs to each cluster 
or not. This assignment allows for membership in multiple clusters, 
which means that unlike the method in Step 2.4, the item is assigned 
to potentially many clusters. Typically, when we use this method, 
we have weighted the effect of prior practice for the KC clusters 
according to the number of KC clusters involved in a performance. 
This weighting is not necessary for the simpler K-means 
implementation since the added KC column only assigns each KC 
to 1 KC cluster. 
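
The thresholding of fuzzy memberships described above can be sketched as follows. The fallback to the strongest cluster when no membership reaches the threshold, and the equal 1/n weighting across the n assigned clusters, are assumptions of this example; the text does not give the exact weighting.

```python
def assign_clusters(memberships, threshold=0.3):
    """Turn fuzzy membership degrees into (possibly multiple) cluster labels.

    memberships: {kc: [degree for each cluster]}, degrees in [0, 1].
    Returns {kc: {cluster_index: weight}}.
    """
    assignment = {}
    for kc, degrees in memberships.items():
        clusters = [c for c, d in enumerate(degrees) if d >= threshold]
        if not clusters:
            # Assumed fallback: keep the single strongest cluster.
            clusters = [max(range(len(degrees)), key=degrees.__getitem__)]
        weight = 1.0 / len(clusters)  # assumed equal weighting
        assignment[kc] = {c: weight for c in clusters}
    return assignment
```

With a hard K-means assignment, each KC would map to exactly one cluster with weight 1, so no weighting is needed.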


2.4 Step 4 Fit with New Model 


We used the new model as an overlay such that we created a column 
with the cluster id for each KC for each trial. This overlay 
procedure means that while the KC and clusters are independent, 
practicing an item may affect other items if they share a cluster. To 
do this, we first describe our starting model, which was simply PFA
(Performance Factors Analysis, [17]) using the logarithms of
practice counts for successes and failures (adding 1 to each to permit
the logarithms), where the θ values are student ability and KC
difficulty, respectively, and S and F represent the counts of prior
successes and failures for KC j for student i.


$$\operatorname{logit}(p_{ijc}) = \beta_{1j} \log_e S_{ij} + \beta_{2j} \log_e F_{ij} + \theta_i + \theta_j$$


The new model was defined using cluster-id (c) as a KC in an 
additive compensatory model. Prior research suggests such 
compensation among KCs works well for prediction [6; 16]. 


$$\operatorname{logit}(p_{ijc}) = \beta_{1j} \log_e S_{ij} + \beta_{2j} \log_e F_{ij} + \beta_{3c} \log_e S_{ic} + \beta_{4c} \log_e F_{ic} + \theta_i + \theta_j$$
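
As a worked illustration, the additive model with KC and cluster terms can be evaluated as in the sketch below. Following the text, 1 is added to each count before taking the logarithm. The function name, argument order and the coefficient values in the usage example are hypothetical.

```python
import math


def pfa_cluster_probability(s_kc, f_kc, s_cluster, f_cluster,
                            beta, theta_student, theta_kc):
    """Predicted probability of success for one trial.

    s_kc, f_kc: prior success and failure counts for the KC.
    s_cluster, f_cluster: prior counts pooled over the KC's cluster.
    beta: (b1, b2, b3, b4) coefficients for the four log-count terms.
    """
    logit = (beta[0] * math.log(1 + s_kc)
             + beta[1] * math.log(1 + f_kc)
             + beta[2] * math.log(1 + s_cluster)
             + beta[3] * math.log(1 + f_cluster)
             + theta_student + theta_kc)
    return 1.0 / (1.0 + math.exp(-logit))


# Hypothetical coefficients: successes help, failures hurt slightly.
p = pfa_cluster_probability(3, 1, 10, 4, (0.4, -0.2, 0.1, -0.05), 0.3, -0.5)
```

Setting the two cluster coefficients to zero recovers the starting PFA model, which makes the two models directly comparable.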


Two versions of this model with clusters were compared; the first 
version was as described, and the second version was a control 
condition where the cluster column was sampled at random from 
the Q-matrix. This control condition should exhibit the same 
amount of overfitting due to adding parameters but none of the 
benefit of a coherent clustering solution. These models are 
compared using 2 to N components and 2 to K clusters by iterating 
to Step 3 to search a space of models. 
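
One simple way to build such a control condition is to shuffle the cluster labels across KCs, as sketched below. This keeps the number and size of clusters unchanged while destroying their coherence; whether the original control preserved cluster sizes is an assumption of this example.

```python
import random


def random_control(kc_to_cluster, seed=0):
    """Return a copy of the cluster assignment with labels shuffled across KCs."""
    kcs = sorted(kc_to_cluster)
    labels = [kc_to_cluster[kc] for kc in kcs]
    random.Random(seed).shuffle(labels)
    return dict(zip(kcs, labels))
```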


In the context of our future work, we plan to allow users of our tool 
to specify candidate models with different configurations and terms 
using a logistic knowledge tracing R package freely available [18]. 
It is possible that different learner models may be implemented at 
this step since the Q-matrices we are creating may be used in many 
types of learner models. 


2.5 Step 5 Splitting and Merging 

Just as steps 3 and 4 may iterate to find optimal K and N, steps 4 
and 5 may iterate to refine the model in Step 4. This step describes 
our future planning for a tool to optimize Q-matrix type models of 
knowledge domains. 


Splitting takes the original KC model and uses the KC model from
the clustered features to determine hypotheses for how KCs might
be split. So if a KC in the original model is in 2 clusters, the model
would test whether that was best represented by the default
model (include the effect of the cluster and the KC for each KC) or
whether the cluster was unnecessary and the fit was just as good when
splitting the KC into two different KCs and dropping the effect of
the cluster KC. Further, we could also test whether the specific
clusters proposed for each KC even improve fit by removing
them entirely as a third hypothesis. Two of these three possibilities
correspond to Learning Factors Analysis (LFA) [5], and the third
(including the cluster instead of using a split) advances the
approach.


Merging uses the cluster model like LFA, but instead of splitting 
KCs, the clusters are used to evaluate three hypotheses about 
whether existing KCs can be merged into a single KC. One 
hypothesis is that the KCs are best represented as separate but 
influence each other through the shared cluster membership. The 
second hypothesis is that the specific cluster was unnecessary, and 
the two KCs should be merged into 1 KC. Finally, the third 
hypothesis is that the 2 KCs are separate and that the cluster 
predictor should be omitted. 


Step 5 is similar to backward and forward stepwise regression
methods, so it is very likely to cause overfitting because it
iteratively tailors the model terms to the data. To prevent this
problem, the solutions produced are robustly cross-validated. By
tuning the model to maximize cross-validation accuracy, we aim to
find quantitative thresholds for when to add or subtract terms from
the model, with a result that is efficient and parsimonious.

Figure 1. Changes in fit including differing numbers of additional
clusters for Cloze dataset using covariance clustering (CC) or
SPARFA-lite (SL).


3. DATASETS 


The statistics cloze dataset included 58,316 observations from 478 
participants who learned statistical concepts by reading sentences 
and filling in missing words. Participants were adults recruited 
from Amazon Mechanical Turk. There were 144 KCs in the dataset, 
derived from 36 sentences, each with 1 of 4 different possible 
words missing (cloze items). Both the number of times specific cloze
items were presented and the temporal spacing between presentations
(narrow, medium, or wide) were manipulated. The post-practice test
(filling in missing words) occurred after 2 minutes, 1 day, or 3 days
(manipulated between students). The stimuli type,
manipulation of spacing, repetition of KCs and items, and multiple- 
day delays made this dataset appropriate for evaluating model fit to 
well-known patterns in human learning data (e.g., substantial 
forgetting across delays, benefits of spacing). The dataset was 
downloaded from the Memphis Datashop repository. 


In the Andes dataset, 66 students learned physics using the Andes 
tutoring system, generating 345,536 observations. Participants 
were given feedback on their responses as well as solution hints. 
Additionally, participants were asked qualitative “reflective” 
questions after feedback (for more details, see [12]).

Figure 2. Changes in fit including differing numbers of additional
clusters for Andes Physics data using covariance clustering (CC) or
SPARFA-lite (SL).


4. RESULTS 


Figures 1 and 2 show the results for the two datasets. The proportion
of R-squared gain is the improvement in R-squared of the true
clustered model over the random comparison model, expressed as a
proportion of the random comparison model's R-squared. Because of
the results' preliminary nature, we have not been able to produce
smoothed figures through cross-validation. However, the results
consistently show beneficial effects. In general, both methods have
similar accuracy.
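The proportion of R-squared gain defined above amounts to a one-line computation. The sketch below is illustrative only: the function name and the example R-squared values are hypothetical and are not results from the paper.

```python
def proportional_r2_gain(r2_clustered: float, r2_random: float) -> float:
    """Gain of the clustered model over the random comparison model,
    expressed as a proportion of the random comparison model's R-squared."""
    if r2_random <= 0:
        raise ValueError("the random comparison R-squared must be positive")
    return (r2_clustered - r2_random) / r2_random


# Hypothetical values: the clustered model reaches 0.33, the random one 0.30.
gain = proportional_r2_gain(0.33, 0.30)  # roughly 0.10, i.e. a 10% gain
```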


Both methods can achieve similar improvements via different 
parameters. However, it does appear that the efficacy of the 
methods differs somewhat across datasets. Covariance clustering 
found the best solution in the Andes dataset, with SPARFA-Lite 
having the best solution in the Cloze dataset. This preliminary result 
suggests that applying multiple approaches to the same dataset may 
be advisable, especially when the underlying domain structure is 
unknown. Different domain modeling algorithms may differ in 
their ability to detect this underlying domain structure. 


To better understand the results shown in Figures 1 and 2, we can
query the model for the parameters of the cluster KCs to confirm
that they are meaningful given the structure of PFA. Normally we
would expect the cluster KC coefficients for success and failure to
differ more if the model is labeling real KCs, since successes
typically predict future success more than failures do. Indeed, a
test comparison shows exactly this pattern; for example, considering
the model with 10 components and 10 KCs for the Physics data with
covariance clustering applied, the success coefficient is .56 higher
than the failure coefficient. In contrast, for the randomized model,
the success coefficient was only .12 higher than the failure
coefficient (best explained as the overfitting we might expect for
such a mechanism).
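To see why the gap between the success and failure coefficients matters, consider the standard PFA form for a single KC, where the logit is an intercept plus a success coefficient times prior successes plus a failure coefficient times prior failures. The sketch below assumes that single-KC form; the function name and all coefficient values are hypothetical, chosen only so that the gap equals the .56 reported above.

```python
import math


def pfa_probability(intercept: float, gamma: float, rho: float,
                    successes: int, failures: int) -> float:
    """Predicted probability of a correct response for one KC under PFA.

    gamma weights prior successes and rho weights prior failures.
    """
    logit = intercept + gamma * successes + rho * failures
    return 1.0 / (1.0 + math.exp(-logit))


# Hypothetical coefficients with a gap of 0.56 between success and failure.
after_successes = pfa_probability(-0.5, 0.66, 0.10, successes=3, failures=0)
after_failures = pfa_probability(-0.5, 0.66, 0.10, successes=0, failures=3)
```

With a larger gap, three prior successes raise the prediction noticeably more than three prior failures do; with a gap near zero, the cluster KC would only be counting practice opportunities.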


5. DISCUSSION 


In the present paper, we described our ongoing work to automate 
both the process of searching for domain models and the search 
method (e.g., covariance clustering vs. SPARFA-Lite). Many
approaches have been proposed to infer domain models [3; 5; 14; 15; 19;
20; 22], but there has been little comparison. However, comparison
among approaches is important because their different underlying 
assumptions and limitations will interact with the learning domain's 
true underlying structure. For example, if the learning domain is 
calculus, various prerequisite skills from other branches of 
mathematics may be necessary (e.g., algebra, trigonometry). In 
other domains, learning one KC before another may enhance 
learning but not be required. Learning how to compute a sample's 
mean may facilitate learning to compute the median due to 
contrasting their different procedures. However, neither is a 
prerequisite to learn the other. Domains vary in the extent that 
learning one KC may transfer to another, and the researcher may 
not have strong theories a priori that could help constrain the KC 
model search. Thus, choosing one method with specific 
assumptions and limitations across different knowledge domains 
may be inadvisable and result in suboptimal KC model solutions. 


5.1 Future Plans 
5.1.1 Additional domain model methods 


There are several methods we hope to include in the system to 
analyze student data to produce the inference matrix, for example: 


dAFM - Developed at Berkeley by Pardos and Dadu [15] and 
shown to improve the Piech [21] deep knowledge tracing 
algorithm. This method is a deep learning model that uses 
backpropagation to infer a Q-matrix type representation with 
graded skill assignments instead of binary assignments. The 
authors show how the model is a continuous neural network 
generalization of the AFM model used in the LFA method [5]. 


Tensor factorization - We have also been working with
implementations of tensor factorization. Tensors allow the solution 
to integrate multiple sources of data, including a representation of 
time in the sequence of practice. 


These methods may be more accurate because they allow the
representation of sequence to capture order effects in the model that
may be due to learning. However, this same sensitivity may make
them more vulnerable to the selection effects that led to
particular practice sequences. In other words, domain model search
algorithms that are sensitive to effects over time may be more likely 
to incorporate artifacts due to pedagogical decision rules (e.g., 
“drop item from practice after N successes”). For instance, in 
systems in which items are dropped from practice after a few 
successes (e.g., Assistments), the sequential order and temporal 
spacing will be different than in practice schemes in which items 
are not dropped from practice (e.g., [9]). In short, domain model 
extraction from datasets that were generated by an adaptive 
learning system will be influenced by the decision rules inherent to 
that system. 


5.1.2 An ensemble approach to address individual 
model search limitations 


We also intend to allow multiple approaches to be combined within
a single KC model development pipeline. For instance, approaches
like dAFM have shown promise to improve KC models but require 
an initial KC model. However, this apparent limitation is only a 
problem if the goal is to find a single approach that resolves the 
problem of KC modeling. Instead, the goal can be reoriented 
towards finding the best ensemble and ordering of approaches that 
can be used in order to develop an optimal KC model. As an 
example, an optimal KC model may be created by making an initial 
model with SPARFA-Lite, followed by a final model using dAFM. 
Requiring a starter model is only a limitation if complementary 
approaches cannot be combined. 


5.1.3 Integrating with learner model development 
Learner models and domain models are strongly interdependent but 
frequently developed and refined independently. This separation 
probably limits progress on both fronts. Using relatively simple 
learner models when searching for improved domain models may 
lead to misleading results if the chosen learner model does not 
accurately represent learning, forgetting, transfer, and other 
important learning factors. Similarly, developing learner models 
without considering the chosen KC model's plausibility may lead 
to spurious results. Recently, we developed a framework to 
facilitate learner model development named Logistic Knowledge 
Tracing [18]. We aim to integrate automated KC model search and 
refinement into the LKT framework. 


5.1.4 Representing transfer does more than improve 
model fit 

Representing transfer among KCs can have significant pedagogical 
consequences that will not be apparent from model fit metrics (e.g., 
reduced RMSE, increased AUC). For instance, imagine a student is 
learning three items (A, B, and C). If the domain model considers 
A and B to be related because they share a KC, practicing item A will
influence both when and how much B is practiced. Depending on
the strength of the transfer, practicing A may result in B
being practiced before C, being practiced after C,
or not being practiced until much later when forgetting has occurred
(if the learner model assumes forgetting). The entire order of
practice may change.
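The effect on practice order can be illustrated with a toy scheduler. This is a deliberately simplified sketch and not any system discussed in this paper: the rule "always practice the weakest unmastered item", the additive strength update, the mastery threshold, and the transfer weights are all hypothetical assumptions.

```python
from typing import Dict, List


def practice_until_mastery(items: List[str],
                           transfer: Dict[str, Dict[str, float]],
                           threshold: float = 3.0,
                           gain: float = 1.0) -> List[str]:
    """Return the practice order produced by a weakest-item-first scheduler."""
    strength = {item: 0.0 for item in items}
    order: List[str] = []
    while True:
        pending = [i for i in items if strength[i] < threshold]
        if not pending:
            return order
        # Practice the weakest pending item (ties go to the earlier item).
        target = min(pending, key=lambda i: strength[i])
        order.append(target)
        strength[target] += gain
        # Practicing the target also strengthens items it transfers to.
        for other, weight in transfer.get(target, {}).items():
            strength[other] += weight * gain


independent = practice_until_mastery(["A", "B", "C"], {})
shared_kc = practice_until_mastery(
    ["A", "B", "C"], {"A": {"B": 0.5}, "B": {"A": 0.5}})
```

Under these assumptions the independent model schedules A, B, C in rotation for 9 trials, while the model with transfer between A and B schedules A, C, B, C, A, B, C for 7 trials, so both the order and the total amount of practice change.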


Another issue is the efficiency effect of such transfer. Consider that 
if the three items are independent, students may practice all three 
as necessary for mastery. In contrast, if item A affects item B 
through a shared KC, it will increase or reduce the amount of 
practice needed for mastery of B, which can reduce costly 
overpractice. In short, accounting for transfer among KCs may 
greatly improve practice efficiency, which may not be apparent 
when comparing domain models in terms of model fit metrics (e.g., 
RMSE, AUC, AIC). Ultimately, comprehensive evaluation of new 
KC models requires simulations or experiments to determine their 
effects on how practice is scheduled within an adaptive learning 
system. This need comes from how a new KC model may interact 
with pedagogical decision rules (e.g., mastery learning) and learner 
models (e.g., BKT, PFA) within an adaptive learning system to 
change the sequence of practice (e.g., due to quantifying transfer 
among items differently). These changes to the sequence may have 
significant impacts on student learning.


ABSTRACT


Offering students immediate, formative feedback when they 
are programming can increase students’ learning outcomes 
and self-efficacy. However, visual and interactive programs 
include dynamic user input and visual outputs that change 
over time, making it difficult to automatically assess stu- 
dents’ code with traditional functional tests to offer this 
feedback. In this work, we introduce Execution Trace Based 
Feature Engineering (ETF), a feature engineering approach 
that extracts sequential patterns from execution traces, which 
capture the runtime behavior of students’ code. We evaluated
ETF on 162 student code snapshots from a Pong game
assignment in an introductory programming course, on a
challenging task to predict students’ success on fine-grained
rubrics. We found that ETF achieves an average F1 score
of 0.93 over 10 grading rubrics, which is 0.1 to 0.2 higher than
a high-performing syntax-based code classification approach 
from prior work. These results show that ETF has strong 
potential to be used for code classification, to enable forma- 
tive feedback for students’ visual, interactive programs. 


Keywords 
execution traces, feature engineering, computer science ed- 
ucation, code classification, formative assessment 


1. INTRODUCTION 


Real-time, formative feedback promotes students’ learning 
gains and self-efficacy [7, 3, 12, 18]. To provide such forma- 
tive feedback in real-time, CS instructors commonly write 
test cases, allowing students to run their code against these 
test cases when programming [9, 5, 6, 4]. However, visual, 
interactive programming projects, such as creating apps and 
games [16], include dynamic user interactions, and visual 
outputs that change over time, making it challenging to use 
test cases to assess these programs [14, 13, 20]. 


Figure 1: A horizontal n-Gram (n = 2, in yellow) and a
vertical n-Gram (n = 2, in blue).

In contrast to test case-based approaches, data-driven methods
allow instructors to offer formative feedback by grading
a smaller set of programs instead of writing test cases
[11, 21, 10, 22]. These methods start with transforming 
code into input vectors using feature engineering, typi- 
cally by extracting syntax elements from an abstract syntax 
tree (AST), where nodes and their children correspond to 
specific code elements (e.g., if statements). However, when 
applying these syntax-based feature extraction techniques to 
classify programs, prior work showed mixed results, which 
are often not high enough to ensure the quality of student 
feedback [11, 1, 2]. Some prior work has used execution 
traces to classify students’ sorting programs based on their 
specific strategies, and has shown that execution-trace-based 
classification achieved higher accuracies than a syntax-based 
classification approach [8]. However, no prior work has con- 
ducted feature extraction on the execution trace of visual, 
interactive programs, which include dynamic user interac- 
tions and various changing outputs. In this work, we ex- 
tract features from execution traces that capture the runtime 
behavior of visual, interactive programs. We designed an ex- 
ecution trace-based feature engineering approach (ETF) to 
transform students’ source code into feature vectors, for clas- 
sification algorithms to build models based on rubric-based 
labels (e.g., the presence of a key-triggered movement). We 
evaluated ETF by classifying 162 in-progress and
submitted student code snapshots. We found it to achieve high
prediction performance, with an average F1 score of 0.93
over 10 grading rubric items, which is 0.1 to 0.2 higher than a
high-performing syntax-based code classification approach.
Our work has the following contributions: 1) We designed 
and implemented a novel, execution trace-based feature en- 
gineering (ETF) approach to extract temporal patterns in 
students’ visual, interactive programs; 2) We evaluated the 
ETF approach on students’ code snapshots for a widely- 
used, representative visual, interactive program assignment. 


2. RELATED WORK 


Figure 2: Step 1: Generating Execution Traces (panel b: Dumping
Execution Traces).

Syntax-based approaches extract patterns inside a code
AST and use the presence or absence of a feature, or the count
of the feature, to generate input vectors. As an example,
we explain a recently applied AST n-Gram feature extraction
approach [17, 2] by analogy to Natural Language Processing (NLP).
In NLP, a 1-Gram (an n-Gram with n = 1) is a feature taken from
each word, and an n-Gram feature takes a contiguous sequence of
n words to capture relationships between words (e.g., order).
In an AST, 1-Gram features represent each node in students’ code.
To extract structural relationships, Akram et al. [2] used
n-Grams to represent n-length sequences of code,
where a vertical n-Gram is created by a depth-first search
of leaves, and a horizontal n-Gram is created by a breadth-first
search of all direct children of each AST node (e.g.,
in Figure 1) [2]. 1-Gram and n-Gram-based AST feature
extraction approaches have both been applied to student
code classification tasks. Prior work has shown that n-Grams
provide more predictive features for code analysis than
1-Grams. For example, Akram
et al. used n-Grams with n ranging from 1 to 4 to extract
features, and used a Gaussian Process model to infer scores
on 642 students’ code pieces in a block-based programming
environment. This achieved an R-squared of 0.94, higher than
the 0.88 achieved by the baseline 1-Gram approach [2, 1].
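The vertical and horizontal n-Grams described above can be sketched on a toy tree. This is one reading of the description, not Akram et al.'s implementation: the nested-tuple AST, the node labels, and both function names are assumptions made for illustration. A vertical n-Gram is taken here as n consecutive labels along a parent-to-child path, and a horizontal n-Gram as n consecutive labels among the direct children of one node.

```python
from typing import List, Tuple

# A node is (label, children); this nested-tuple AST is a toy stand-in.
Node = Tuple[str, list]
Gram = Tuple[str, ...]


def vertical_ngrams(node: Node, n: int, path: Gram = ()) -> List[Gram]:
    """Collect n-length label sequences along parent-to-child paths."""
    label, children = node
    path = path + (label,)
    # Emit the n-Gram that ends at this node, if the path is long enough.
    grams = [path[-n:]] if len(path) >= n else []
    for child in children:
        grams.extend(vertical_ngrams(child, n, path))
    return grams


def horizontal_ngrams(node: Node, n: int) -> List[Gram]:
    """Collect n-length label sequences among each node's direct children."""
    _, children = node
    labels = [child[0] for child in children]
    grams = [tuple(labels[i:i + n]) for i in range(len(labels) - n + 1)]
    for child in children:
        grams.extend(horizontal_ngrams(child, n))
    return grams


tree: Node = ("script", [
    ("if", [("touching", []), ("changeScore", [])]),
    ("forever", [("move", [])]),
])
vertical = vertical_ngrams(tree, 2)
horizontal = horizontal_ngrams(tree, 2)
```

On this tree, `vertical` contains pairs such as ("script", "if") and ("if", "touching"), while `horizontal` contains ("if", "forever") and ("touching", "changeScore").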


On the other hand, recent work by Paaßen et al. used
execution-trace-based distance measures to classify programs
into different strategies (e.g., bubble sort vs. insertion
sort), and found that execution-trace-based classification
achieved 90% accuracy, higher than the 80% accuracy
achieved by syntax-based approaches [8]. This shows the
potential of using execution traces to classify students’
programming code.


3. THE ETF APPROACH 


ETF starts by collecting a set of students’ programming
code, along with a class label for each piece of code (i.e.,
positive or negative). ETF is designed for programs that have
the following properties: 1) they respond to dynamic user
inputs (e.g., mouse, keyboard); 2) they have object-specific
program states, corresponding to visual output on the screen;
and 3) their behaviors can be a function of time and can
also change over time.


An Example Assignment. As an example, consider the 
Pong assignment, which consists of a paddle sprite and a 
ball sprite. The ball moves around the stage [19], and a 
player can use the keyboard to control the up and down 
movement of the paddle to catch the running ball. If the 
paddle catches the ball, the player score increases; but if 
the paddle misses the ball and the ball hits the back wall, 
the game ends. ETF conducts feature engineering on such 
visual, interactive programs in four steps, described below. 


Step window   ball: Move?          Turn?   TouchSprite?
119-120       yes {right, down}    no      no
120-121       yes {right, down}    yes     yes {paddle}
121-122       yes {left, down}     no      no

Summary Trace:
row 1: {Move({right, down})}
row 2: {Move({right, down}), Turn, TouchSprite({paddle})}
row 3: {Move({left, down})}

a) Step 2: Generating Summary Trace

1-Grams:
(row 1) [{Move}]
(row 2) [{Move, Turn, TouchSprite}]
(row 3) [{Move}]

2-Grams:
(row 1-2) [{Move}, {Move, Turn, TouchSprite}]
(row 2-3) [{Move, Turn, TouchSprite}, {Move}]

b) From Step 3: 1-Grams and 2-Grams

Figure 3: Step 2&3: summarizing traces & generating features.

3.1 Step 1: Generating Execution Traces 
Visual, interactive programs include program states that
can be represented as properties that change over time.
For example, in Snap!, these properties can include: 1)
Time: how many milliseconds have passed since the start
of program execution; 2) inputs: KeysDown (which keys are
pressed), MouseDown (whether the mouse is pressed), and MouseX
and MouseY (the x, y position of the mouse); 3) global variables:
the names (Var.Name) and values (Var.Value) of global variables;
4) sprite-specific properties: properties that are related to
specific sprites, such as (x, y) (x, y coordinates); dir
(direction); TouchSprite (which sprites the current sprite is
touching); TouchEdge (which stage edge the sprite is touching);
size (sprite size); OffStage (whether the sprite has
moved off the stage); and the names (Var.Name) and values
(Var.Value) of local variables. The dumped execution trace tables
use sprite names to label sprite-specific properties (e.g.,
to distinguish ball.x from paddle.x).¹


Systems such as Snap! and SCRATCH make use of step func- 
tions to update the above properties based on the current 
properties and the current user inputs [14, 19]. We instru- 
mented the step function in Snap! with a trace logging tool, 
so that with each Step, it adds a row in an execution trace 
table with the properties listed above, and dumps the trace 
table at the end of the execution. Figure 2 gives an example 
of a part of the execution trace table, in which one row logs 
one discrete Step created by the step function, and each
entry maps to a property (i.e., a concrete program state).
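The logging described above can be pictured with a small Python stand-in. The actual instrumentation lives inside the Snap! step function and is not shown in this paper; the class `TraceLogger`, its method names, and the row layout below are hypothetical and only illustrate the idea of appending one row per Step and dumping the table at the end.

```python
from typing import Any, Dict, Iterable, List


class TraceLogger:
    """Collects one row of program state per step (illustrative only)."""

    def __init__(self) -> None:
        self.rows: List[Dict[str, Any]] = []

    def log_step(self, time_ms: int, keys_down: Iterable[str],
                 sprites: Dict[str, Dict[str, Any]]) -> None:
        """Append a row holding the global and sprite-specific properties."""
        row: Dict[str, Any] = {
            "Step": len(self.rows),
            "Time": time_ms,
            "KeysDown": tuple(keys_down),
        }
        # Prefix sprite-specific properties with the sprite name,
        # e.g. ball.x versus paddle.x.
        for sprite_name, properties in sprites.items():
            for prop, value in properties.items():
                row[f"{sprite_name}.{prop}"] = value
        self.rows.append(row)

    def dump(self) -> List[Dict[str, Any]]:
        """Return the full trace table at the end of the execution."""
        return list(self.rows)


logger = TraceLogger()
logger.log_step(0, [], {"ball": {"x": 0, "y": 0}, "paddle": {"x": -200, "y": 0}})
logger.log_step(33, ["up arrow"], {"ball": {"x": 1, "y": -1},
                                   "paddle": {"x": -200, "y": 10}})
trace_table = logger.dump()
```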


3.2 Step 2: Summarizing Traces 

The ETF algorithm scans through the execution trace table with
a sliding window of multiple steps (2 by default) and applies a
Trace Abstraction Function (TAF) to each window. The TAF looks for
properties based on candidate properties, and only returns
properties that were found in the sliding window as a
summarized property set. A candidate property can be an
abstract property, describing a change between steps in
the execution trace, such as a movement or a variable change.
A candidate property can also be an original trace property,
which was already recorded in the execution trace.
For example, in the step window 120-121 of Figure 3, all candidate
properties were found because there is a change in position
(Move), a change in direction (Turn), and a non-empty
TouchSprite column in Step 121. This creates the summarized
property set {Move, Turn, TouchSprite}, shown in Row 2 of
the summary trace (Figure 3). In addition, the TAF’s candidate
properties also include possible types of parameters, which
describe detailed information about the property.

¹ ETF uses these properties to summarize traces and generate
features (Section 3.3). To allow comparison across students,
sprites need to have consistent labels across student
programs.


For Pong, ETF used 9 types of candidate properties. Except for
2 program state properties (KeysDown and ChangeVar),
the remaining 7 are sprite-specific properties and are labeled with
the sprite names (e.g., Paddle.Move). Among the 9 candidate
properties, 4 are original trace properties, which are
returned directly when the corresponding property in the last step of
the sliding window is non-empty: KeysDown, TouchSprite,
TouchEdge, and OffStage. These use the same parameters as
the corresponding execution trace entry at the last step of
the sliding window, explained in Section 3.1. The other 5 are
abstract properties, which only check whether a property changes
between the first and last step of the sliding window (any
intermediate steps are ignored):

1) Move(←, →, ↑, ↓). Returned when a sprite’s position changes.
Its parameters are the directions in which the sprite moves.
2) Turn. Returned when a sprite changes direction.
3) ChangeSize(+, -). Returned when the sprite becomes bigger
(+) or smaller (-).
4) ChangeVar(variable names(+, -)). Returned when a global
variable’s value becomes bigger (+) or smaller (-).
5) ChangeLocalVar(variable names(+, -)). Returned when a local
variable’s value becomes bigger (+) or smaller (-).
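The candidate properties listed above suggest how a Trace Abstraction Function might look for a single sprite. The sketch below is a simplified assumption, not the authors' implementation: it covers only Move, Turn, and TouchSprite, uses hypothetical row keys such as `ball.x` and `ball.dir`, and assumes that y increases upward. The example rows are made up to mirror the three step windows of Figure 3.

```python
from typing import Any, Dict, FrozenSet, List

Row = Dict[str, Any]
PropertySet = Dict[str, FrozenSet[str]]


def abstract_window(first: Row, last: Row, sprite: str) -> PropertySet:
    """Summarize one sliding window into the properties that were found."""
    found: PropertySet = {}
    dx = last[f"{sprite}.x"] - first[f"{sprite}.x"]
    dy = last[f"{sprite}.y"] - first[f"{sprite}.y"]
    directions = set()
    if dx > 0:
        directions.add("right")
    if dx < 0:
        directions.add("left")
    if dy > 0:
        directions.add("up")
    if dy < 0:
        directions.add("down")
    if directions:  # abstract property: the position changed
        found["Move"] = frozenset(directions)
    if last[f"{sprite}.dir"] != first[f"{sprite}.dir"]:
        found["Turn"] = frozenset()  # abstract property without parameters
    touching = last.get(f"{sprite}.TouchSprite")
    if touching:  # original trace property, read from the last step
        found["TouchSprite"] = frozenset(touching)
    return found


def summarize(trace: List[Row], sprite: str, window: int = 2) -> List[PropertySet]:
    """Slide a window over the trace and abstract each window."""
    return [abstract_window(trace[i], trace[i + window - 1], sprite)
            for i in range(len(trace) - window + 1)]


# Made-up rows for steps 119 to 122, chosen to reproduce Figure 3.
trace = [
    {"ball.x": 0, "ball.y": 0, "ball.dir": 135, "ball.TouchSprite": ()},
    {"ball.x": 1, "ball.y": -1, "ball.dir": 135, "ball.TouchSprite": ()},
    {"ball.x": 2, "ball.y": -2, "ball.dir": 225, "ball.TouchSprite": ("paddle",)},
    {"ball.x": 1, "ball.y": -3, "ball.dir": 225, "ball.TouchSprite": ()},
]
summary = summarize(trace, "ball")
```

The middle window yields Move, Turn, and TouchSprite together, matching Row 2 of the summary trace in Figure 3.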


3.3 Step 3: Generating n-Gram-based Features 


ETF next transforms the summary trace created by Step 2 into
a set of features using an n-Gram-based approach, where an
n-Gram takes a contiguous sequence of n items in the data
(Figure 3). ETF extracts features of 4 types: 1) 1-Grams,
which capture simultaneous behaviors and are taken from each row
of the summary trace; 2) 2-Grams, which connect adjacent
1-Grams sequentially; 3) Power sets: we extract n-Grams
not only of the full property set in each row of the summary
trace, but also of subsets of the property set, such
as the 2-set of just Move and Turn from the 1-Gram {Move,
Turn}. When constructing power sets for 2-Grams, we
apply the power set to the types of properties that are possible
in this 2-Gram; 4) for all the n-Grams extracted above,
we keep both non-parameterized n-Grams, where we do
not record the parameters, and parameterized
n-Grams, where each property includes its parameters
as they were logged in the summary trace. Next, ETF
collects the distinct features from all students’ feature sets
as the full feature set.
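A rough sketch of the non-parameterized 1-Gram and 2-Gram features with power sets follows. It rests on one possible reading of the power-set rule for 2-Grams, namely pairing every non-empty subset of one row with every non-empty subset of the next row; that reading, the function names, and the tuple-of-frozensets feature encoding are assumptions. Parameterized n-Grams are omitted.

```python
from itertools import combinations
from typing import FrozenSet, Iterable, List, Set, Tuple

Feature = Tuple[FrozenSet[str], ...]


def nonempty_subsets(properties: Iterable[str]) -> List[FrozenSet[str]]:
    """All non-empty subsets of a row's property names."""
    names = sorted(properties)
    return [frozenset(combo)
            for size in range(1, len(names) + 1)
            for combo in combinations(names, size)]


def ngram_features(summary: List[Iterable[str]]) -> Set[Feature]:
    """Distinct 1-Gram and 2-Gram features of one summary trace."""
    row_subsets = [nonempty_subsets(row) for row in summary]
    features: Set[Feature] = set()
    # 1-Grams: every non-empty subset of each row.
    for subsets in row_subsets:
        features.update((subset,) for subset in subsets)
    # 2-Grams: subsets of one row followed by subsets of the next row.
    for current, following in zip(row_subsets, row_subsets[1:]):
        features.update((a, b) for a in current for b in following)
    return features


# Property names of the three summary rows shown in Figure 3.
summary_rows = [{"Move"}, {"Move", "Turn", "TouchSprite"}, {"Move"}]
features = ngram_features(summary_rows)
```

For these three rows the sketch produces 7 distinct 1-Grams and 13 distinct 2-Grams.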


3.4 Step 4: Filtering Features 

Merging duplicate features and removing rare features.
Based on the full feature set generated from Step
3, if features have exactly the same distribution among programs,
the ETF algorithm merges these features into one
feature. It then calculates the support of each feature as
the proportion of student programs that include this feature,
and removes features that have support lower than a
certain threshold, determined by a hyperparameter tuning
process described in Section 4.2.


Generating feature vectors. After generating features, merging
duplicates, and removing rare features, we use the resulting
feature set as the independent variables. For each student
program, we use 1 to represent the presence of a feature
in the student program (i.e., the n-Gram appeared at least
once in its abstracted execution trace) and 0 to represent the
absence of the feature, generating a 0-1 vector for each
student’s code snapshot, which is used as the input vector for a
classification model.
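Step 4 can be sketched in a few lines. This is a simplified illustration under stated assumptions: features are plain strings, merged features are named by joining the originals with "+", and support is computed over all given programs rather than over a training split. The function name and the example feature sets are hypothetical.

```python
from typing import Dict, List, Set, Tuple


def filter_and_vectorize(program_features: List[Set[str]],
                         min_support: float) -> Tuple[List[str], List[List[int]]]:
    """Merge duplicate features, drop rare ones, and build 0-1 vectors."""
    n_programs = len(program_features)
    all_features = sorted(set().union(*program_features))
    # Group features by their presence pattern across programs; features
    # with an identical pattern are duplicates and are merged.
    by_pattern: Dict[Tuple[int, ...], List[str]] = {}
    for feature in all_features:
        pattern = tuple(int(feature in feats) for feats in program_features)
        by_pattern.setdefault(pattern, []).append(feature)
    kept = []
    for pattern, names in by_pattern.items():
        support = sum(pattern) / n_programs
        if support >= min_support:  # remove features below the threshold
            kept.append(("+".join(names), pattern))
    feature_names = [name for name, _ in kept]
    vectors = [[pattern[i] for _, pattern in kept] for i in range(n_programs)]
    return feature_names, vectors


# Hypothetical feature sets for four program snapshots.
programs = [
    {"Move", "Turn", "TouchSprite", "KeysDown"},
    {"Move", "Turn", "TouchSprite"},
    {"Move"},
    {"Move", "OffStage"},
]
names, vectors = filter_and_vectorize(programs, min_support=0.5)
```

Here Turn and TouchSprite always co-occur and are merged into one column, while KeysDown and OffStage appear in only one of four programs and are dropped.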


4. EVALUATION 


We investigate our research question: How accurately
does ETF perform rubric-based code classification of
students’ in-progress and submitted code, and how
does this compare to syntax-based approaches? We
first a) compare the performance of ETF features and
syntax-based features across models, and next b) compare ETF
features and syntax-based features across rubrics on a
fixed model. Our analyses of a) and b) follow the same
procedure: we started by generating ETF and syntax-based
features separately (Section 4.1), and next performed
the same feature filtering, training, and evaluation procedure
on the features we created (Section 4.2).


Dataset. We evaluated ETF on 162 code snapshots from 42 students
for a Pong game assignment, sampling student code
snapshots at 10 minutes (42), 20 minutes (40), and 30 minutes
(38) of work, as well as their final submissions (42), on
10 target evaluation items: key_up, key_down, upper_bound,
lower_bound, space_start, edge_bounce, paddle_bounce,
paddle_score, reset_score, and reset_ball. A detailed description
and the prevalence of the data can be found in our prior work [19].


4.1 Generating Features 

Generating ETF features. We used the procedure described
in Steps 1 to 3 (Sections 3.1-3.3) of the ETF approach
to generate ETF features. We first automatically
ran each program based on inputs. To ensure coverage, we
used the same inputs (up/down arrow key, follow/evade ball)
as in our prior work [19], defined using SNAPCHECK. For
each program snapshot, we re-executed the program 5 times.
Each run of a student program generated one execution trace table.


Generating AST n-Gram and 1-Gram Features. We
compare ETF with a representative, syntax-based
feature extraction approach that has performed well in prior
evaluations: the AST n-Gram feature extraction approach
[2]. Similar to Akram et al.’s work, we extracted all
n-Grams from all ASTs, using n = 1 to 5 for vertical
n-Grams, and n = 1 to 4 for horizontal n-Grams (explained
in Section 2). Similar to many AST feature extraction
approaches [10, 11, 22], we used a single label for all literals
(literal).


4.2 Feature Filtering & Evaluation 
We applied the same feature filtering and evaluation to the
ETF, AST n-Gram, and AST 1-Gram features:


Feature Filtering. For fairness of comparison, after collecting
features, we used Step 4 of the ETF algorithm
to filter the ETF, AST n-Gram, and AST
1-Gram features. For each of the three feature types, we
started by using ETF to automatically merge duplicate features
(Step 4.a) and remove features that have support
smaller than a certain threshold in the training set (Step
4.b). The threshold is set as a hyperparameter (tuned as
described below).


Table 1: F1 scores of AST 1-Grams, AST n-Grams,
and ETF Features, over different models.

Model                 AST 1-Grams   AST n-Grams   ETF Features
Logistic Regression   0.771         0.779         0.932
AdaBoost              0.78          0.78          0.922
Random Forest         0.763         0.773         0.926
MLP                   0.764         0.771         0.908
Gaussian Process      0.739         0.728         0.923
SVM                   0.759         0.771         0.93


Classification Models. To ensure that our comparison 
was not model-specific, we used 6 different models on the 
feature set: Logistic Regression, AdaBoost, Random Forest, 
Multi-layer perceptron (MLP), Gaussian Process, and SVM. 
Among them, the Gaussian Process model with an RBF kernel
was also employed by Akram et al., and was shown to be
the best performing model in the rubric-based performance
inference task to which they applied it [2].


Training & Evaluation. We employed 10-fold cross-validation
to evaluate how accurately these different features predict
the rubric-based performance. Within each round of
cross-validation, we used another 2-fold cross-validation to tune
the hyperparameters (i.e., nested cross-validation [15]). For
all models, we included a minimum feature support threshold
hyperparameter, T, below which we exclude a feature
(e.g., an ETF feature or AST n-Gram feature) from the final
feature set, tuned over 5 values: {0, 5%, 10%, 15%, 20%}.
Additionally, since different feature extraction approaches
may perform best with different values of model-specific
hyperparameters, we also tuned hyperparameters for each
classification model, based on the following values:

Logistic Regression: penalty in {L1, L2};
Random Forest: n_estimators (i.e., number of trees in the
forest) in {100, 200, 300, 400, 500};
AdaBoost: learning rate in {0.01, 0.1, 1};
MLP: learning rate in {0.001, 0.01, 0.1};
SVM: a linear kernel, with the regularization parameter (C)
in {0.01, 0.1, 1, 10, 100};
Gaussian Process: these models optimize kernel hyperparameters
during model fitting, so we did not tune hyperparameters for
the Gaussian Process classifier.

The values of the minimum feature support threshold and the
model-specific hyperparameters were determined by their F1
scores in the nested 2-fold cross-validation, based on a grid
search over 5 × (number of model-specific hyperparameter
values) possible hyperparameter combinations, during each
round of the 10-fold cross-validation.
Since many of our target labels are imbalanced, the accuracy
score offers less information on how well our model
performs in predicting target labels. We therefore use F1
scores to tune hyperparameters. To ensure that data from a
given student was not contained in both training and testing
sets, all cross-validation splits were done on the 42 students
(instead of on the 162 snapshots).
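Two details of this procedure are easy to get wrong: splitting by student rather than by snapshot, and scoring with F1 rather than accuracy. The following is a minimal standard-library sketch of both, not the authors' code. The round-robin assignment of students to folds, the function names, and the example data are assumptions.

```python
from typing import List, Sequence


def student_folds(snapshot_students: Sequence[str], k: int) -> List[List[int]]:
    """Assign snapshot indices to k folds so that no student spans two folds."""
    students = sorted(set(snapshot_students))
    # Round-robin assignment of students (not snapshots) to folds.
    fold_of = {student: i % k for i, student in enumerate(students)}
    folds: List[List[int]] = [[] for _ in range(k)]
    for index, student in enumerate(snapshot_students):
        folds[fold_of[student]].append(index)
    return folds


def f1_score(y_true: Sequence[int], y_pred: Sequence[int]) -> float:
    """F1 for the positive class (label 1); returns 0.0 without true positives."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


# Hypothetical owner of each snapshot; several snapshots share a student.
owners = ["s1", "s1", "s2", "s3", "s3", "s3", "s4"]
folds = student_folds(owners, k=2)
```

Each fold in turn serves as the test set while the remaining folds form the training set; the same function can be applied again inside the training students to obtain the inner tuning folds.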


4.2.1 Results 


Comparison Across Models. Using each of the 6 models
described above, we predicted students’ rubric-based
performance and calculated the F1 score over the 162 code
snapshots using 10-fold cross-validation, creating one F1 score
for each rubric. We next averaged the F1 scores for each
classifier across all rubrics, shown in Table 1. We saw that the
AST n-Gram approach performed similarly to the 1-Gram
approach (5 in 6 cases), and that ETF features generated
F1 scores that were consistently 0.14 to 0.2 higher than the
AST n-Gram features, showing that all classifiers benefited
from the execution-trace-based information extracted by the
ETF features, with overall F1 scores between 0.9 and 0.93.
This result shows the potential of ETF features
to correctly analyze students’ current progress, which
should enable automated, formative feedback in future work
to help a student who is stuck in the middle of programming.

Figure 4: The F1 score (F1), Precision (P), Recall
(R) and Accuracy (A) of ETF, n-Gram, and 1-Gram
features on each rubric, using an SVM model. The x-axis
starts from F1 = 0.5.


Performance Across Rubrics. We next investigate the
performance of the three feature extraction approaches across
rubrics. Since all models show similar trends, we select SVM
and present performance on all rubrics in Figure 4. We found that
1) the naive AST 1-Gram features had relatively lower F1
scores on rubrics that were less prevalent in the data (e.g.,
reset_score, reset_ball); and 2) compared to AST 1-Gram features,
the AST n-Gram features produced higher F1 scores for
paddle_score, reset_score, and reset_ball, showing that AST n-Grams
extracted more useful features for these three rubric items.
However, the ETF features performed relatively well across
all rubrics, with F1 scores ranging from 0.9 to 0.99, showing
that ETF features have strong potential to enable formative
feedback on a variety of fine-grained, specific rubrics.


In conclusion, we presented a novel, effective approach that 
extracted useful features from execution traces (ETF), lead- 
ing to high predictive accuracy. Our results show strong 
potential for using ETF to monitor student progress and 
offer automated, formative feedback. 
ABSTRACT


Online education technologies, such as intelligent tutoring
systems, have garnered popularity for their automation.
Whether it be automated support systems for teachers (grading,
feedback, summary statistics, etc.) or support systems for
students (hints, common wrong answer messages, scaffolding),
these systems have built a well-rounded support system for
students and teachers alike. The automation of these online
educational technologies has often been limited to questions
with well-structured answers, such as multiple choice or fill in
the blank. Recently, these systems have begun adopting support
for a more diverse set of question types, specifically open
response questions. A common tool for developing automated
open response tools, such as automated grading or automated
feedback, is pre-trained word embeddings. Recent studies have
shown that there is an underlying bias within the text these
embeddings were trained on. This research aims to identify
what level of unfairness may lie within machine learned
algorithms which utilize pre-trained word embeddings. We
attempt to identify whether our ability to predict scores for
open response questions varies for different groups of student
answers, for instance, between students who use fractions and
students who use decimals. By performing a simulated study, we
are able to identify the potential unfairness within our
machine learned models with pre-trained word embeddings.


Keywords 
Natural Language Processing, Unfairness, Deep Learning, 
Word Embeddings, Pre-Trained Word Embeddings, Simu- 
lated Study 


1. INTRODUCTION 


In recent years, natural language processing (NLP) has been
at the forefront of machine learning in multiple fields.
Linguistics provides another source of information outside the
standard data from user logs. Instead of relying on
correlational assumptions from this data, inferences can be
deduced directly from the user's linguistics. While utilizing
linguistics in education isn't new, modern machine learning
and natural language processing have helped to automate the
analysis and provide effective tools for learning.


The development of more advanced deep learning has brought
a deeper semantic understanding of words within these
linguistic models. The emergence of word embeddings was
an important development in machine learning and NLP, but
the publishing of publicly available pre-trained word
embeddings, such as GloVe [21], trained on Wikipedia, or
Word2Vec [19], provided researchers with a powerful tool for
optimizing algorithms with linguistics. While word embeddings
were powerful for studies within areas such as MOOCs (e.g.,
[14][20]), smaller studies, with less robust linguistic data,
were unable to utilize this modern approach for capturing the
semantic relationships of words.


Since research has shown that some of the semantic meanings
inferred from pre-trained word embeddings can elicit
undesirable biases [2], the major question then becomes:
does this underlying bias suggest the algorithm or predictive
model will make unfair decisions? For instance, if an
algorithm utilizes linguistics and NLP with pre-trained word
embeddings, will the predictions be unfairly made from those
underlying biases? Our research attempts to explore:


1. Does the format in which a student writes an answer
(i.e. fractions vs. decimals), examined through 3
simulated studies, affect the scoring model and
potentially elicit unfair scoring?


2. What effect, through 3 simulated studies, if any, do
‘distractor’ words have on the unfairness?


3. Can underlying bias in pre-trained word embeddings
lead to unfairness in open response scoring models in
middle school mathematics?


2. BACKGROUND 

2.1 Intelligent Tutoring Systems 

In recent years, online educational technologies have been at
the forefront of learning for students. A common online
educational technology, intelligent tutoring systems (ITS) [4],
has been prevalent in education for many years. Some of
the most common ITS are ASSISTments [11], McGraw Hill's
ALEKS™, and Carnegie Learning's Cognitive Tutor™.
Through the use of both machine learning and software
engineering, these systems have been shown to be effective at
increasing students' scores on end-of-year standardized math
exams [25], and the effects of their intelligent tutoring
closely resemble the effect face-to-face tutoring has on
students [31]. Other ITS, such as AutoTutor [9], have
attempted to resemble face-to-face tutoring more directly
by developing automated conversations and dialogues between
students and the ITS [9]. However, most of the support
and benefits of these ITS have been limited to questions with
structured answers (i.e. multiple choice or fill in the blank
questions).


2.2 Automation of Intelligent Tutoring Systems 
Automated support of ITS is a draw for many teachers; one 
study noted that many utilize multiple choice questions for 
the efficiency and accuracy of grading [26]. However, since 
most of the automation is limited to questions with struc- 
tured answers, the content which teachers provide is lim- 
ited. Studies have looked to utilize NLP to automatically 
evaluate work or questions which require a student’s unique 
linguistics (i.e. open response questions, or essays) includ- 
ing [29][28][24][1][7][30][17]. While most of this research has 
been primarily focused on content outside of mathematics, 
our previous research, [6], looked to help teachers diversify 
the content which they provide students in middle school 
mathematics by utilizing traditional and modern NLP to 
develop an automated scoring model for open response mid- 
dle school mathematics questions. A more diverse set of 
question types can be beneficial to students and can elicit 
differing levels of cognition, as studies [18][15] note. 


2.3 Natural Language Processing 

Towards the goal of automating open response questions,
or any linguistic/NLP prediction task, the major task is
how to numerically represent words such that a machine
learned algorithm can generate an accurate prediction. One
of the simpler approaches utilizes the frequencies of
each unique word within the corpus, commonly known
as a Bag of Words approach. Easy to interpret and not
computationally intensive, this approach has been utilized
in studies such as [13] and is the foundation of more
advanced approaches such as [27][10][22]. This has
evolved into utilizing deep learning to generate word
embeddings such as GloVe [21] and Word2Vec [19].
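As a minimal sketch of the Bag of Words representation described above, the snippet below counts word frequencies over a small corpus. The whitespace tokenization and the two example answers are assumptions made for illustration only.

```python
from collections import Counter


def bag_of_words(documents):
    """Return a sorted vocabulary and one word-count vector per document."""
    # Assumption: lowercase whitespace tokenization; real pipelines usually do more.
    tokenized = [doc.lower().split() for doc in documents]
    vocabulary = sorted({token for doc in tokenized for token in doc})
    vectors = []
    for doc in tokenized:
        counts = Counter(doc)
        vectors.append([counts[token] for token in vocabulary])
    return vocabulary, vectors


# Hypothetical student answers.
answers = ["my answer is 1/2", "i think the answer is 0.5"]
vocabulary, vectors = bag_of_words(answers)
# vocabulary -> ['0.5', '1/2', 'answer', 'i', 'is', 'my', 'the', 'think']
# vectors[0] -> [0, 1, 1, 0, 1, 1, 0, 0]
```

Each answer becomes a vector whose length equals the vocabulary size, which is easy to inspect but carries no notion of word order or meaning.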


2.4 Pre-Trained Models 


Embeddings are only as powerful as the data they are trained
on. Not all researchers have robust corpora, and embeddings
trained on small corpora can be misleading. Projects such as
GloVe and Word2Vec publish pre-trained embeddings generated
from Wikipedia and GoogleNews. As these pre-trained
word embeddings have grown in popularity, word embeddings
have expanded to utilize bidirectional encoder
representations from transformers (referred to as BERT [5]) to
create pre-trained word embeddings as well. Similarly, this
has evolved from word level embeddings to pre-trained
sentence and document level embeddings [23][3][16].


2.5 Fairness 

When it comes to linguistics, the way someone speaks and
articulates can be unique to them. Similarly, the way
someone writes is personal to them and specific to their
topic. So when algorithms are pre-trained on data which
isn't the researcher's own data, there are questions to be
asked. Research [2] has been able to identify some
potentially harmful semantic relationships present
in common pre-trained word embeddings. For instance, [2]
was able to identify that Google's pre-trained Word2Vec on
GoogleNews elicited some harmful stereotypes. Therein lies
the important question: if we omit variables which could
cause unfairness in the automated scoring, are we continuing
to avoid unfairness if we utilize pre-trained word
embeddings? To identify this, we utilized Absolute
Between-ROC Area (ABROCA) [8].
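ABROCA is commonly described as the area between the ROC curves of two groups, i.e. the integral over the false positive rate of the absolute difference between the two curves. The sketch below approximates that area on a fixed grid; it assumes binary labels, and the function names and the numerical integration scheme are illustrative choices rather than the implementation used in this study.

```python
def roc_points(y_true, scores):
    """ROC curve as (fpr, tpr) lists, from (0, 0) to (1, 1), with tied scores handled together."""
    positives = sum(y_true)
    negatives = len(y_true) - positives
    if positives == 0 or negatives == 0:
        raise ValueError("each group needs both positive and negative labels")
    order = sorted(range(len(scores)), key=lambda i: -scores[i])
    fpr, tpr = [0.0], [0.0]
    tp = fp = 0
    i = 0
    while i < len(order):
        current = scores[order[i]]
        while i < len(order) and scores[order[i]] == current:
            if y_true[order[i]] == 1:
                tp += 1
            else:
                fp += 1
            i += 1
        fpr.append(fp / negatives)
        tpr.append(tp / positives)
    return fpr, tpr


def roc_at(x, fpr, tpr):
    """True positive rate at false positive rate x (upper value at vertical jumps)."""
    best = 0.0
    for k in range(1, len(fpr)):
        x0, x1 = fpr[k - 1], fpr[k]
        if x0 <= x <= x1:
            if x1 == x0:
                y = max(tpr[k - 1], tpr[k])
            else:
                y = tpr[k - 1] + (tpr[k] - tpr[k - 1]) * (x - x0) / (x1 - x0)
            best = max(best, y)
    return best


def abroca(y_a, scores_a, y_b, scores_b, n_grid=1000):
    """Trapezoidal approximation of the area between the two groups' ROC curves."""
    fpr_a, tpr_a = roc_points(y_a, scores_a)
    fpr_b, tpr_b = roc_points(y_b, scores_b)
    grid = [k / n_grid for k in range(n_grid + 1)]
    gaps = [abs(roc_at(x, fpr_a, tpr_a) - roc_at(x, fpr_b, tpr_b)) for x in grid]
    return sum((gaps[k] + gaps[k + 1]) / 2 for k in range(n_grid)) / n_grid


# Hypothetical groups: the model ranks Group A perfectly and Group B in reverse.
labels = [0, 0, 1, 1]
same = abroca(labels, [0.1, 0.2, 0.8, 0.9], labels, [0.1, 0.2, 0.8, 0.9])
worst = abroca(labels, [0.1, 0.2, 0.8, 0.9], labels, [0.9, 0.8, 0.2, 0.1])
```

A value of 0 means the two ROC curves coincide, while values approaching 1 mean the model separates correct from incorrect answers very differently for the two groups.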


3. STUDY 1: SIMULATION STUDY 


This research developed a simulated study to identify
whether pre-trained word embeddings, when utilized within
an automated scoring model for open response answers,
influence the model to make unfair predictions. For
example, if one group of students state their answer with a
fraction and surrounding text, does the predictive model
generate scores similarly for students who use decimals
along with surrounding text? Through this simulated study,
we are able to gain a deeper insight into what unfair
scoring, if any, occurs between groups when utilizing the
pre-trained GloVe word embeddings trained on Wikipedia.


There are 3 studies within this simulated study to help
achieve this goal. First, we develop sets of answers with
differing distributions of fractions and decimals and
generate the ABROCA value at each distribution. Second, we
attempt to see if decimals and fractions alone generate
differing ABROCA values. Third, we attempt to see whether,
if we replace decimals in the text with unknown tokens (more
reliance on distractor words), the ABROCA values differ at
differing distributions. These studies will help provide
deeper insight into the potential unfairness an automated
scoring model can produce when utilizing pre-trained word
embeddings.


3.1 Data Generation 

At the foundation of this simulated study is the generation of
the student dataset. The generation was split into two facets,
the training set student answers and the test set student
answers. This was performed such that the model would not
have any identical answers between the training set and the
test set. Essentially, this ensures that predictions aren't
being made because the model has already seen that exact
series of embeddings associated with a certain score.


Towards simulating authentic student answers, the generation
of the corpus was founded on the goal of utilizing
random selection. For the training set, as Table 2 shows
(see Appendix A), there are 4 different answer lengths in
this corpus: 6, 5, 4 and 3 words. The generation of the
student answers can be summarized in 4 steps and visualized
with Table 2 (see Appendix A). First, select whether it will
be a student answer which uses decimals or fractions. Then,
randomly select what length the answer is. Once a length
is randomly selected, another random selection is made
between the two structures (i.e. ‘Structure’ within Table 2 in
Appendix A). Finally, randomly select text from Fill “1” and
Fill “2” Fractions or Fill “2” Decimal to fill the identifiers ‘1’
and ‘2’.
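The four generation steps can be sketched as follows. The structures and fill lists here are hypothetical stand-ins for Table 2 in Appendix A (only two of the four answer lengths are shown), so this illustrates the procedure rather than reproducing the actual corpus.

```python
import random

# Hypothetical stand-ins for the 'Structure' and 'Fill' columns of Table 2.
STRUCTURES = {
    3: ["{1} {2}", "{2} {1}"],
    4: ["{1} {2}", "{2} {1}"],
}
FILL_1 = {
    3: ["i picked", "i got"],
    4: ["my answer is", "i worked out"],
}
FILL_2 = {
    "fraction": ["1/2", "3/4"],
    "decimal": ["0.5", "0.75"],
}


def generate_answer(rng):
    """Generate one simulated answer; returns (number_format, answer_text)."""
    number_format = rng.choice(["fraction", "decimal"])  # step 1: fractions or decimals
    length = rng.choice(sorted(STRUCTURES))              # step 2: answer length in words
    structure = rng.choice(STRUCTURES[length])           # step 3: one of the two structures
    phrase = rng.choice(FILL_1[length])                  # step 4: fill identifiers '1' and '2'
    number = rng.choice(FILL_2[number_format])
    return number_format, structure.replace("{1}", phrase).replace("{2}", number)


rng = random.Random(0)  # fixed seed so the sketch is reproducible
sample = [generate_answer(rng) for _ in range(5)]
```

Because every choice is drawn at random, repeated calls yield a mix of formats, lengths, and structures.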


This is the same approach that is utilized within the test set
corpus generation as well. However, Table 3 (see Appendix A)
shows that there are different structures and phrases to
select from than in training. This allows for variability
between test and training, guaranteeing that an answer
structure used in training isn't the same as in the test.
From this, we can ascertain that the model's predictions were
not based only on identical phrases it has seen.


3.2 Methodology 


This study sets out to identify if an automated scoring model
for open response questions, which utilizes pre-trained word
embeddings, elicits unfair scoring. With the answers
generated, we set out to sample these simulated answers such
that there is a balance of student answers which utilize
fractions and decimals. The training set is comprised of
student answers drawn from the pool of simulated answers
containing fractions (considered as Group A), as well as
answers sampled from similar student answers containing
fractions and decimals as determined by a defined proportion
threshold (considered as Group B).


A threshold was set for selecting decimals and fractions to
control the balance of answers. This lends itself to our goal
of being able to identify whether or not the format in which
a student writes an answer, i.e. using fractions vs. decimals,
affects our ability to score student open response answers.
Thus, with ABROCA, fairness can be identified at each threshold.


For the test set, a similar approach is taken. The training
and test sets therefore both have Group A answers, which
consist of distractor words and fractions, and Group B
answers, which consist of distractor words and a proportion
of fractions/decimals (based on the same threshold for both
training and test data).
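One way to read the threshold-based construction of Group B is sketched below: a chosen proportion of the group's answers is drawn from the decimal pool and the remainder from the fraction pool. The function name, pool contents, and rounding rule are assumptions for illustration.

```python
import random


def sample_group_b(fraction_answers, decimal_answers, decimal_proportion, size, rng):
    """Draw `size` answers, of which roughly `decimal_proportion` use decimals."""
    n_decimal = round(decimal_proportion * size)
    n_fraction = size - n_decimal
    group = rng.sample(decimal_answers, n_decimal) + rng.sample(fraction_answers, n_fraction)
    rng.shuffle(group)
    return group


# Hypothetical pools of simulated answers.
fraction_pool = [f"my answer is {n}/10" for n in range(1, 11)]
decimal_pool = [f"my answer is 0.{n}" for n in range(1, 10)] + ["my answer is 1.0"]

rng = random.Random(0)  # fixed seed so the sketch is reproducible
group_a = rng.sample(fraction_pool, 10)  # Group A: fractions only
group_b = sample_group_b(fraction_pool, decimal_pool, decimal_proportion=0.3, size=10, rng=rng)
n_decimals_in_b = sum("/" not in answer for answer in group_b)
```

Sweeping `decimal_proportion` over a range of values and recomputing ABROCA at each step would mirror the incremental thresholds described above.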


To improve the reliability of the results, we re-sample the
test dataset 10 times and evaluate the model's ability to
score an open response answer. This form of cross validation
allows us to see whether the ability to predict the score
held only for that unique set of words or whether the
performance was consistent across multiple iterations.


All of the studies incorporate a Long Short Term Memory
(LSTM) [12] model which utilizes the pre-trained word
embeddings to automatically score open response answers,
after which ABROCA is calculated. This is then used to run 3
studies. First, when incrementally increasing the decimals
being used in Group B, does the LSTM scoring model become
more unfair? Second, are fractions or decimals themselves
the culprit of the potential unfairness in the automated
scoring model, as tested by having answers be just fractions
or decimals without distractor words? Third, when
incrementally increasing the unrecognized words (referred to
as gibberish) being used in Group B, does the LSTM scoring
model become more unfair? That is, does an imbalance in
recognized words cause more unfairness between groups?


3.3 Results 

In the end, Figure 1a showed that in the simulated study,
when there is an increase in the proportion of decimals in
Group B, there does not appear to be unfairness in the way
Group A and Group B are evaluated. This is evident from
the scale of the y-axis of Figure 1a: the ABROCA falls
between 0.02 and around 0.04.


The second study, in which there were only decimals and
fractions in the test set (no distractor words), stayed
constant at an ABROCA value of 0. The model was not unfair;
it was able to evaluate both groups equally even with just
isolated fractions and decimals.


The final simulation study showed that increasing
the imbalance between recognized and unrecognized tokens
between groups increased the unfairness (ABROCA near
0.18). Figure 1b shows that the ABROCA score does indeed
increase with more words that are unrecognizable within
GloVe's pre-trained word embeddings. It should be noted that
Table 1 shows some of the phrases used in the generated
student answers were commonly associated with more correct
answers.


Table 1: Sample of Phrases and Their Associated Avg. Score

Generated Phrases        Avg. Score

my answer is             0.718750
i picked                 0.622222
i guess the answer is    0.600000
i think it is            0.600000
i think the answer is    0.590909
i worked out             0.585366

In the end, these simulated studies indicated that the largest
risk for unfairness exists when there is differential coverage
of answer-related tokens within applied methods utilizing
pre-trained NLP embedding methods. So when answers consist
of equally recognizable words within GloVe's pre-trained
word embeddings, there's unlikely to be unfairness in the
grading. Within this simulated study, there wasn't evidence
that the inherent bias built into the pre-trained word
embeddings elicited more unfair scoring of student answers.
But if there are unbalanced recognizable words and
tokens in the student answers, attention needs to be paid to
potential unfairness in the automated scoring.
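The notion of differential token coverage can be made concrete with a small check that compares, per group, the share of tokens found in an embedding vocabulary. The vocabulary and answers below are hypothetical; a real check would use the token list shipped with the pre-trained embeddings.

```python
def token_coverage(answers, vocabulary):
    """Fraction of whitespace-separated tokens that appear in the embedding vocabulary."""
    tokens = [token for answer in answers for token in answer.lower().split()]
    if not tokens:
        return 0.0
    return sum(token in vocabulary for token in tokens) / len(tokens)


# Hypothetical vocabulary standing in for the pre-trained embedding's token list.
vocabulary = {"my", "answer", "is", "i", "picked", "1/2", "3/4"}

group_a = ["my answer is 1/2", "i picked 3/4"]
group_b = ["my answer is xqzt", "i picked wvvb"]  # gibberish tokens are out of vocabulary

coverage_a = token_coverage(group_a, vocabulary)
coverage_b = token_coverage(group_b, vocabulary)
coverage_gap = coverage_a - coverage_b
```

A large gap between groups is the situation in which the simulated studies found the highest ABROCA values, so it may be a useful quantity to inspect before deploying a scoring model.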


4. STUDY 2: FAIRNESS IN REAL CONTEXTS 


While a simulation study is powerful on its own, it is diffi- 
cult to recreate authentic student data. For the final over- 
all study of this research, we look to once again utilize 
ABROCA to identify if our own algorithm, trained on gen- 
uine student open response answers within ASSISTments, is 
unfair in its grading of women and men. 


4.1 Data 


The data consists of 141,612 graded authentic student open
response answers across 2,042 unique problems. A total of
25,069 unique students answered, and 891 teachers graded
those answers. Scoring was performed on a 5 point scale,
where a 4 is a perfect score.
Figure 1 (a) Study 1: ABROCA Values at Incremental
Fraction/Decimal Thresholds


It should be noted that, to be able to perform the fairness
analysis using ABROCA, gender was inferred. This was
performed by cross checking names with the census data. If
the name was found only on the women's or only on the men's
list, it was labeled as such. If a name fell into multiple
genders, it was labeled as unknown and excluded from this
analysis.
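The labeling rule described above can be sketched as a pair of set lookups. The name lists are hypothetical placeholders for the census lists, and treating names that appear on neither list as unknown is an assumption, since the text only specifies the case of names on both lists.

```python
def infer_gender(first_name, female_names, male_names):
    """Label a name only when it appears on exactly one of the two lists."""
    name = first_name.strip().lower()
    on_female_list = name in female_names
    on_male_list = name in male_names
    if on_female_list and not on_male_list:
        return "female"
    if on_male_list and not on_female_list:
        return "male"
    # On both lists, or (by assumption) on neither: excluded from the analysis.
    return "unknown"


# Hypothetical name lists.
female_names = {"maria", "jordan"}
male_names = {"james", "jordan"}

labels = [infer_gender(n, female_names, male_names) for n in ["Maria", "James", "Jordan", "Zed"]]
# labels -> ['female', 'male', 'unknown', 'unknown']
```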


4.2 Methodology and Results 


Towards developing our predictions, we utilized another
pre-trained algorithm, mentioned earlier, called SBERT. This
is a pre-trained sentence embedding algorithm which allowed
us to generate a single vector representation of each student
answer. We then utilize the Canberra distance to identify
which student answers are the most similar. The score of the
most similar answer is the score we assign. This
approach managed to outperform our previous models [6].
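The nearest-answer scoring step can be sketched as below. The Canberra distance between vectors x and y is the sum over dimensions of |x_i - y_i| / (|x_i| + |y_i|), with terms whose denominator is zero contributing zero. The short vectors stand in for SBERT sentence embeddings and are hypothetical, as are the function names.

```python
def canberra(x, y):
    """Canberra distance; dimensions where both values are zero contribute nothing."""
    total = 0.0
    for a, b in zip(x, y):
        denominator = abs(a) + abs(b)
        if denominator > 0:
            total += abs(a - b) / denominator
    return total


def predict_score(new_vector, graded_answers):
    """Return the score of the graded answer closest to `new_vector`."""
    _, nearest_score = min(graded_answers, key=lambda pair: canberra(new_vector, pair[0]))
    return nearest_score


# Hypothetical 2-dimensional stand-ins for sentence embeddings, with teacher scores.
graded_answers = [([1.0, 0.0], 4), ([0.0, 1.0], 1)]
predicted = predict_score([0.9, 0.1], graded_answers)
# predicted -> 4, because [0.9, 0.1] is closer to [1.0, 0.0] than to [0.0, 1.0]
```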


We once again apply ABROCA to identify potential
unfairness in our algorithm. We were able to show that our
SBERT model with Canberra distance manages to fairly score
both male and female student open response answers. Our
model managed an ABROCA of 0.007, which is quite small,
suggesting that our algorithm is indeed scoring fairly
across these groups.


5. LIMITATIONS AND FUTURE WORK 


While there were indications of unfairness in cases where
there were unbalanced identifiable tokens within the student
open response answers, this analysis is limited to middle
school mathematics. This type of analysis would need to be
applied to additional datasets to get a broader understanding
of the potential unfairness in other subjects and age ranges.
In terms of our analysis of our SBERT model for scoring 
student open response answers, while there wasn’t unfair- 
ness identified, more work needs to be done to explore the 
embeddings themselves. Pre-trained word embeddings have 
been shown to have bias built in, but what bias is present 
in the pre-trained sentence embeddings? This is a question 
we look to explore further. 


6. CONCLUSION 


Overall, this study set out to run a simulated study to
help identify potential unfairness within models utilizing
pre-trained word embeddings. While there is bias present
in the embeddings themselves, our simulated study didn't
show this bias causing unfair scoring. However, our analysis
did show that when developing models with pre-trained
embeddings, unfairness can begin to occur when there is an
imbalance of recognized tokens in the student answers. More
specifically, our simulated study showed that when groups
within the data use differing levels of recognized tokens,
the chance of unfair scoring increases.

Figure 1 (b) Study 3: ABROCA Values at Incremental
Fraction/Unknown Words Thresholds


While our simulated study showed how unfairness can present
itself within a scoring model, our model on authentic student
data did not show this unfairness. We were able to conduct
an analysis of our model with ABROCA to compare our
scoring performance for students identified as male and female.


In the end, we were able to utilize a simulated study to help
identify potential unfairness in automated scoring models
which utilize pre-trained word embeddings. It has been widely
noted that those embeddings have bias built in, but our
simulated study couldn't show unfairness in the scoring of
differing groups of simulated student answers. However, this
study did suggest that there is a notable risk to fairness in
cases where there are differences in the number of words that
are unrecognized by pre-trained models across populations.
ABSTRACT


Executive functions (EF) are a set of psychological constructs 
defined as goal-directed cognitive processes. Traditional EF tests 
are reliable, but they are not able to detect EF in real-time. They 
cause a test effect if implemented multiple times. In contrast, 
learning games have the potential to obtain a real-time, unobtrusive 
measurement of EF. In this study, we analyzed log data collected 
from a game designed to train the EF sub-skill of shifting. We 
engineered theory-based game-level and level-specific features 
from log data. Using these features, we built prediction models with 
students’ accuracy and reaction time during play to predict their 
standard measure of the EF shifting skill during the post-test and 
delayed post-test as well as to predict learning gains. Our model 
that predicts the post score has a correlation of 0.322 and that for 
the delayed post score is 0.303. The findings suggest that theory- 
based feature engineering and varying levels of granularity are two 
promising directions for cognitive skills prediction under the goal 
of game-based assessment. Also, accuracy, reaction time, and 
player progression are important features. 


Keywords 


Prediction, Game-Based Assessment, Learning Games, Executive 
Functions, Cognitive Skills. 


1. INTRODUCTION 


Executive functions (EF) are defined as “cognitive processes used 
for effortful, controlled, and goal-directed thinking and behavior” 
[29, 3, 4]. The unity/diversity model [24] views EF as consisting of
related yet separable skills, which include updating, shifting (also 
termed cognitive flexibility), and inhibition. EF plays an important 
role in cognitive development and is associated with academic 
success [6], metacognitive skills [7], science learning [15], and 
language acquisition skills [10]. 


Game-based assessments allow educators to assess students’ 
learning while they are playing a game and thus in a manner that 
can be highly efficient, fast, and entirely unobtrusive. Using games 
as assessments creates a context in which learners are likely to be 
highly engaged, which may optimally reflect their abilities [16, 28]. 
Using log data from digital games to evaluate learning is sometimes
referred to as a “stealth assessment” [20] and has been used in the
past decade to assess complex skills, such as creativity [33] and
problem-solving [34]. Log data collected during
gameplay provides a record of student behaviors associated with
EF and can be used for the prediction of EF [25]; however, is it
possible to use log data collected from a game designed to train EF
to measure EF and to develop a framework for game-based
assessment?


Past studies of game-based assessments have focused on complex 
thinking skills, such as problem-solving [34]; however, there are 
constraints of game-based assessments of EF. First, it is necessary 
to determine the granularity or time scale for which we can detect 
students’ EF in log data. Second, we need to separate log data 
related to EF training from log data related to other aspects of play 
to achieve a high performance of models. Third, we need to 
generate theory-based features relevant to EF skills. Accuracy and 
reaction time have been identified as indicators of EF [8]. 


This paper aims to provide a proof of concept for game-based
assessments of the EF sub-skill of shifting. Shifting is one 
dimension of EF defined as the ability to switch attention between 
different “tasks, operations, or mental sets” [21]. The research 
questions include: 


1. How do students’ gameplay data predict their executive 
functions during a post-test and a delayed post-test? 


2. Which features, including accuracy or reaction time, are 
important for predicting EF in games? 


2. RELATED WORK 


2.1 Games for EF Training and Measurement

Sustained and active engagement is widely thought to be critical for 
cognitive skills training games to be effective [2]. Incorporating 
gamified design features is one well-established mechanism for 
promoting meaningful engagement [9]. 


Digital training games can not only enhance EF [22, 27] but can 
also be used as a reliable means for measuring EF and other 
cognitive skills. For example, past work has examined the design 
and validation of computerized tools for measuring working 
memory capacity [21]. Previous research has also examined the use 
of a digital game for the detection of executive functions validated 
by a task for medical purposes in older adults using computational 
modeling [12]. Further work is needed to validate game-based 
measures of cognitive skills [35], especially those that are sensitive 
enough to detect variations among neurotypical individuals and that 
are appropriate for child and adolescent populations. 
2.2 Game-Based Assessment


In previous research, the analysis of log data as a means of a 
formative assessment has yielded promising findings [11] and has 
been used for predicting a variety of cognitive and behavioral 
constructs, including quitting [19], knowledge [1], computational 
thinking [30], persistence [26], and implicit learning [31]. 


Evidence-Centered Design (ECD) [23] has been used effectively to 
develop game-based assessments in contexts that teach specific 
knowledge domains [14]. According to ECD, an assessment 
framework should take these models into account: 


• Task model: Which actions is the learner taking within
  the system?

• Evidence model: Which features (e.g., from log data) can
  be used as evidence of learner actions?

• Competency model: How are these features associated
  with a set of standards or criteria that demonstrate
  effective learning has taken place?


Accounting for these three ECD models is helpful for feature 
engineering and predictive modeling. In game-based learning 
contexts that teach knowledge and skills, connecting a task model 
to an evidence model should be relatively straightforward given 
that the log data provides a detailed record of the learner’s action 
sequence. Unlike game-based assessments of knowledge domains, 
where standards are clearly defined and may be validated by an 
expert review of content, in cognitive skills training game-based 
contexts, further work must be done to align an evidence model 
with a competency model. Accuracy and reaction time have been 
identified as two major aspects of an EF measurement [24], with 
evidence suggesting they each contribute uniquely to EF
performance among children [8] and adolescents [5]. Yet, the way 
to distinguish the nuanced forms of accuracy and reaction time at 
varying levels of granularity and the way to combine them for 
predictive modeling within a game are currently unclear. 


3. DATASET 


3.1 Game Design 

All You Can E.T. (AYCET) [17] is a game that trains the EF sub- 
skill of shifting. Its early prototype, The Alien Game, has been 
shown to improve EF after 1.5 hours of play for high school 
students [16] and two hours for college students [27]. In the current 
study, we used the “hot” version of AYCET, a version that 
maximizes the playfulness of the game. As Figure 1 shows, a player 
is asked to feed aliens with the appropriate food based on a certain 
rule. The rule changes multiple times at each level, thereby
requiring the player to shift. As the player progresses in the game, 
the rule becomes more complicated. 


Figure 1. Feeding aliens and instructions for a rule. 


3.2 Participants 

Participants were recruited from three middle schools and two high
schools in urban school districts in the Northeastern United States.
They completed the study during non-instructional time at their
schools. Among the 448 students who consented, 137 students were
strategically randomly assigned to one of the three conditions to
play AYCET throughout the study. Of those, 56 were removed
because they demonstrated off-task behaviors, and thus the log data
could not reflect their true ability. This resulted in an analytic
sample of 81. Details of participant removal are discussed in the
Data Cleaning subsection.


The 81 participants (mean age = 13.9 years, SD = 1.6, 46.1% female)
included 39 in grade 7, 18 in grade 8, and 21 in grade 9. They 
reported a culturally and linguistically diverse background. Among 
them, 51.3% reported speaking Spanish at home, while 47.4% 
reported English and 1.3% Mandarin. As for ethnicity, 78.2% were 
Hispanic/Latino, 1.3% were Asian, 17.8% reported two or more 
ethnicities, 1.3% reported another ethnicity but did not specify, and 
1.3% did not know. A few participants did not report their 
demographic information. 


3.3 Study Procedure and Data Collection 


The four-week intervention was conducted at the participating 
schools. Before gameplay, students completed a pretest. Then, they 
played the game for four sessions, each of which took about 30-40 
minutes. The cumulative amount of time of play was 2-3 hours. 
After play, students completed a post-test, and an additional 4-8 
weeks later, they completed a delayed post-test. The EF sub-skill 
of shifting was measured by the Dimensional Change Card Sorting 
(DCCS) task [36] in the pretest, post-test, and delayed post-test. 


The log data consisted of 144,187 data points or actions, recording 
whether students fed each target (“alien”) correctly or not and the 
reaction time for each target. This means that each alien required 
one action from the student. In this study, students played levels 1- 
30. There are 30-80 aliens per level. 
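To illustrate how such action-level log data could be summarized, the sketch below aggregates per-level accuracy and mean reaction time. The record layout (level, correct, reaction time in milliseconds) and the toy values are assumptions; the actual log schema and engineered features are not shown in this section.

```python
from collections import defaultdict


def level_features(log):
    """Aggregate one action per alien into per-level accuracy and mean reaction time."""
    actions_by_level = defaultdict(list)
    for level, correct, reaction_time_ms in log:
        actions_by_level[level].append((correct, reaction_time_ms))
    features = {}
    for level, actions in actions_by_level.items():
        n_actions = len(actions)
        features[level] = {
            "accuracy": sum(correct for correct, _ in actions) / n_actions,
            "mean_rt_ms": sum(rt for _, rt in actions) / n_actions,
        }
    return features


# Hypothetical log records: (level, fed correctly?, reaction time in ms).
log = [
    (1, True, 800),
    (1, False, 1200),
    (2, True, 600),
    (2, True, 700),
]
features = level_features(log)
# features[1] -> {'accuracy': 0.5, 'mean_rt_ms': 1000.0}
```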


In this study, each session began a few levels back from the last 
level played. After a few sessions, students were mandatorily 
pushed to level 11 to ensure they had enough time to play more 
difficult levels. This affected 72% of the students who were at level 
9 or lower at that point. On average, they were pushed forward by 4.3
levels.


3.4 DCCS Test and Score 

The DCCS task [36] was used to measure the EF sub-skill of 
shifting in the pretest, post-test, and delayed post-test. Scoring was 
based on the National Institutes of Health (NIH) scoring procedure
[37]. This is a combination of the accuracy score and the reaction 
time score. The score ranges from 0 to 10. Floor or ceiling effects 
were not observed with our participants, as the top 25% of pretest 
scores ranged between 7.78 and 9.36. 


4. METHOD 
4.1 Data Cleaning 


Based on the researchers’ observations, we removed 33 participants 
for the following reasons: (1) did not complete one of the DCCS 
tests, (2) were off-task during the DCCS test, or (3) experienced 
technical difficulties that would affect their performance in the 
DCCS test. Furthermore, 23 participants were removed due to off- 
task behaviors (e.g., sleeping, non-stop talking, etc.) or an absence 
for at least one intervention session. 


Eighty-one students remained in the analytic sample. The retention 
rate was 59.1%, which is acceptable for two reasons. First, data 
collected in the classroom setting are usually messier than those in 
the lab setting. During the study, a few participants went to the 
bathroom for a long time, which would be less likely to happen in 
a lab setting. Second, the game’s focus on training EF required a 
degree of attention that some students were not willing to invest. 
Some students found it difficult to remain attentive for an extended 
period of time. 


4.2 Labels 


Table 1 lists the labels for prediction. The post score and delayed 
post score were directly measured by the DCCS test. We next 
calculated the post-learning gain and delayed post-learning gain. 


Table 1. Labels 

Name                        Description 
post score                  The EF score for the post-test. 
post-learning gain          Relative gain of the EF score for the post-test 
                            compared with the pretest. Based on Hake’s 
                            formula of learning gain [13], it is calculated 
                            as (post score - pre score)/(10 - pre score) 
                            because the EF score ranges from 0 to 10. 
delayed post score          The EF score for the delayed post-test. 
delayed post-learning gain  Relative gain of the EF score in the delayed 
                            post-test compared with the pretest. 
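
The gain labels in Table 1 can be computed as in the following minimal sketch. The function name and the example scores are hypothetical. The only thing taken from the text is the formula (post score - pre score)/(10 - pre score).

```python
def normalized_gain(pre_score: float, post_score: float, max_score: float = 10.0) -> float:
    """Hake-style relative gain: (post - pre) / (max - pre)."""
    if pre_score >= max_score:
        # The denominator is zero (or negative) when the pretest is at the ceiling.
        raise ValueError("gain is undefined when the pretest score is at the maximum")
    return (post_score - pre_score) / (max_score - pre_score)


# Hypothetical student: pretest 6.0, post-test 7.0, delayed post-test 8.0.
post_gain = normalized_gain(6.0, 7.0)      # (7 - 6) / (10 - 6)
delayed_gain = normalized_gain(6.0, 8.0)   # (8 - 6) / (10 - 6)
```

Note that the gain is negative whenever a student scores lower on the later test than on the pretest.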


4.3 Feature Engineering 

We generated 20 game-level features and five level-specific 
features for each level that indicated student performance and 
progress. They capture information related to accuracy and reaction 
time in various mathematical formats and granularities. 


Level-specific features were features for a single level. They 
included the average reaction time, the standard deviation of 
reaction time, accuracy, the number of correct hits (i.e., an action 
of feeding an alien with the correct food), and the number of wrong 
hits (i.e., an action of feeding an alien with the wrong food) across 
all aliens in a single level. Accuracy was calculated as the number 
of correct hits divided by the total number of aliens in a level. 


Game-level features were aggregated features across all levels. 
They included: (1) the average, maximum, minimum, range, and 
standard deviation of a student’s accuracy across all levels after 
calculating the accuracy for a single level across all aliens in that 
level; (2) the average, maximum, minimum, range, and standard 
deviation of a student’s reaction time across all levels after taking 
the average reaction time for a single level across all aliens in that 
level; (3) the total number of correct hits (82% of all aliens among 
all students), wrong hits (16%), and missed hits (i.e., aliens that 
the student did not feed) (1%); (4) the highest number of 
stars a student received across all levels and the total number of 
stars a student received in the game; (5) the number of levels a 
student skipped by choice (which only happened before level 10) 
and due to the mandatory push; and (6) the highest level and the 
total number of levels a student played (as a student may skip a few 
levels). 
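
As an illustration of the two granularities, the sketch below derives the five level-specific features and a subset of the game-level aggregates from a toy log for one student. The log layout, the feature names, the use of the population standard deviation, and the treatment of missed aliens as having no reaction time are all assumptions made for this example. They are not the format of the actual game logs.

```python
from collections import defaultdict
from statistics import mean, pstdev

# Hypothetical log rows for one student: (level, outcome, reaction_time_s).
# One row per alien; outcome is "correct", "wrong", or "missed".
# Assumption: a missed alien has no reaction time (None).
log = [
    (1, "correct", 0.8), (1, "correct", 1.0), (1, "wrong", 1.2), (1, "missed", None),
    (2, "correct", 0.6), (2, "correct", 0.8),
]


def level_features(rows):
    """Return the five level-specific features for every level in the log."""
    by_level = defaultdict(list)
    for level, outcome, rt in rows:
        by_level[level].append((outcome, rt))

    features = {}
    for level, actions in by_level.items():
        rts = [rt for _, rt in actions if rt is not None]
        n_correct = sum(1 for outcome, _ in actions if outcome == "correct")
        n_wrong = sum(1 for outcome, _ in actions if outcome == "wrong")
        features[level] = {
            "avg_rt": mean(rts) if rts else None,
            "std_rt": pstdev(rts) if rts else None,  # population SD is an assumption
            "accuracy": n_correct / len(actions),    # correct hits / aliens in the level
            "num_correct": n_correct,
            "num_wrong": n_wrong,
        }
    return features


def summarize(values, name):
    """Average, maximum, minimum, range, and SD of per-level values."""
    return {
        f"avg_{name}": mean(values),
        f"max_{name}": max(values),
        f"min_{name}": min(values),
        f"range_{name}": max(values) - min(values),
        f"std_{name}": pstdev(values),
    }


def game_features(per_level):
    """Aggregate the per-level accuracy and average reaction time across levels."""
    accuracies = [f["accuracy"] for f in per_level.values()]
    avg_rts = [f["avg_rt"] for f in per_level.values() if f["avg_rt"] is not None]
    out = {}
    out.update(summarize(accuracies, "level_accuracy"))
    out.update(summarize(avg_rts, "level_avg_rt"))
    out["total_correct"] = sum(f["num_correct"] for f in per_level.values())
    out["total_wrong"] = sum(f["num_wrong"] for f in per_level.values())
    return out


per_level = level_features(log)
game = game_features(per_level)
```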


4.4 Model Training 


We used linear regression for predictive modelling in 
RapidMiner 9.3. We evaluated the model’s performance using ten- 
fold cross-validation at the student level to ensure the model would 
be generalizable to a new student population. During this process, 
students were randomly split into 10 groups. For each possible 
combination, we used forward selection to select features and then 
built the model based on the training data. Forward selection was 
an iterative process. First, a single-feature model that would 
achieve the highest Pearson correlation was chosen. Next, the 
remaining features were subsequently added one-by-one to the 
model if they could appreciably improve the model goodness of fit. 
In addition, to avoid collinearity, we set the minimum tolerance for 
eliminating collinear features as 0.05 and set “eliminate collinear 
features” as true in the linear regression operator. 
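
The student-level split can be sketched as follows. This is an illustration in Python rather than the RapidMiner process used in the study, and the function name, the seed, and the integer student identifiers are hypothetical. Feature selection and model fitting would be run on the training students of each fold only, so that no data from a held-out student informs the model that is evaluated on that student.

```python
import random


def student_level_folds(student_ids, k=10, seed=0):
    """Yield (train_ids, test_ids) pairs so that each student is held out exactly once."""
    ids = sorted(set(student_ids))
    rng = random.Random(seed)
    rng.shuffle(ids)
    # Deal the shuffled students into k groups of nearly equal size.
    groups = [ids[i::k] for i in range(k)]
    for held_out in groups:
        test = set(held_out)
        train = [s for s in ids if s not in test]
        yield train, sorted(test)


# 81 hypothetical student identifiers, matching the size of the analytic sample.
folds = list(student_level_folds(range(81), k=10))
```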


In addition, we explored different combinations of features for 
feature selection. Missing values existed for many level-specific 
features. Thus, we began with the first feature set containing all 
game-level features and level-specific features of the level with the 
smallest missing value rate. After that, in each round, we added 
level-specific features of another level based on the ranking of the 
missing value rate. We stopped doing so at a level that contained 
missing data for 16% of students. The last model contained 65 
features. In this way, we controlled for over-fitting and ensured the 
models were trained on representative levels. 


5. RESULTS 


5.1 Intervention Effect 

The paired samples t-test shows that the post score (mean = 6.98, 
SD = 1.56) is significantly higher than the pretest score (mean = 
6.31, SD = 2.12) (t(80) = 3.01, p < 0.01, Cohen’s d = 0.34). Also, 
the delayed post score (mean = 7.16, SD = 1.33) is significantly 
higher than the pretest score (t(80) = 3.90, p < 0.001, Cohen’s d = 
0.43). The boxplot of three EF scores is shown in Figure 2. 


[Figure 2: boxplots of the EF shifting score (y-axis) for the pretest, posttest, and delayed posttest (x-axis)] 


Figure 2. Boxplot of three EF shifting scores. 
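
The paired-samples statistics in Section 5.1 can be computed from raw scores along the following lines. This is a minimal sketch, not the analysis script used in the study. The function name and the toy scores are hypothetical, and the effect size shown is d_z (the mean difference divided by the standard deviation of the differences), which is one common convention for paired designs. The text does not state which variant of Cohen's d was used.

```python
from math import sqrt
from statistics import mean, stdev


def paired_t_and_dz(pre, post):
    """Return (t statistic, d_z effect size, degrees of freedom) for paired scores."""
    diffs = [after - before for before, after in zip(pre, post)]
    n = len(diffs)
    d_z = mean(diffs) / stdev(diffs)   # sample SD of the paired differences
    t = d_z * sqrt(n)                  # equivalent to mean(diffs) / (sd / sqrt(n))
    return t, d_z, n - 1


# Toy scores for four hypothetical students.
pre_scores = [5.0, 6.0, 7.0, 8.0]
post_scores = [6.0, 8.0, 7.0, 9.0]
t_stat, d_z, df = paired_t_and_dz(pre_scores, post_scores)
```

Under this convention d_z = t / sqrt(n). With n = 81, the reported t values of 3.01 and 3.90 give roughly 0.33 and 0.43, which is close to the reported effect sizes.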


5.2 Correlation between Features and Labels 
Among the 65 features for model training, five features had an 
absolute value of correlation with the post score between 0.3 and 0.42. 
Eight features had an absolute value of correlation with the delayed 
post score between 0.3 and 0.45. Features were weakly correlated 
with the two learning gain labels, as the absolute values of all 
correlation coefficients are below 0.25. Missing values for each 
feature were replaced with the average value of that feature. 


5.3 Findings from Predictive Models 
Cross-validated metrics for the best models of each label and their 
features are summarized in Table 2. 


Table 2. Summary of the predictive models 

             Post Score                            Post-Learning Gain 
RMSE         1.586                                 0.682 
Correlation  0.322                                 0.294 
Selected     - 2.087 * numLevelsSkippedByChoice      0.414 * avgLevelAvgRT 
Features     - 0.008 * numWrongLevel11             + 0.065 * numLevelsSkippedByPush 
             - 0.028 * numWrongLevel12             - 0.001 * numCorrectLevel3 
                                                   + 0.004 * numCorrectLevel4 
                                                   + 0.493 * avgReactionTimeLevel2 
                                                   - 0.428 * avgReactionTimeLevel13 
Table 2 (continued). Summary of the predictive models 

             Delayed Post Score                    Delayed Post-Learning Gain 
RMSE         1.540                                 0.470 
Correlation  0.303                                 0.260 
Selected       2.189 * avgLevelAvgRT                 0.841 * avgLevelAvgRT 
Features     - 0.268 * numLevelsSkippedByPush      - 0.262 * highestLevelAvgRT 
             - 0.012 * numWrongLevel3              + 0.001 * totalWrongHits 
             + 0.006 * numCorrectLevel12           + 1.434 * avgCorrectLevel1 
             - 2.154 * avgReactionTimeLevel3       - 0.003 * numWrongLevel3 
             + 1.693 * stdReactionTimeLevel3       + 0.007 * numCorrectLevel12 
             - 2.306 * stdReactionTimeLevel12      - 0.009 * numWrongLevel12 
             - 0.528 * avgReactionTimeLevel12      - 0.611 * stdReactionTimeLevel12 


Models that used only game-level features generally performed 
poorly. The exception was the post score, for which excluding 
level-specific features did not greatly affect model goodness 
(correlation of 0.308 and RMSE of 1.589). The features selected 
in that model were numLevelsSkippedByChoice, totalWrongHits, and 
numLevelsPlayed. 


6. DISCUSSION AND CONCLUSIONS 
Playing AYCET significantly improved students’ EF. The effect 
sizes of EF gains were medium, and that for the delayed post-test 
4-6 weeks later was larger than that for the post-test. This difference 
in effect sizes may be attributed to either the long-term intervention 
effect by the EF game or students’ natural development of EF. 
Though more evidence is needed, long-term effects of cognitive 
skills training have been found [18, 32]. 


We explored the possibility of a game-based assessment of EF 
using a game originally designed to train EF. We present four linear 
regression models that use the log data to predict students’ EF score 
of shifting in the post-test, delayed post-test, and the relevant 
learning gain scores. With correlations around 0.3, these models 
achieved good performance for preliminary work. This corresponds 
to the second challenge of this study, which is to separate log data 
related to EF training from log data related to play in the game 
context. Good performance of predictive models indicates that a 
learning game is a promising tool to measure EF. 


We generated an extensive list of game-level and level-specific 
features consisting of accuracy and reaction time indicators. Both 
accuracy and reaction time features are important in predicting EF 
but are two potentially distinct dimensions of EF. Generally, at the 
game level, a moderately higher reaction time and a more consistent 
reaction time (while controlling for other factors) are positively 
associated with EF. In addition, within specific levels, a lower 
reaction time and perhaps a more consistent reaction time are 
positively associated with EF. 
As for accuracy features, both correct hits and wrong hits are 
important for predicting EF. 


In addition to accuracy and reaction time features, the number of 
levels skipped, particularly by the player, was indicative of EF. This 
means that player progression and player performance are both 
important for predicting EF. 


Most selected features are from level 3 and level 12. This may 
suggest that two key time windows, namely the moment after students 
become familiar with the game mechanics and the moment after a 
drastic change in difficulty (recall the mandatory push; see section 
3.3), best demonstrate their ability to perform shifting. Varying 
the difficulty of levels or allowing some time for students to 
reach level 12 may contribute to a better game-based assessment 
of EF. 


Responding to the first challenge, namely, the granularity and time 
scale for prediction, we found that level-specific features provide 
more promising results than game-level features alone. It is worth 
further exploring variables at the action level. 


7. IMPLICATIONS AND FUTURE WORK 


We explored the techniques of feature engineering and model 
training to investigate the game-based assessment of EF. The model 
performance is promising among studies that relate log data with a 
post-test measure in learning games [33, 34]. Another implication 
of this work is that it sets the foundation for the real-time detection of 
EF and may provide the basis for dynamic interventions. 


Limitations in the current work inspire us to explore more 
possibilities of game-based assessments for EF. First, a level may 
be played multiple times by a student. In the current study, all 
attempts of the same level were aggregated. In the future, we will 
distinguish multiple attempts of the same level by generating 
features such as the number of attempts and performance change 
over attempts. Second, we found that students’ performance one or 
two levels after a challenging level is important. This game 
mechanic of difficulty change may not apply to other games. We 
have tried to interpret the model in the context of the specific game 
design. Third, students experienced a mandatory push in this study 
(see section 3.3). This is perhaps why features for level 12 were 
selected. To examine the generalizability of our findings, we will 
compare the prediction models built under two conditions, one of 
which replicates the push and one of which does not; however, for 
practical reasons, it is also of interest to determine whether features 
that only cover earlier levels can predict the post-test and delayed 
post-test scores as this would require less game play for the 
assessment. Fourth, we filled the missing data with the average 
value of a feature. We did so by assuming data were missing at 
random. More robust methods, such as multiple imputation, could 
be used moving forward. 


In the future, we are interested in generating theory-based features 
at the action-level (i.e., alien-level) per student, hoping to allow for 
the real-time detection of EF and for an even better model. An 
action-level feature may be a student’s change in accuracy and 
reaction time within the first three aliens when the rule changes 
within a level. Another action-level feature may be the performance 
curve under different rules within a level. Both features align with 
the definition of the EF sub-skill of shifting and are not tied to 
specific levels, so they may produce more generalizable results. 
Methodologically, we are also interested in comparing the linear 
regression with other models, such as Support Vector Machines or 
the Random Forest. Substantively, it may be worth considering 
accuracy and reaction time as separate outcomes given research 
suggesting each contributes uniquely to performance on EF tasks in 
young children [9]. Further work may also apply methods of 
student modeling to other EF sub-skills, such as inhibition and 
updating in games that target these skills. 
ABSTRACT 


Over the last decade makerspaces have become more popular and 
prevalent in formal and informal learning environments. A 
finding, however, is that makerspaces are often male-dominated, 
and females can feel a sense of intimidation in the space. 
Furthermore, maker-centered learning typically adopts an open- 
ended structure which makes it difficult to identify students who 
are struggling. In this paper, we explore the use of quantitative 
data from surveys and motion sensors to potentially assist 
instructors in uncovering gender differences and promoting 
gender inclusion. Results suggest that there are different pathways 
for male and female students to thrive in makerspaces. Based on 
survey results, male students tend to have higher self-efficacy, 
resulting in more self-confidence in their abilities and more 
positive feelings. Findings from applying network analysis on the 
motion sensor data show that female students persevere more 
consistently and use empathy to form closer ties with peers for 
mutual support. These findings suggest that quantitative data 
could help instructors become aware of gender differences and 
use that information to cater to the unique learning needs of each 
group of students. Overall, this work represents preliminary steps 
in instrumenting makerspaces to promote gender inclusion and 
support maker-centered learning. 


Keywords 


Interaction Analysis, Learning Analytics / Educational Data 
Mining, Social Network Analysis, Broadening Participation, 
Gender, Making and Makerspaces, Technology-enhanced learning 


1. INTRODUCTION 


While many authors have espoused the learning benefits of 
makerspaces [5], other researchers have recognized the inherent 
difficulties of supporting student learning in makerspaces’ open- 
ended environment [13]. First, students are expected to solve 
problems independently in open-ended learning environments. 
Such independent work may lead to feelings of isolation, and 
instructors may not be aware that students are struggling. Second, 
the iterative nature of work in the makerspaces makes it difficult 
for instructors to continuously monitor students’ progress. 
Without a clear feedback system, it is challenging for instructors 
to differentiate when students need instructional support. 
However, new sensing technology (such as motion tracking) 
offers an opportunity to address some of these challenges. The 
key benefit of using motion sensors is that they can be deployed to 
monitor students’ learning in a continuous and unobtrusive 
manner. Therefore, we aim to examine how the use of quantitative 
data can help instructors overcome inherent challenges of 
assisting students in makerspaces. 


For our scope, we examine how students from different genders 
interact in makerspaces [2;8;11], and we hope to promote gender 
inclusion in makerspaces. Eventually, we hope that the use of 
quantitative data can assist instructors in identifying the right form 
of support for each diverse group of students. 


2. LITERATURE REVIEW 


Makerspaces draw learners from a diversity of disciplines and 
provide multiple entry points to participation leading to 
“innovative combinations, juxtapositions, and uses of disciplinary 
knowledge and skill” [12]. However, makerspaces situated in 
formal learning environments are often male dominated [7]. 
Hence, it is an increasing priority for makerspaces to include 
women who are underrepresented in these communities. Central 
to this goal is understanding how women interact within 
makerspace courses. While some studies have not found gender to 
be a salient factor [2], other studies have shown that women often 
report feeling intimidated and excluded [8;11]. Most studies 
conducted in this area have also relied on qualitative profiles. 
Yet, for instructors to better support women in these spaces, more 
research must be conducted on gender differences in the 
cognitive, non-cognitive and affective domains to understand 
how these differences contribute to the outcomes of empowerment 
and community-building in maker-centered learning [5]. 


In this regard, the use of quantitative data from motion sensors 
could help provide alternative insights into gender differences. 
Researchers in the field of multimodal learning analytics have 
long explored the use of sensors to gather information on student 
learning because data can be collected in a sufficiently high 
frequency to draw rich inferences [3]. Since social interactions are 
an important part of makerspace projects, we focus this paper on 
capturing them using motion sensors. The successful use of 
motion sensors to capture student interactions has been 
suggested by a couple of researchers [4;9]. One common thread in 
these previous works is the use of physical proximity as a 
rudimentary proxy for interaction. While being in close proximity 
is a necessary condition for interaction to occur, it is arguably not 
a sufficient condition. Therefore, in addition to the use of physical 
proximity as an indication for interactions, this study will layer on 
two other criteria (see Section 5.3). In essence, we hope that the 
use of quantitative data from motion sensors can paint a broader 
picture of women's experiences in makerspaces to improve 
instructor support and inclusivity. 
3. CONTEXT OF STUDY 


Quantitative data was collected from 14 female and 8 male 
students enrolled in a 15-week makerspace course (no students 
identified as non-binary). Kinect sensors were deployed 24/7 to 
gather skeletal joint data from students and survey tools were used 
weekly to assess students’ learning experiences. 


3.1 Course overview 

The graduate-level makerspace course took place at a school of 
education in the northeastern part of the United States. With a 
focus on digital fabrication, the course aims to equip students with 
the necessary skills and knowledge to handle modern tools such as 
laser cutters. Each week, students are expected to work on a 
course assignment that typically involves the creation of a digital 
fabrication product for educational purposes. Depending on the 
requirements of the assignment, students could either work 
individually or collaborate. In addition to these weekly 
assignments, students also pair up to complete mid-term and final 
projects. While instructional support is available in the form of 
office hours and individual consultations, students largely work 
independently with minimal intervention from instructors. 


Because of the open-ended nature of makerspaces, the course is 
designed with several scaffolds. Every week, the same cycle of 
design-prototype-create is adopted for each course assignment. In 
this manner, students continually receive opportunities to refine 
their skills. The presence of these weekly cycles also provides the 
research team with a natural unit of analysis and all quantitative 
data collected is aggregated at the week level. 


3.2 Kinect Setup 

Six Kinect v2 sensors were deployed in the makerspace to collect 
skeletal joint data. The sensors were positioned to achieve 
maximum coverage of the space (see Figure 1). When an 
individual’s presence is detected in the Kinect’s field of vision, 
the Kinect starts to record the x,y,z coordinates of the individual’s 
head joint, left and right shoulder joints, left and right elbow 
joints, and left and right hand joints. When there are multiple 
individuals present in the space, each Kinect sensor has the 
capability of recording up to 6 individuals at 30 Hz (i.e., 30 
observations per second). 


4. RESEARCH QUESTIONS 


RQ1: What gender differences can be extracted from quantitative 
data collected from a makerspace? 


RQ2: Which quantitative factors can account for students’ 
development of a sense of empowerment and community spirit? 


We examined students’ sense of empowerment and community 
spirit in the second research question because these are key 
attributes of a maker mindset [5]. 


5. METHODS 


In order to investigate how different genders work and interact in 
the makerspace, we constructed social networks from Kinect 
observations and derived network measures for each student 
(described in sections 5.2 and 5.3). Additionally, we conducted 
weekly surveys of students to better understand their learning 
experiences (section 5.1). These surveys not only served as a 
triangulation measure for the Kinect observations, but also 
complemented the data by providing a more holistic description. 


Figure 1. Layout of makerspace with positions and fields of vision 
of the Kinect sensors (top). Picture of the makerspace (bottom) 


5.1 Survey Data 

Surveys were administered to students after class each week. 
These surveys were crafted based on a literature review of surveys, 
and to validate the questions, we conducted a validation study 
with students from a previous iteration of the course. 


Table 1. Details of the surveys administered 

Dimensions         Survey item              Scale        Source 
Cognitive          - Tool Use               Likert 1-7   General 
                   - Time Committed         Numerical    questions 
Non-cognitive      - Perceived Competence   [illegible]  [illegible] 
                   - Self-Regulation 
                   - Motivation 
Affective          - Frustrated             Likert 1-5   [14] 
                   - Nervous 
                   - Interested 
                   - Inspired 
Maker’s attribute  - Sense of empowerment   Likert 1-5   [5] 
                   - Community spirit 
Maker’s mindset    - Can-do attitude        Likert 1-7   [5] 
                   - Empathy 
                   - Curiosity 
                   - Perseverance 
                   - Resourceful 
                   - Collaborative 


Referencing Table 1, students’ learning experiences were captured 
based on three dimensions: cognitive, non-cognitive, and 
affective. The two dimensions of maker’s attribute and maker’s 
mindset act as proxies for student outcomes. To determine gender 
differences, we conducted t-tests on these survey scores. 


5.2 Kinect Data 


Kinect observations were used for this study to infer the social 
interactions amongst students. Examining student interactions is 
key because communities represent an indispensable resource for 
students working in an open-ended environment. The following 
steps were taken to clean and process the Kinect data. 
1) Student identification: Even though the Kinect sensors cannot 
establish the identities of students, they capture video 
images from their fields of vision. These images were fed into 
OpenFace [1] to identify students. 


2) Data homography: The coordinate system that the sensors 
operate in is relative to the actual positions of the sensors in the 
space. Hence, there is a need to convert the data into a coordinate 
system that better represents the 3D positions of the skeletal 
joints. Data homography was used to achieve this. A research 
team member stood in front of each sensor at nine different 
locations, forming a grid. Using the marked-out grid locations on 
the floorplan of the space and the measured positions of the 
skeletal joints, the coordinate system of sensors was translated 
into a coordinate system that is based on the floor plan (Figure 1). 


3) Deduplication of skeletal joints: Finally, data from all six 
sensors were combined into a single coordinate system. However, 
because the sensors had overlapping fields of view, there was a 
possibility that multiple sets of skeletal joints were recorded for 
the same individual. In this case, deduplication was carried out to 
remove the additional skeletal joints for the same person. 
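
Step 2 above can be illustrated with a planar homography fitted by least squares from calibration points. The sketch below is an assumption-laden example, not the pipeline used in the study: it works only with the two horizontal coordinates of a joint, the function names are hypothetical, and the nine calibration points are synthetic (a rotation plus a translation between one sensor's frame and the floor plan).

```python
def solve_linear(matrix, rhs):
    """Solve a square linear system by Gauss-Jordan elimination with partial pivoting."""
    n = len(rhs)
    aug = [row[:] + [rhs[i]] for i, row in enumerate(matrix)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        if abs(aug[pivot][col]) < 1e-12:
            raise ValueError("singular system; calibration points may be degenerate")
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for r in range(n):
            if r != col:
                factor = aug[r][col] / aug[col][col]
                for c in range(col, n + 1):
                    aug[r][c] -= factor * aug[col][c]
    return [aug[i][n] / aug[i][i] for i in range(n)]


def fit_homography(src, dst):
    """Least-squares fit of the 8 homography parameters (the ninth is fixed to 1)."""
    rows, rhs = [], []
    for (x, y), (u, v) in zip(src, dst):
        rows.append([x, y, 1.0, 0.0, 0.0, 0.0, -u * x, -u * y])
        rhs.append(u)
        rows.append([0.0, 0.0, 0.0, x, y, 1.0, -v * x, -v * y])
        rhs.append(v)
    # Normal equations: (A^T A) h = A^T b.
    ata = [[sum(r[i] * r[j] for r in rows) for j in range(8)] for i in range(8)]
    atb = [sum(r[i] * b for r, b in zip(rows, rhs)) for i in range(8)]
    return solve_linear(ata, atb)


def apply_homography(h, point):
    """Map a point from the sensor frame to floor-plan coordinates."""
    x, y = point
    w = h[6] * x + h[7] * y + 1.0
    return ((h[0] * x + h[1] * y + h[2]) / w, (h[3] * x + h[4] * y + h[5]) / w)


# Synthetic 3x3 calibration grid in one sensor's frame (sideways offset x, depth z)
# and the matching floor-plan positions (u = z + 3, v = 2 - x), in meters.
src = [(x, z) for x in (-1.0, 0.0, 1.0) for z in (1.0, 2.0, 3.0)]
dst = [(z + 3.0, 2.0 - x) for x, z in src]
h = fit_homography(src, dst)
floor_xy = apply_homography(h, (0.5, 2.5))
```

One transform would be fitted per sensor, after which the joints from all six sensors share the floor-plan frame and duplicates of the same person can be merged.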


5.3 Social Network Analysis 

Once the Kinect data was processed, social networks were 
constructed. The social networks are built based on the episodes 
of student interaction. A student is said to have interacted with 
another if both students are within one meter of each other, have 
significant amounts of hand movement, and are either both sitting 
or both standing. The first criterion is based on the theory of proxemics 
[6], which states that humans maintain a comfortable distance of 
one meter during interactions. Admittedly, a proximity of one 
meter is a necessary but not sufficient criterion for establishing 
social interactions. Therefore, two other criteria were added to 
increase the probability of capturing true episodes of student 
interaction. The second criterion is based on the hands-on nature 
of the makerspace course. For two students who are in close 
proximity, having significant amounts of hand movement is likely 
an indication of collaboration. The third criterion is based on the 
observation that students tend to share the same eye level when 
working with each other. It is rare to observe two students 
working together with one individual standing and another sitting. 
While these three criteria are not perfect indicators of social 
interactions, observations of students working in the makerspace 
and crosschecks conducted by looking at screenshots from the 
sensors validated their use as a proxy for social interactions. 
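
The three criteria can be combined into a single check, as in the sketch below. The data layout, the function names, and in particular the thresholds (one meter for distance, 0.2 m of hand travel within a time window) are hypothetical. The text does not specify how "significant" hand movement or posture was operationalized, so the posture labels are assumed to be given.

```python
from math import dist


def hand_movement(hand_positions):
    """Total path length (in meters) travelled by a hand over a window of frames."""
    return sum(dist(a, b) for a, b in zip(hand_positions, hand_positions[1:]))


def is_interacting(a, b, max_distance=1.0, min_hand_movement=0.2):
    """Apply the proximity, hand-movement, and same-posture criteria to two students."""
    close = dist(a["head_xy"], b["head_xy"]) <= max_distance
    both_moving = (
        hand_movement(a["hand_path"]) >= min_hand_movement
        and hand_movement(b["hand_path"]) >= min_hand_movement
    )
    same_posture = a["posture"] == b["posture"]
    return close and both_moving and same_posture


# Two hypothetical students observed over the same short window of frames.
alice = {
    "head_xy": (0.0, 0.0),
    "hand_path": [(0.0, 0.0, 1.0), (0.1, 0.0, 1.0), (0.2, 0.1, 1.0)],
    "posture": "sitting",
}
bob = {
    "head_xy": (0.6, 0.3),
    "hand_path": [(0.5, 0.0, 1.0), (0.5, 0.2, 1.0), (0.5, 0.2, 1.1)],
    "posture": "sitting",
}
interacting = is_interacting(alice, bob)
```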


After we identified episodes of student interactions, social 
networks were generated based on the amount of time each 
student spent interacting with others. In essence, the nodes of the 
social network represent the individual students while the edges 
between nodes are weighted according to the amount of 
interaction time spent between students. From the weekly social 
networks, network measures were computed to obtain weekly 
network scores for each student. 


Table 2. Details of network measures used 

Network measures    Definition                                    Scale 
Degree Centrality   This represents the fraction of nodes that    0 to 1 
                    a node is connected to. 
Average edge        This is the mean of all the edge weights of   0 to inf 
weight              all the edges connected to a node. 
EI homophily        This index is calculated by taking the        -1: Complete homophily 
index               difference between out-group and in-group     (only in-group 
                    connections and dividing by the total         connections) 
                    number of connections. For instance, in EI    1: Complete heterophily 
                    gender, a node for a female student would     (only out-group 
                    have male connections as out-group            connections) 
                    connections and female connections as         0: Equal number of 
                    in-group connections.                         in-group and out-group 
                                                                  connections. 
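
A minimal sketch of the three measures in Table 2 for one node of a weekly network follows. The edge list, the student labels, and the function name are hypothetical, and the EI index here counts connections (ties) rather than weighting them by interaction time, which is an assumption about how the definition was applied.

```python
def node_measures(edges, groups, node):
    """Degree centrality, average edge weight, and EI index for one student.

    edges maps an undirected pair (a, b) to interaction time in seconds;
    groups maps every student in the network to a gender label.
    """
    neighbors = {}
    for (a, b), weight in edges.items():
        if weight <= 0:
            continue
        if a == node:
            neighbors[b] = weight
        elif b == node:
            neighbors[a] = weight

    degree = len(neighbors) / (len(groups) - 1)   # fraction of other nodes reached
    if not neighbors:
        # Average weight and EI index are undefined for an isolated student.
        return {"degree": degree, "avg_weight": None, "ei": None}

    external = sum(1 for other in neighbors if groups[other] != groups[node])
    internal = len(neighbors) - external
    return {
        "degree": degree,
        "avg_weight": sum(neighbors.values()) / len(neighbors),
        "ei": (external - internal) / (external + internal),
    }


# Hypothetical weekly network: three female (F) and two male (M) students.
groups = {"A": "F", "B": "F", "C": "F", "D": "M", "E": "M"}
edges = {("A", "B"): 120.0, ("A", "C"): 60.0, ("A", "D"): 30.0, ("D", "E"): 90.0}
measures_a = node_measures(edges, groups, "A")
measures_d = node_measures(edges, groups, "D")
```

In this toy network, student A has two in-group ties and one out-group tie, giving a negative EI index, while student D has one of each, giving an index of zero.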


T-tests of the network measures were then conducted to extract 
gender differences, which addresses the first research question. 
For the second research question, the identified gender differences 
were used to build regression models for students’ sense of 
empowerment and community spirit. 


6. RESULTS 


RQ1: What gender differences exist in a makerspace (from the 
quantitative measures)? 


Table 3. Results of t-tests for gender differences 

Measures             Statistical differences (t-test) 
Survey: Perceived    Male students (mean=5.2) reported having a higher level 
Competence           of perceived competence than female students (mean=4.8), 
                     t(20)=2.25, p=0.03. 
Survey: Interested   Male students (mean=4.0) reported feeling more interested 
                     in the course than female students (mean=3.7), 
                     t(20)=2.21, p=0.03. 
Survey: Inspired     Male students (mean=3.7) reported feeling more inspired 
                     in the course than female students (mean=3.4), 
                     t(20)=2.22, p=0.03. 
Survey: Empowerment  Male students (mean=3.9) reported having a greater sense 
                     of empowerment than female students (mean=3.5), 
                     t(20)=3.11, p=0.002. 
Survey: Empathy      Female students (mean=6.0) reported having more empathy 
                     than male students (mean=5.5), t(20)=2.14, p=0.04. 
Survey: Perseverance Female students (mean=5.9) reported having more 
                     perseverance than male students (mean=5.3), 
                     t(20)=2.71, p=0.008. 
Network measure:     Male students (mean=0.02) have more diverse gender 
EI gender            interactions than female students (mean=-0.16), 
                     t(20)=4.19, p<0.001. 


Several items were different for males and females. First, males 
reported having higher perceived competence, which suggests that 
males are more confident individuals when it comes to assessing 
their abilities. Second, males recounted feeling more interested 
and inspired in the course. This shows that males possess more 
positive feelings towards the course. The lack of statistical 
significance for the negative affective states indicates that males 
and females might be struggling equally in the course. Third, 
males described developing a stronger sense of empowerment. 
This implies that males feel they have benefitted from the course 
and can move on to accomplish more challenging tasks. 


Although males reported doing better in the course than females, 
the t-test results also indicate that females may possess some 
alternate mechanisms for thriving in the course. Females score 
higher on empathy, which suggests females relate better to others 
in the community. Additionally, females score higher on 
perseverance, which hints at positive struggles from females. 
Lastly, for the network measures, females score more negatively 
in the EI index for gender, which implies that females interact 
more with other females, possibly for more community and 
emotional support. 


RQ2: Which quantitative factors can account for students’ 
development of a sense of empowerment and community spirit? 


Findings from the survey data suggest that male and female 
students in this study differ in their perceived competence, 
positive feelings, empathy, perseverance and diversity in gender 
interactions. A linear regression model was built based on these to 
predict the students’ sense of empowerment and community spirit. 


Table 4. Regression models for sense of empowerment and 
community spirit (*p<0.05, **p<0.01, ***p<0.001) 

Outcomes (→)          Sense of empowerment**   Community spirit*** 
Predictors (↓) 
                      F (4,17) = 6.85          F (1,20) = 16.72 
                      p-value = 0.0018         p-value = 0.0006 
                      R² = 0.6172              R² = 0.4554 
                      RMSE = 0.7915            RMSE = 1.0513 
const                 coef. = -1.034           coef. = -2.672 
                      S.E. = 1.582             S.E. = 1.614 
                      p-value = 0.522          p-value = 0.113 
Empathy                                        coef. = 1.123** 
                                               S.E. = 0.275 
                                               p-value = 0.001 
Perceived competence  coef. = 0.770 
                      S.E. = 0.369 
                      p-value = 0.052 
Positive feelings     coef. = 0.232* 
                      S.E. = 0.109 
                      p-value = 0.048 
Perseverance          coef. = 0.954** 
                      S.E. = 0.242 
                      p-value = 0.001 
EI gender             coef. = -0.441** 
                      S.E. = 0.111 
                      p-value = 0.001 


Based on the regression analysis, students’ positive feelings, 
perseverance, and diversity in gender interactions are significant 
predictors for their sense of empowerment. Even though 
perceived competence is not statistically significant, its 
p-value of 0.052 hints that it might be a contributing factor to 
students’ sense of empowerment (which corroborates RQ1’s 
findings). Similarly, the presence of positive feelings in the model 
echoes previous findings of males having more positive feelings 
and a greater sense of empowerment. However, it is unclear if 
students developed a greater sense of empowerment due to their 
positive feelings or if students felt more positive because they 
experienced empowerment. Lastly, the inclusion of perseverance 
and diversity in gender interactions as significant predictors 
demonstrates that initial learning difficulties in makerspaces can 
be overcome if one perseveres and that reaching out to fellow 
members for peer support can aid in the process of learning. Since 
females have expressed higher levels of perseverance and more 
in-group preferences previously, this finding reveals a potential 
pathway for female students to develop a sense of empowerment. 


The regression analysis of community spirit shows that empathy 
is the sole significant predictor. Furthermore, the regression model 
with only empathy included has an R² value of 0.4554, which 
means that empathy as a factor alone can explain close to half of 
the variability in community spirit. This is not an unexpected 
finding as empathy remains a much-needed ingredient for the 
fostering of good relationships. This result also hints at possible 
contributions from females in building makerspace communities 
since they possess higher levels of empathy. 
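
The fit statistics in Table 4 can be cross-checked against one another. For an ordinary least squares model with k predictors and n - k - 1 residual degrees of freedom, the overall F statistic satisfies F = (R²/k) / ((1 - R²)/(n - k - 1)). The sketch below applies this identity to the reported R² values; the function name is illustrative.

```python
def f_statistic_from_r2(r_squared: float, df_model: int, df_resid: int) -> float:
    """Overall F statistic of an OLS model implied by its R^2."""
    explained = r_squared / df_model
    unexplained = (1.0 - r_squared) / df_resid
    return explained / unexplained


# R^2 values and degrees of freedom as reported in Table 4.
f_empowerment = f_statistic_from_r2(0.6172, df_model=4, df_resid=17)
f_community = f_statistic_from_r2(0.4554, df_model=1, df_resid=20)
```

Rounded to two decimals, both values agree with the F statistics reported in the table (6.85 and 16.72).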


7. DISCUSSION 


The findings of this paper indicate that males in this study are 
more confident in their technical ability and have more positive 
feelings associated with the makerspace. These findings run 
parallel to qualitative results in the literature which show that 
males tend to display more initial interest in makerspaces and 
technically oriented making activities [8]. While males self- 
reported more confidence in their abilities, females in this study 
were more persistent. Additionally, females reported higher 
measures of empathy and tended to interact more with other 
females when in the makerspace. These results are in line with 
qualitative findings from [11] indicating that females tend to 
appreciate having other females in the space. 


In terms of promoting gender inclusion, the methods used in this 
study can help reveal to instructors the salient differences between 
genders operating in their own makerspaces. When awareness of 
gender differences is promoted, instructors can be naturally 
prompted to take more active steps to cater to distinct learning 
needs. Additionally, these findings serve as a reminder for 
instructors to avoid taking on a deficit view of any gender. On the 
surface, it might appear that males are thriving better than females 
in makerspaces, but the lack of statistical significance for the 
negative affective states signals that males and females struggle 
equally. Instead, our results suggest that males and females thrive 
in their unique ways in makerspaces, with males using their 
higher individual self-efficacy, and females using their greater 
group empathy skills. Neither males nor females should be viewed 
in a deficit perspective, and the removal of any gender bias would 
certainly go a long way in promoting gender inclusion. 


Limitations of the current study include the relatively small 
sample size and the fact that the survey results were based on self- 
reported measures. These factors call into question the 
generalizability of our findings, and future work should seek to 
corroborate these results. Additionally, any reader of these 
findings must be careful to not fall into gender stereotypes. These 
results are reported on an aggregated basis, which may or may not 
be applicable to any individual student. Moreover, these findings 
are a result of our observations conducted in this particular study. 
Nonetheless, the findings demonstrate the feasibility of an 
approach that can be used by instructors to uncover gender 
differences in their own makerspaces. 


8. CONCLUSION 


The current paper examined gender differences in makerspaces 
and the factors that contributed to students’ development of a 
sense of empowerment and community spirit. T-test results 
indicate that there are different pathways for male and female 
students to thrive in makerspaces and regression analyses 
highlight the quantitative factors that can account for students’ 
development of a sense of empowerment and community spirit. 
This work presents preliminary steps in designing an automated 
system for instructional use to support gender inclusion. 


ABSTRACT 


Recurrent neural networks (RNNs) achieve state-of-the-art results 
in several studies of student performance prediction. However, 
accuracy in early time steps is lower than that in late time 
steps, even though the early detection of at-risk students is 
important for timely interventions. To improve the accuracy 
in early time steps, we propose a knowledge distillation 
method for RNNs. Our method distills the time-series 
information in the RNN model of late time steps into the RNN 
model of early time steps. This distillation makes the 
prediction of early time steps closer to that of late time steps. 
The experimental results showed that our method improved the 
detection rate of at-risk students compared with traditional 
RNNs, especially in early time steps. 


Keywords 
Student performance prediction, Early detection of at-risk 
students, Recurrent neural network, Knowledge distillation 


1. INTRODUCTION 


The detection of at-risk students is an essential task to en- 
sure intervention as early as possible. At-risk students are 
those who may drop out of lecture courses and have low 
scores (e.g., grade point averages and quiz scores). When 
potential at-risk students are automatically detected in the 
early stage of courses, teachers can have sufficient time to 
encourage them to continue learning. 


In recent years, prediction models based on recurrent neural 
networks (RNNs) have reached high performance [1, 3, 
6, 7, 10, 11, 14, 19]. RNNs can handle time-series information 
such as weekly learning behavior and predict students’ 
performance in each time step. Therefore, RNNs can detect 
at-risk students in each time step such as after each lecture. 
However, prediction accuracy in early time steps is lower 
than that in late time steps because it is difficult for RNNs 
to extract representative features from only the time-series 
information in early time steps. 


Figure 1: KD for the earlier detection of at-risk students. 


To solve this problem, we propose a novel training strat- 
egy for improving the prediction in early time steps. Figure 
1 shows an overview of our proposed method. Traditional 
RNNs can extract more representative features in later time 
steps, and prediction accuracy can also increase because 
RNNs can use longer time-series information. If RNNs can 
obtain more representative features from the inputs of ear- 
lier time steps, they can detect at-risk students earlier and 
maintain detection accuracy. 


To transfer extracted features, we use knowledge distillation 
(KD) [4]. KD is a compression method for deep neural 
networks (DNNs), and many methods have been proposed in 
several fields such as visual recognition [2, 8] and natural 
language processing [5, 13, 15]. In KD, the model is compressed 
by training a small DNN model (student model) from a large 
DNN model (teacher model); that is, the knowledge in the 
teacher model is distilled to the student model. Further, KD 
does not require new annotations. In our method, KD is 
applied to transfer the representative features extracted from 
longer time-series information. As shown in Figure 1, this 
distillation makes the prediction of early time steps closer to 
that of late time steps, allowing us to detect at-risk students 
earlier. 


The contributions of this study are summarized as follows. 


• We introduce KD to predict students’ performance. 
To the best of our knowledge, this is the first study to 
apply KD to performance prediction. 


• We propose the RNN-FitNets model to improve early 
performance prediction. This model performs as if the 
learning behaviors in all the time steps are inputted, 
even though the model only receives the learning 
behavior in early time steps. 


• We evaluate the effectiveness of our model for 
detecting at-risk students based on the learning logs collected 
from a higher education course. 


2. KNOWLEDGE DISTILLATION RNN MODEL 

In this study, we propose RNN-FitNets, which is an 
integration of RNNs and FitNets [12]. RNN-FitNets distills the 
well-extracted features of the later time steps into the RNNs 
of the earlier time steps by using the architecture of 
FitNets. Therefore, RNN-FitNets can improve the prediction 
accuracy in the earlier time steps. For example, as shown in 
Figure 1, RNN-FitNets can extract representative features 
in time step 3, whereas traditional RNNs obtain the same 
feature in time step T. 


Figure 2 shows the architecture. The teacher model is 
pre-trained using all the time steps $(1, 2, \ldots, T)$, and the student 
model is trained until time step $t$ $(1 \le t \le T)$. During the 
pre-training of the teacher model and training of the student 
model, the same ground truth holds (e.g., the final grade 
is passed to all the time steps). The teacher and student 
models have the same structure; only the time steps differ 
between them. Therefore, unlike FitNets, no regressor that 
transforms the size of the hidden layer of the student model 
exists. 


The student model is trained using two steps in each 
training epoch, as with FitNets. First, it updates its parameters, 
except for those of the output layer. Given the t-th time step 
feature vector of the student model as $h^{\mathrm{stu}}_{t}$ and the T-th time 
step feature vector of the teacher model as $h^{\mathrm{tea}}_{T}$, the 
parameters are updated by minimizing the following hint loss 
function $L_{HT}$: 


$$L_{HT} = \frac{1}{2} \left\| h^{\mathrm{tea}}_{T} - h^{\mathrm{stu}}_{t} \right\|^{2} \qquad (1)$$


After updating the parameters, the entire student model, 
including the output layer, is updated by minimizing the 
distillation loss. Given the outputs of the student model as 
$y^{\mathrm{stu}}_{1}, y^{\mathrm{stu}}_{2}, \ldots, y^{\mathrm{stu}}_{t}$, the T-th output of the teacher model as 
$y^{\mathrm{tea}}_{T}$, and the ground truth as $y_{\mathrm{true}}$, the distillation loss 
$L_{KD}$ is calculated as follows: 


$$L_{KD} = \sum_{i=1}^{t} H\left(y_{\mathrm{true}}, y^{\mathrm{stu}}_{i}\right) + \lambda H\left(y^{\mathrm{tea}}_{T}, y^{\mathrm{stu}}_{t}\right) \qquad (2)$$


where H refers to the cross-entropy and λ is a hyperparameter 
that balances both cross-entropies. 


Figure 2: RNN-FitNets. 


Table 1: Grade point average distribution. 

GPA                   A    B    C    D    F 
Number of students    25   50   16   12   5 
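
The two losses in Eqs. (1) and (2) can be sketched in plain Python as follows. This is a minimal illustration of the formulas only, not the authors' training code: the function names, the small `eps` term that guards against log(0), and all numeric values are assumptions made for the example.

```python
import math


def hint_loss(h_student_t, h_teacher_T):
    """Eq. (1): half the squared Euclidean distance between hidden states."""
    return 0.5 * sum((ht - hs) ** 2 for hs, ht in zip(h_student_t, h_teacher_T))


def cross_entropy(target, predicted, eps=1e-12):
    """H(target, predicted) for two discrete probability distributions."""
    return -sum(p * math.log(q + eps) for p, q in zip(target, predicted))


def distillation_loss(y_true, student_outputs, teacher_output_T, lam):
    """Eq. (2): cross-entropy to the ground truth at every student time
    step, plus lam times the cross-entropy between the teacher's final
    output and the student's last output."""
    supervised = sum(cross_entropy(y_true, y_s) for y_s in student_outputs)
    distilled = cross_entropy(teacher_output_T, student_outputs[-1])
    return supervised + lam * distilled


# Toy example with t = 2 student time steps; all numbers are made up.
y_true = [0.0, 1.0]                          # one-hot label
student_outputs = [[0.5, 0.5], [0.2, 0.8]]   # softmax outputs at steps 1 and 2
teacher_output_T = [0.1, 0.9]                # teacher's softmax output at step T
loss_ht = hint_loss([0.0, 1.0], [1.0, 1.0])
loss_kd = distillation_loss(y_true, student_outputs, teacher_output_T, lam=2)
```

Because the first term sums t cross-entropies, setting λ in proportion to t keeps the teacher term from being outweighed as the student sees more time steps.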


3. EXPERIMENT 


3.1 Dataset 

We used the same dataset as [10]. The data were collected 
from the Information Science course at Kyushu University. 
This course started in April 2016 and 15 lectures were held 
weekly. Table 1 shows the grade point average of the 108 
students that took this course. More than two-thirds of 
students received an “A” or “B.” In this course, the teacher and 
students used a learning support system called M2B [9]. The 
M2B system consists of three subsystems: the learning 
management system, Moodle; the e-portfolio system, Mahara; 
and the e-book system, BookLooper. Moodle recorded 
students’ attendance, submission of reports, and access to the 
course. Mahara recorded students’ logbook entries for each 
lecture of the course. BookLooper recorded students’ reading 
behavior such as turning pages, drawing highlights, and taking 
notes. 


We also applied the feature engineering method used by [10]. 
The collected data were converted into active learner points, 
as shown in Table 2. As shown in the table, the learning 
behavior of each lecture was evaluated on a scale from 0 to 5. 
Attendance and report submission were evaluated 
based on whether the activities were on time, late, or not 
completed. The quiz was evaluated based on the ratio of 
correct answers. The other behaviors were evaluated by 
comparing the students in each lecture. Before inputting 
these features into the prediction model, the evaluated 
values were divided by 5 (i.e., they were normalized within the 
range of 0 to 1). 


Table 2: Criteria for active learner points. 

Activities                     5            4           3            2           1           0 
Attendance                     Attendance               Being late                           Absence 
Quiz                           Above 80%    Above 60%   Above 40%    Above 20%   Above 10%   Otherwise 
Report                         Submission               Late                                 None 
Course accesses                Upper 10%    Upper 20%   Upper 30%    Upper 40%   Upper 50%   Otherwise 
Word count in Mahara           Upper 10%    Upper 20%   Upper 30%    Upper 40%   Upper 50%   Otherwise 
Reading time in BookLooper     Upper 10%    Upper 20%   Upper 30%    Upper 40%   Upper 50%   Otherwise 
Highlights in BookLooper       Upper 10%    Upper 20%   Upper 30%    Upper 40%   Upper 50%   Otherwise 
Notes in BookLooper            Upper 10%    Upper 20%   Upper 30%    Upper 40%   Upper 50%   Otherwise 
Total actions in BookLooper    Upper 10%    Upper 20%   Upper 30%    Upper 40%   Upper 50%   Otherwise 
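
The threshold-based and rank-based criteria in Table 2 can be sketched as below. This is an illustrative reading of the table rather than the original preprocessing code: the function names are hypothetical, "Above 80%" is assumed to mean strictly greater than 80%, and ties are assumed to share the best rank.

```python
def quiz_points(correct_ratio: float) -> int:
    """Score a quiz from the ratio of correct answers (0.0 to 1.0)."""
    for threshold, points in ((0.8, 5), (0.6, 4), (0.4, 3), (0.2, 2), (0.1, 1)):
        if correct_ratio > threshold:
            return points
    return 0


def upper_percent_points(values):
    """Score each student by rank within one lecture (upper 10% -> 5, ...)."""
    n = len(values)
    scores = []
    for v in values:
        # Rank from the top: 1 + number of students with a strictly larger value.
        rank = 1 + sum(1 for other in values if other > v)
        points = 0
        for percent, pts in ((10, 5), (20, 4), (30, 3), (40, 2), (50, 1)):
            if rank * 100 <= percent * n:  # integer form of rank / n <= percent%
                points = pts
                break
        scores.append(points)
    return scores


def normalize(points: int) -> float:
    """Scale a 0-5 score into the range 0 to 1 before it is fed to the model."""
    return points / 5


# Hypothetical reading times (minutes) for a class of ten students.
reading_minutes = [50, 45, 40, 35, 30, 25, 20, 15, 10, 5]
reading_points = upper_percent_points(reading_minutes)
```

With ten students, each of the top five ranks falls into its own 10% band, and the lower half of the class scores zero.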


3.2 Evaluation Criteria 

We applied 5-fold cross-validation to the 108 students in the 
dataset. The folds were made by preserving the percentage 
of samples for each student’s grade of “A,” “B,” “C,” “D,” 
and “F.” After the separation, we grouped grades “A” and 
“B” into the “no-risk” class and grades “C,” “D,” and “F” into 
the “at-risk” class because more than two-thirds of students 
received “A” or “B” (see Table 1). Therefore, we conducted a 
binary classification between “no risk” (“A” or “B”) and “at- 
risk” (“C,” “D,” or “F”). For the evaluation, we calculated the 
recall, precision, and F-measure values for detecting at-risk 
students. 
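
The binary grouping and the three evaluation values can be sketched as follows, with the at-risk class treated as the positive class. The function names and the six-student example are hypothetical and serve only to illustrate the definitions.

```python
AT_RISK_GRADES = {"C", "D", "F"}


def to_binary(grade: str) -> int:
    """1 for the at-risk class (C, D, F), 0 for the no-risk class (A, B)."""
    return 1 if grade in AT_RISK_GRADES else 0


def at_risk_metrics(true_labels, predicted_labels):
    """Recall, precision, and F-measure with at-risk (1) as the positive class."""
    pairs = list(zip(true_labels, predicted_labels))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    recall = tp / (tp + fn) if tp + fn else 0.0
    precision = tp / (tp + fp) if tp + fp else 0.0
    f_measure = (2 * precision * recall / (precision + recall)
                 if precision + recall else 0.0)
    return recall, precision, f_measure


# Hypothetical fold: final grades and model predictions for six students.
grades = ["A", "B", "C", "D", "F", "B"]
fold_true = [to_binary(g) for g in grades]
fold_pred = [0, 1, 1, 1, 0, 0]
recall, precision, f_measure = at_risk_metrics(fold_true, fold_pred)
```

Recall here measures how many at-risk students are caught, while precision measures how many flagged students are truly at risk.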


3.3 Comparison Models 

To investigate the effectiveness of our model, we compared 
the evaluation values for predicting the final grades between 
the following three types of models: 


• RNN baseline model 


Training the RNN-based prediction model using the 
learning behavior in all lecture weeks. 


• Week-by-week model 


Training the RNN-based prediction model using the 
learning behavior in each lecture week. Therefore, 
there were 15 independent models (trained by only 
first-week behavior, trained until the second week of 
behavior, and so on). 


• RNN-FitNets 


Training the student model from the RNN baseline 
model as the teacher model in each lecture week. As 
with the week-by-week model, there were 15 student 
models. 


The three types of comparison models had the same 
architecture. We set the batch size to one. The number of time 
steps took an integer value from 1 to 15 when the three types of 
comparison models predicted at-risk students. When the 
models were trained, the RNN baseline model used 15 time 
steps and the other models used the same number of time steps 
as in prediction. The input features of the model were the active 
learner points shown in Table 2; therefore, the number of 
features was nine. For the hidden layer, we used a GRU with 
32 units and the tanh activation function. The output 
layer had two units with the softmax activation function. 
We used the RMSprop optimizer [16] for the hint loss and 
distillation loss. In both optimizations, we set the learning 
rate to 0.001. In addition, we applied L2 regularization with 
a parameter of 0.004 for the optimization of the weights and 
biases in the hidden and output layers of the RNN baseline 
and the week-by-week model. λ in the distillation loss (Eq. 
(2)) was equal to the time step; for example, when 
RNN-FitNets was trained using the learning behavior until the 
second week, we set λ to 2. All models were trained for 50 
epochs. 


3.4 Experimental Result 
Figure 3 illustrates the evaluation of the three types of mod- 
els. We summarize the results as follows: 


• In most time steps, the recall values of the RNN-FitNets 
were higher than the values of the RNN baseline and 
week-by-week models. In other words, RNN-FitNets 
detected more at-risk students than other models. 


• However, the precision values of the RNN-FitNets were 
lower than the values of the RNN baseline model, i.e., 
RNN-FitNets misdetected more no-risk students as 
at-risk. 


• As shown by the F-measure values, the RNN-FitNets’ 
values were higher than those of the RNN baseline and 
week-by-week models in most time steps. This 
difference was marked in early time steps. Therefore, 
the increase in the detection of at-risk students 
outweighed the increase in the misdetection, especially in 
early time steps. 


• Comparing the evaluation values of the RNN baseline 
model with those of the week-by-week model, the 
former was superior in early time steps, although the 
values of the week-by-week model were close to or 
outperformed those of the RNN baseline model in late time 
steps. 


Figure 3: Evaluation of three types of models. 


Figure 4: Visualization of the extracted feature vectors in the 
three types of models by t-SNE. 


3.5 Discussion 

The experimental result showed that the proposed models 
improved the detection rate of at-risk students, especially 
in early time steps. This improvement resulted from the 
distillation of time-series information. The evaluation values 
of the RNN baseline model were higher than those of the 
week-by-week model in early time steps. This result implies 
that the time-series information obtained by training in all 
the time steps is effective for early detection. In 
RNN-FitNets, the time-series information was explicitly passed 
through KD, which improved the model’s performance. 


To investigate whether the time-series information was dis- 
tilled into the models in early time steps, we visualized the 
extracted feature of the three types of models. Figure 4 
shows the visualization results of the feature vectors. 
Because the models have a 32-dimensional hidden state, we 
used t-SNE [17] to reduce the 32 dimensions to two 
for the visualization. Each point represents the 
feature vector of one student in the dataset. Red points 
are at-risk students and blue points are no-risk students, 
as defined in Section 3.2. In the feature vectors 
of the RNN baseline and week-by-week models, the more 
time steps are used, the closer the red points are to each 
other and the more elongated the mass of points 
becomes. This means that the detection of at-risk students 
becomes easier in the feature vectors of late time steps. In 
the RNN-FitNets models, the tendency of the red points to 
gather and of the mass to elongate appears in early time steps. 
This result shows that our KD method properly distills the 
time-series information extracted in the late time step. 


4. CONCLUSION 

In this study, we proposed RNN-FitNets, which extends Fit- 
Nets, a KD method, for application to RNN architecture. 
RNN-FitNets transfers the time-series information extracted 
by the later time-step RNN into an earlier time-step RNN. 
Hence, the earlier time-step RNN learns the method of ex- 
tracting the representative features in late time steps from 
short time-series data. 


In the experiment, we applied RNN-FitNets to detect at- 
risk students in higher education. The results show that the 
proposed distillation model improves the detection rate of 
at-risk students from the base RNN models. The analysis of 
feature vectors indicated that our proposed model in earlier 
time steps extracted similar feature vectors to those of the 
base model in late time steps. This confirmed that our distil- 
lation strategy properly distilled the time-series information 
in later time steps into the model in earlier time steps. 


In future work, we plan to investigate the applicability of 
RNN-FitNets to other datasets. Moreover, we aim to 
formulate a new distillation method for time-series information 
for other models such as the Transformer model [18]. 


ABSTRACT 


Sequential pattern mining is a useful tool in understanding learning 
processes, but identifying the most relevant patterns can be a 
challenge. Typical sequential pattern mining algorithms and 
interestingness metrics mainly focus on finding behavior patterns 
common across all students. However, educational researchers also 
care about individual differences. This study proposes a method for 
finding sequential patterns whose usage varies widely across 
students. This method borrows techniques from the fields of lag 
sequential analysis and meta-analysis. It uses the log odds ratio to 
model the individuals’ usage of a sequential pattern and the 
heterogeneity test to examine the usage variation. We applied this 
method to analyzing student action logs in a virtual experimental 
environment and present preliminary results illustrating how the 
identification of sequential patterns with high usage variation 
provides interesting information about students' learning behavior. 
The proposed approach offers a way to understand individual 
differences in learning processes. 


Keywords 


Sequential pattern mining, learning behavior differences, log odds 
ratio, lag sequential analysis, heterogeneity test 


1. INTRODUCTION 


Sequential pattern mining (SPM) aims to find the temporal 
associations between events [1]. For example, SPM can reveal 
whether students read relevant material after answering a question 
incorrectly. Such sequential behaviors are named sequential 
patterns. SPM has shown its potential in helping researchers 
understand learning behavior [2, 3]. 


However, there are challenges when applying SPM in education. 
One important challenge is that SPM algorithms may generate 
excessive sequential patterns, most of which are uninteresting or 
irrelevant to the research purpose [2]. This increases the difficulty 
of making meaningful interpretations and producing actionable 
pedagogical insights. To address this challenge, researchers select 
sequential patterns using interestingness metrics, such as the 
support value (e.g., [4, 5]). The support value of a sequential 
pattern is the proportion of students who exhibit this pattern. As 
such, patterns with high support values will reflect similarities in 
the learners’ behavior. 


Educational researchers also care about differences among students 
[6]. The understanding of individual differences in learning is 
essential for providing learners with adaptive scaffolding. To 
address this need, this study proposes a method borrowing from lag 
sequential analysis and meta-analysis that uses the log odds ratio 
and the heterogeneity test to select sequential patterns based on 
their variation in usage across learners. 


2. Methodology 

Let $E = \{e_1, e_2, \ldots, e_p\}$ be a set of p unique events that may occur 
within a specific learning environment, such as answering a 
question and asking for a hint. Let $S_m = \{i_1, i_2, \ldots, i_N\}$ be a 
sequence of N temporally ordered items with each $i_j$ being a subset 
of E. A sequence is a student’s learning process data, such as action 
logs in an intelligent tutoring system. Each $i_j$ usually contains one 
event because students rarely initiate two different actions 
simultaneously. Let $e_x \to e_y$ be a sequential pattern where $e_y$ 
occurs after $e_x$ ($e_x$ and $e_y$ may be the same event). Let $\bar{e}_x$ denote 
an event other than $e_x$. If there are $i_k = e_x$, $i_l = e_y$, and $k < l$ in $S_m$, 
then $S_m$ contains $e_x \to e_y$ [8]. 


2.1 Using log odds ratio to model sequential pattern usage 

If we fix the gap between $e_x$ and $e_y$ to a constant c, we may use 
methods from the field of lag-sequential analysis to quantify 
students’ usage of $e_x \to e_y$ [8]. Fixing the gap to c means that we 
only consider $i_k = e_x$, $i_l = e_y$, and $l - k = c$ as an occurrence of 
$e_x \to e_y$. For example, c = 1 means that we only count the case 
where $e_y$ directly follows $e_x$. Lag-sequential analysis utilizes 
statistics from contingency table analyses to quantify the usage of 
$e_x \to e_y$, such as the odds ratio and the log odds ratio [8]. 


Let the frequency of pairs of consecutive events where the first 
event is $e_x$ and the second event is $e_y$ be $n(e_x \to e_y) = a_m$. Let the 
frequency of pairs of consecutive events where the first event is $e_x$ 
but the second event is not $e_y$ be $n(e_x \to \bar{e}_y) = b_m$. Let the frequency 
of pairs of consecutive events where the first event is not $e_x$ but the 
second event is $e_y$ be $n(\bar{e}_x \to e_y) = c_m$. Let the frequency of pairs of 
consecutive events where the first event is not $e_x$ and the second 
event is not $e_y$ be $n(\bar{e}_x \to \bar{e}_y) = d_m$. The odds ratio of $e_x \to e_y$ in $S_m$ 
can be calculated as $\frac{a_m / b_m}{c_m / d_m} = \frac{a_m d_m}{b_m c_m}$, while the log odds ratio is 
$\log \frac{a_m d_m}{b_m c_m}$. However, there is measurable bias in this expression 
when the sample is small. A slightly modified version is often used 
to reduce bias [9]: 


$$Y_m(e_x \to e_y) = \log \frac{\left(a_m + \frac{1}{2}\right)\left(d_m + \frac{1}{2}\right)}{\left(b_m + \frac{1}{2}\right)\left(c_m + \frac{1}{2}\right)} \qquad (2)$$


The log odds ratio of $e_x \to e_y$ represents the relative likelihood that 
$e_y$ occurs after $e_x$ during a student’s learning, considering the 
probability that $e_y$ occurs after an event other than $e_x$. If a 
sequential pattern contains more than two events, researchers may 
segment a sequential pattern into two sub-patterns and represent the 
sequential pattern as one sub-pattern following another. For 
example, $e_x \to e_y \to e_z \to e_w$ may be represented as 
$e_{x \to y} \to e_{z \to w}$. This preprocessing has been used in computing the 
confidence value of sequential patterns longer than two events [10]. 
Then, the above procedure can be used to calculate the log odds 
ratio. 


The variance of $Y_m(e_x \to e_y)$ is: 


$$V_m(e_x \to e_y) = \frac{1}{a_m + \frac{1}{2}} + \frac{1}{b_m + \frac{1}{2}} + \frac{1}{c_m + \frac{1}{2}} + \frac{1}{d_m + \frac{1}{2}} \qquad (3)$$


$V_m(e_x \to e_y)$ characterizes the imprecision of the log odds ratio 
and decreases as the length of $S_m$ increases. The log odds ratio 
based on a long sequence is more precise than that based on a 
short sequence [9]. 
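
For a gap of c = 1, the contingency counts, the bias-reduced log odds ratio of Eq. (2), and its variance in Eq. (3) can be computed as in the sketch below. The function names and the short action log are hypothetical and are used only to illustrate the formulas.

```python
import math


def lag1_counts(sequence, e_x, e_y):
    """Contingency counts (a, b, c, d) over pairs of consecutive events."""
    a = b = c = d = 0
    for first, second in zip(sequence, sequence[1:]):
        if first == e_x and second == e_y:
            a += 1  # e_x followed by e_y
        elif first == e_x:
            b += 1  # e_x followed by some other event
        elif second == e_y:
            c += 1  # some other event followed by e_y
        else:
            d += 1  # neither condition holds
    return a, b, c, d


def log_odds_ratio(a, b, c, d):
    """Eq. (2): bias-reduced log odds ratio (1/2 added to each cell)."""
    return math.log(((a + 0.5) * (d + 0.5)) / ((b + 0.5) * (c + 0.5)))


def log_odds_variance(a, b, c, d):
    """Eq. (3): variance of the bias-reduced log odds ratio."""
    return sum(1.0 / (n + 0.5) for n in (a, b, c, d))


# Hypothetical action log of one student, with events named A, B, and C.
action_log = ["A", "B", "A", "B", "A", "C"]
counts = lag1_counts(action_log, "A", "B")
y_m = log_odds_ratio(*counts)
v_m = log_odds_variance(*counts)
```

The added 1/2 also keeps both quantities finite when a cell count is zero, as happens for `c` in this short example.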


2.2 Ranking sequential patterns by variation across users 

We can examine whether the log odds ratio varies across 
participants via the heterogeneity test used in meta-analyses [11]. 
One commonly used heterogeneity test is the Q test [12]. In meta- 
analyses, Q is the weighted sum of the squared deviations of each 
study’s effect estimate from the weighted mean of all studies’ effect 
estimates. The weighting for each study is the inverse of the 
variance of the study’s effect estimate. Thus, in terms of the 
variation of the log odds ratio of $e_x \to e_y$, Q can be calculated using 
the formula: 


$$Q(e_x \to e_y) = \sum_{m} \frac{\left(Y_m - \bar{Y}\right)^2}{V_m} \qquad (4)$$


where $\bar{Y}$ is the weighted mean of log odds ratios, i.e., 


$$\bar{Y} = \frac{\sum_{m} Y_m / V_m}{\sum_{m} 1 / V_m} \qquad (5)$$


Under the null hypothesis of homogeneity, Q follows a chi-square 
distribution with k − 1 degrees of freedom, 
where k is the number of sequences or participants. Thus, if 
$Q(e_x \to e_y)$ is higher than the critical value for a given significance 
level (e.g., 0.05), we may conclude that the usage of $e_x \to e_y$ has 
statistically significant variation across participants. Moreover, for 
the same dataset, the number of participants is constant, and thus, 
the Qs of all sequential patterns follow the same chi-square 
distribution and are comparable. However, it is difficult to interpret 
Q because its magnitude is influenced by the number of 
participants. The I² index overcomes this issue [13]. 


$$I^2(e_x \to e_y) = \begin{cases} \dfrac{Q - (k - 1)}{Q} \times 100\%, & \text{if } Q > (k - 1) \\ 0, & \text{else} \end{cases} \qquad (6)$$


$I^2(e_x \to e_y)$ can be interpreted as the proportion of variation in the 
log odds ratio of $e_x \to e_y$ due to true between-participants 
variance. Ranking sequential patterns by Q and I² produces the 
same results because k is fixed for the same dataset. 
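
Equations (4) to (6) translate directly into a few lines of Python. The sketch below is illustrative: the function names are hypothetical, the inputs are made-up per-student log odds ratios and variances, and I² is returned as a proportion rather than a percentage.

```python
def q_statistic(log_odds, variances):
    """Eqs. (4)-(5): inverse-variance weighted sum of squared deviations
    from the weighted mean log odds ratio."""
    weights = [1.0 / v for v in variances]
    weighted_mean = sum(w * y for w, y in zip(weights, log_odds)) / sum(weights)
    return sum(w * (y - weighted_mean) ** 2 for w, y in zip(weights, log_odds))


def i_squared(q, k):
    """Eq. (6) as a proportion in [0, 1); multiply by 100 for a percentage."""
    if q <= k - 1:
        return 0.0
    return (q - (k - 1)) / q


# Two hypothetical students with different log odds ratios, equal variances.
q_varied = q_statistic([0.0, 2.0], [1.0, 1.0])
i2_varied = i_squared(q_varied, k=2)

# Three hypothetical students who use the pattern identically.
q_same = q_statistic([1.0, 1.0, 1.0], [0.5, 1.0, 2.0])
i2_same = i_squared(q_same, k=3)
```

A p-value for Q can then be read from the upper tail of the chi-square distribution with k - 1 degrees of freedom, for example with `scipy.stats.chi2.sf(q, k - 1)` if SciPy is available.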


3. Example 
This section applies the proposed method to a dataset of student 
action logs collected from a virtual experiment environment called 
LabBuddy [14]. 


3.1 Data 


3.1.1 Participants 

The data were collected from a graduate-level enzymology course 
at a university in the Netherlands. Participants were 76 graduate 
students in this course. The average age was 22.91 years old 
(SD = 1.80). Around 64.47% of the students were female. 


3.1.2 LabBuddy 

The course helped students prepare for the laboratory classes using 
LabBuddy. LabBuddy in this course contained a self-directed 
learning task, which included six research questions offered by a 
virtual tutor, Professor Kabel. Students start with proposing 
hypotheses for each question and make an experimental design via 
a flow chart to test the hypotheses (Figure 1). Each block in the 
flow chart represents a chemical method and contains details about 
the method. Each block also contains some closed questions that 
students must answer correctly before implementing the method 
and getting the raw data. Students do some calculations based on 
the raw data to get the results. The details, raw data, and 
calculations of a method are located in different subblocks of a 
block. If students are struggling with a closed question, they may 
request hints or the correct answer. Once students obtain the results, 
they may consult Professor Kabel to interpret them and either 
accept or reject their hypotheses. Students used LabBuddy for an 
average of 7.5 h distributed over three days. Their action logs were 
used for analysis. 


3.2 Analyses 

We preprocessed the action logs by removing redundant successive 
repeated actions (e.g., multiple selections of the same block) and 
contextualizing some actions (e.g., is the submitted answer to a 
closed question correct?). The preprocessing resulted in 19 unique 
events. The average number of events in a student’s action log was 
995 (SD = 363). Then, we implemented our method via the following 
procedure: 


1. Apply the cSPADE algorithm to find frequent sequential 
patterns with support no less than 0.5. We used this 
algorithm because it allows us to fix the gap between 
events in a sequential pattern, a prerequisite for 
calculating the log odds ratios. The gap was fixed to 1 in 
the analysis. For simplicity, we only focused on 
sequential patterns containing two events. This step 
generated 81 frequent sequential patterns. 

2. For each student, compute the log odds ratio, variance, 
and the number of occurrences of each frequent 
sequential pattern. 

3. For each frequent sequential pattern, conduct the Q test 
and calculate the I? index, the average log odds ratio, and 
the average occurrence. As the Q test was run 81 times, 
we used the Benjamini-Yekutieli correction to control the 
false discovery rate [15]. 
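
The multiple-testing correction in step 3 can be sketched as follows. This is a generic implementation of Benjamini-Yekutieli adjusted p-values, not the authors' code; the function name and the three example p-values are hypothetical.

```python
def benjamini_yekutieli(p_values):
    """Benjamini-Yekutieli adjusted p-values, returned in the input order."""
    m = len(p_values)
    harmonic = sum(1.0 / i for i in range(1, m + 1))     # c(m) = 1 + 1/2 + ... + 1/m
    order = sorted(range(m), key=lambda i: p_values[i])  # indices by ascending p
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down so that each adjusted value is the
    # minimum over all ranks at or above its own (step-up procedure).
    for rank in range(m, 0, -1):
        idx = order[rank - 1]
        candidate = p_values[idx] * m * harmonic / rank
        running_min = min(running_min, candidate)
        adjusted[idx] = running_min
    return adjusted


raw_p = [0.01, 0.04, 0.03]  # hypothetical Q-test p-values
adjusted_p = benjamini_yekutieli(raw_p)
```

A pattern is then treated as having significant usage variation when its adjusted p-value is below the chosen level (e.g., 0.05). The harmonic factor c(m) makes this correction valid under arbitrary dependence between tests, at the cost of being more conservative than Benjamini-Hochberg.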


Note that we only applied our method to frequent sequential patterns 
because the variation of a sequential pattern across participants 
would be low if few participants used a pattern (i.e., it was 
infrequent). 


Figure 1. The LabBuddy learning environment. 


Figure 2. The I² indexes, average log odds ratios, and average occurrences of sequential patterns. 


3.3 Preliminary results 

Figure 2 visualizes the relationships among the I² indexes, average 
log odds ratios, and average occurrences of the 81 sequential 
patterns. There were moderate positive relationships between the 
I² index and the average log odds ratio (r = 0.57, p < 0.001) as well 
as the average occurrences (r = 0.36, p = 0.001). Nevertheless, 
ranking sequential patterns by their variation between students 
results in a different set of selected patterns than ranking them by 
their similarities (average log odds ratios and occurrences) between 
students. Some sequential patterns had few average occurrences 
(e.g., less than 5) or negative average log odds ratios while still 
being used differentially by students (the adjusted p of the Q test 
< 0.05). Some sequential patterns had relatively high average 
occurrences (e.g., larger than 10) or average log odds ratios (e.g., 
larger than 1) but were used consistently across students (the 
adjusted p of the Q test > 0.05). 


We investigated how I² might help us detect behavioral differences 
by looking more closely at two sequential patterns with distinct I² 
values: Submitting an intermedia answer → Submitting an intermedia 
answer and Requesting a hint → Requesting a hint. Both patterns 
had high average occurrences and log odds ratios (see 
Table 1). Submitting an intermedia answer → Submitting an 
intermedia answer had an I² of 0.75 (p < 0.001), indicating that 
students varied widely in their usage of this pattern. Further 
analysis showed that, in 9.54% of all pairs of students (309/3,240), 
the log odds ratio differed significantly between the two 
students. This means that among 10 randomly sampled pairs, on 
average, there was one pair in which the two students had 
significantly different probabilities of submitting two intermedia 
answers consecutively. In contrast, the usage of Requesting a hint → 
Requesting a hint was relatively consistent across students (I² = 
0.24, p = 0.52). Analyses showed that, in only 1.6% of pairs, the 
log odds ratio differed significantly between the two students. 


Table 1. The metrics of two sequential patterns 

Pattern   I²     p of the Q test   Log odds ratio   Occurrences 
RH.RH     0.24   0.52              3.76             21.51 
SI.SI     0.75   0.00              4.81             19.32 

Note. RH.RH: Requesting a hint → Requesting a hint. SI.SI: 
Submitting an intermedia answer → Submitting an intermedia 
answer. 


4. Discussion 

This study proposed a method for mining sequential patterns whose 
usage varies widely across students. We applied the method to 
a dataset of student action logs in a virtual experimental 
environment. The preliminary results suggest that ranking 
sequential patterns by their variation across students results in a 
different selection of patterns than ranking them by their 
similarities across students. Moreover, the results demonstrated how 
the proposed method could capture individual differences in 
sequential behavior patterns. The approach offers a way to understand 
individual differences in learning, which is critical in education. 


The next step is to examine whether the sequential patterns with 
high variation are related to students’ learning gains. Such an 
investigation would contribute to our understanding of how 
differences in sequential patterns may lead to differences in 
learning outcomes. The insights, in turn, would indicate 
how the learning environment might scaffold 
learners’ interactions by prompting 
sequential behavior patterns beneficial to learning and discouraging 
patterns harmful to learning. 


Our approach requires fixing the gap between the events of a 
sequential pattern. This requirement limits flexibility. For example, 
researchers may regard Submitting an intermedia answer → 
Requesting a hint → Submitting an intermedia answer as an 
instance of Submitting an intermedia answer → Submitting an 
intermedia answer, but fixing the gap to 1 excludes this possibility. 
On the other hand, if the gap is fixed to 2, Submitting an intermedia 
answer directly after Submitting an intermedia answer would not 
be regarded as an instance of Submitting an intermedia answer → 
Submitting an intermedia answer. The limitation is the same as the 
issue that the lag between the antecedent and consequent events 
must be fixed in a lag sequential analysis [8]. Addressing this issue 
is challenging but worthy of effort. 
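
The effect of the fixed gap can be illustrated with a short Python sketch. It is not the cSPADE implementation used in the study; it simply counts occurrences of a two-event pattern at a fixed positional gap. The event codes SI and RH and the function name are hypothetical.

```python
def count_pattern(events, first, second, gap=1):
    """Count positions i where events[i] == first and events[i + gap] == second."""
    return sum(
        1
        for i in range(len(events) - gap)
        if events[i] == first and events[i + gap] == second
    )


SI, RH = "SI", "RH"  # submitting an answer / requesting a hint (hypothetical codes)
interleaved = [SI, RH, SI]
consecutive = [SI, SI]

# With gap=1 only directly adjacent events match; with gap=2 only
# events separated by exactly one other event match.
for gap in (1, 2):
    print(gap, count_pattern(interleaved, SI, SI, gap),
          count_pattern(consecutive, SI, SI, gap))
```

Neither gap setting captures both sequences, which is the inflexibility described above.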

ABSTRACT 


We present a demonstration of REACT, a new Real-time 
Educational AI-powered Classroom Tool that employs EDM 
techniques for supporting the decision-making process of ed- 
ucators. REACT is a data-driven tool with a user-friendly 
graphical interface. It analyzes students’ performance data 
and provides context-based alerts as well as recommenda- 
tions to educators for course planning. Furthermore, it in- 
corporates model-agnostic explanations for bringing explain- 
ability and interpretability in the process of decision making. 
This paper demonstrates a use case scenario of our proposed 
tool using a real-world data set, and presents the design of 
its architecture and user-interface. This demonstration fo- 
cuses on the agglomerative clustering of students based on 
their performance (i.e., incorrect responses and hints used) 
during an in-class activity. This formation of clusters of 
students with similar strengths and weaknesses may help 
educators to improve their course planning by identifying 
at-risk students, forming study groups, or encouraging tu- 
toring between students of different strengths. 


Keywords 
Clustering, Decision-support, Educational tool, Explainabil- 
ity, Human-centered computing 


1. INTRODUCTION 


Instructors play a crucial role in educational institutions, 
where one of their main responsibilities is effective, high-quality 
teaching. To do so, they must stay updated on 
students’ responses, efforts, and outcomes in order to provide 
timely feedback that promotes students’ improvement 
[29]. One way to achieve this is to cluster 
students into groups based on characteristics such 
as their learning style preferences, academic performance, 
and behavioral interaction, which can be used to explore 
collaborative learning opportunities and identify at-risk students 
at an early stage [3]. This creates a need for tools 
that empower instructors to achieve these objectives in 
the classroom. To this end, the fields of Educational Data 
Mining (EDM) and Learning Analytics (LA) have emerged 
with the goal of understanding how educational data can benefit 
the science of learning [7]. One way to promote 
this understanding is to use AI-powered real-time visualizations. 
These visual displays summarize large amounts of 
data in a meaningful way, which supports human cognition and is 
therefore important for sense-making and decision-making [12]. 
Dashboards, which may contain various data indicators, are one 
example of such visual displays [20]. 


Furthermore, applications of Artificial Intelligence (AI) in 
the domain of education for predicting student performance, 
detecting undesirable student behavior, or providing feedback 
for supporting instructors and students, are becoming 
more common [3]. This creates a need for incorporating 
interpretability, explainability, and, ultimately, trustworthi- 
ness in AI for supporting human teaching and learning [39]. 
The simplest way to include explainability in AI is by us- 
ing model-agnostic explanations that consist of textual and 
visual explanations [4]. Interpretability can be achieved by 
including humans in the process of decision making (HitAI), 
i.e., decision power is given to the specialized professionals 
who utilize machines/tools as advisors [41]. 


This paper presents a demonstration of REACT, a Real-time 
Educational AI-powered Classroom Tool, which utilizes 
the principles of HitAI and model-agnostic explanations to 
support educators in their decision-making. REACT clusters 
students based on their responses during in-class activities 
and provides context-based recommendations for course 
planning. It also provides personalized feedback about individual 
students. REACT is a real-time, data-driven decision-support 
tool that incorporates explainability, interpretability, 
and portability. It presents different indicators of students, 
their learning processes, and learning contexts, as well as 
recommendations for increasing efficiency in course management. 
Based on the Learning Analytics Process model [38], 
REACT may directly help educators with awareness, reflection, 
and sense-making, while it can indirectly create impact 
and motivate them to take action. This functionality can support 
educators in making decisions about their course planning 
and instructional goal setting by consistently monitoring 
students’ activities (tests, quizzes, and exercises) to 
inspect their learning process. 

The remainder of this paper is structured as follows. Section 2 
presents the background of EDM and how clustering 
can be useful in this context. Section 3 describes the architecture 
of our tool and explains how clustering is utilized 
in REACT. Section 4 presents the design of 
the user interface and the details of the demonstration. Finally, 
Section 5 concludes the paper and discusses directions for future 
research. 


2. BACKGROUND 


EDM enhances the decision-making of teachers, students, 
and educational institutes by utilizing data mining techniques 
in an educational context. A variety of methods, 
including cluster analysis, outlier detection, text mining, 
recommendation systems, and visualizations, can be applied in 
the EDM domain [32]. Cluster analysis, or clustering, is the 
best-known unsupervised machine learning task [24]. 
In the context of EDM, identifying meaningful clusters of 
students can be useful in understanding their learning 
behaviors [22]. Clustering consists of four 
steps [40]: (1) Feature extraction and selection: relevant 
features are selected from the data and transformed into an 
appropriate format. (2) Algorithm design: a suitable clustering 
algorithm and (dis)similarity measure are selected. 
(3) Evaluation: different clustering results can be evaluated 
using metrics such as external, internal, and/or 
relative indices. (4) Explanation: the main purpose of clustering 
is to generate knowledge that is useful for decision-making. 
This knowledge is conveyed to the user in different ways, such as 
visualizations, textual feedback, or statistical metrics. 


Hierarchical clustering organizes the data points in a tax- 
onomy tree of clusters and sub-clusters [37]. Thus, it is 
suitable for detecting clusters of arbitrary shape, type and 
hierarchical relationships [40]. Hierarchical clustering has 
been shown to provide good results for small datasets [1], 
which is useful for typical class sizes. Furthermore, the en- 
tire clustering process can be visualized by plotting a den- 
drogram, which shows the cluster-subcluster relationships, 
similarity between clusters, and the order in which they are 
merged [37]. This results in an informative visualization 
of the data clustering structures [40], fulfilling the goal of 
explainability. There are two basic approaches to hierarchical 
clustering: agglomerative and divisive [25]. Agglomerative 
hierarchical clustering is a bottom-up approach 
that starts with each point as an individual cluster and 
merges the closest pair of clusters at each step. The divisive 
method is a top-down approach that starts with all points 
in one large cluster and progressively divides them. The divisive 
method is computationally expensive and not commonly used [40]. 
Thus, agglomerative hierarchical clustering is a suitable 
choice for implementing cluster analysis in REACT. 


3. REACT ARCHITECTURE 

REACT is developed using the R Shiny framework, which 
incorporates the principles of reactive programming [6] that 
are suitable for interactive applications. REACT is portable 
in the sense that it can be connected to any Learning Management 
System (LMS), like Moodle or Blackboard, as well as 
different database management systems, including MySQL, 
Oracle, Salesforce, etc. This can be achieved by using different 
packages such as DBI (for databases), and bRush and 
rcanvas (for the Canvas LMS), which are available in R. Additionally, 
many other LMSs offer REST APIs, which can be 
connected with REACT using the httr and jsonlite packages. 

Figure 1: Architecture of REACT 


The architecture of REACT is motivated by RAED 
and is shown in Figure 1. It consists of five main components: 
the Dashboard Engine, the Machine Learning (ML) 
Component, the Context Engine, the Contextualized Rec- 
ommendation & Alert Engine, and the Visualization Com- 
ponent. Due to space limitations, we focus on the compo- 
nent that implements clustering as an EDM technique, i.e., 
the ML component. 


The ML component receives input from a reactive data frame 
that contains the input features of each student and initiates 
the clustering process by first calculating all the pairwise dis- 
tances (i.e., dissimilarities) of students. We use the Gower 
distance as the dissimilarity metric for the clustering, as 
it can be applied to mixed data (i.e., a mix of numerical and 
categorical variables) in general [33]. However, for the purposes 
of this demonstration, the input features of each 
student are the number of incorrect responses and the number 
of hints used per learning concept. For these numerical features, 
Gower uses the Manhattan distance to calculate dissimilarity. 
The Dissimilarity Matrix sub-component calculates 
the pairwise distances between all n observations (i.e., students) 
in the data set, organized in an n × n matrix, using the 
daisy() R function. This dissimilarity matrix then becomes 
the input of the Hierarchical Clustering sub-component. 
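
For readers unfamiliar with the Gower distance, the following Python sketch shows how it reduces to a range-normalized Manhattan distance when all features are numeric. It is an illustrative analogue, not the daisy() implementation that REACT uses; the function name, the handling of constant features, and the toy data are assumptions.

```python
def gower_numeric(rows):
    """Pairwise Gower dissimilarities for purely numeric feature rows.

    Each feature contributes |x_i - x_j| divided by that feature's range,
    and the contributions are averaged over all features. Assumption:
    a constant feature (range 0) contributes 0.
    """
    n, p = len(rows), len(rows[0])
    ranges = [
        max(r[k] for r in rows) - min(r[k] for r in rows) for k in range(p)
    ]
    dist = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1, n):
            total = sum(
                abs(rows[i][k] - rows[j][k]) / ranges[k]
                for k in range(p)
                if ranges[k] > 0
            )
            dist[i][j] = dist[j][i] = total / p
    return dist


# Hypothetical students: (incorrect responses, hints used) for one concept.
students = [[0, 0], [2, 4], [4, 2]]
dissimilarity = gower_numeric(students)
```

The result is a symmetric n × n matrix with zeros on the diagonal, which is the form a hierarchical clustering routine expects as input.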


The Hierarchical Clustering sub-component trains four different 
hierarchical clustering models using the same dissimilarity 
matrix. These models are based on four different 
linkage methods: Single linkage, Average linkage, Complete 
linkage, and Ward’s method. The R function agnes() is used 
for building these models and computing their agglomerative 
coefficients. 

Figure 2: The AI tab provides real-time insights of clustering 
to instructors with visual explanation (dendrogram - top left) and 
textual template-based recommendations (top right) 


The Model Selection sub-component ensures robustness and 
acts as an internal index for evaluations. It compares the 
four clustering results based on their agglomerative coefficients. 
Their values lie between 0 and 1 and describe the 
strength of the corresponding clustering structure [21]. This 
sub-component selects the model with the highest agglom- 
erative coefficient. 
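
The model selection logic can be sketched in Python as follows. This is a from-scratch analogue for illustration only: REACT itself relies on R's agnes(), and Ward's method is omitted here for brevity. The agglomerative coefficient is computed as the average of 1 − m(i), where m(i) is the height at which observation i is first merged divided by the height of the final merge, which is our reading of the definition in the documentation of the R cluster package. All function names and the toy data are hypothetical.

```python
def cluster_distance(dist, a, b, linkage):
    """Dissimilarity between clusters a and b (lists of row indices)."""
    pairs = [dist[i][j] for i in a for j in b]
    if linkage == "single":
        return min(pairs)
    if linkage == "complete":
        return max(pairs)
    if linkage == "average":
        return sum(pairs) / len(pairs)
    raise ValueError(f"unsupported linkage: {linkage}")


def agglomerative_coefficient(dist, linkage):
    """Run naive agglomerative clustering and return its coefficient."""
    n = len(dist)
    clusters = {i: [i] for i in range(n)}
    first_merge = [0.0] * n  # height at which each observation first merges
    height = 0.0
    while len(clusters) > 1:
        keys = sorted(clusters)
        # Find the closest pair of clusters under the chosen linkage.
        height, a, b = min(
            (
                (cluster_distance(dist, clusters[x], clusters[y], linkage), x, y)
                for idx, x in enumerate(keys)
                for y in keys[idx + 1:]
            ),
            key=lambda t: t[0],
        )
        for c in (a, b):
            if len(clusters[c]) == 1:
                first_merge[clusters[c][0]] = height
        clusters[a] = clusters[a] + clusters.pop(b)
    if height == 0:
        return 0.0
    # After the loop, `height` is the height of the final merge.
    return sum(1 - m / height for m in first_merge) / n


# Toy data: two tight, well-separated groups on a line.
points = [0, 1, 10, 11]
dist = [[abs(p - q) for q in points] for p in points]
scores = {
    m: agglomerative_coefficient(dist, m)
    for m in ("single", "average", "complete")
}
best = max(scores, key=scores.get)
```

The pairwise search makes this sketch far slower than a production routine, but for class-sized data sets it is enough to show why the linkage with the highest coefficient is taken as the one with the strongest clustering structure.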


The Dendrogram sub-component creates a visualization of 
the hierarchy of clusters and sub-clusters that are the result 
of the selected model. This visualized hierarchy is called 
a dendrogram. The dendrogram provides a diagrammatic 
representation of the hierarchical cluster analysis. It can 
help users understand the clustering process, which in turn may 
support explainability. An example of a dendrogram is 
shown in Figure 2 and discussed in the next section. 


4. DESIGN AND DEMO 


A dashboard can be defined as “an easy to read, often single 
page, real-time user interface, showing a graphical presen- 
tation of the current status (snapshot) and historical trends 
of an organization’s key performance indicators (KPIs) to 
enable instantaneous and informed decisions to be made at 
a glance” fra]. It is also common for decision-makers to use 
KPIs for understanding the performance or the deviation 
from the set target at a glance [28]. Thus, the user interface 
of REACT is designed as an interactive dashboard, displaying 
KPIs to help teachers monitor and understand their students’ 
learning performance. These KPIs include the minimum, 
maximum, median, and mean scores of the class, as 
well as the number of students who have completed all the 
questions of the in-class activity thus far. 


4.1 Design Elements 

A dashboard’s visual appeal significantly affects its perceived 
usefulness and its potential to change users’ 
behavior [27]. The design choices of REACT were made 
with this consideration in mind. The selection of visualizations 
is based on the chart suggestions provided by Abela 
and a review provided by Schwendimann et al. [34]. Further, 
its color palettes were selected to be colorblind-friendly 
[17]. REACT contains interactive visualizations and 
tables. Interactive applications need to be 
easy to learn, effective, and enjoyable to use [30]. 
To this end, we aimed to follow the ‘golden rules’ of interface 
design proposed by Shneiderman et al.: strive 
for consistency, permit easy reversal of actions, keep users 
in control, and reduce their short-term memory load. 


The user interface of REACT currently has five tabs: 
Overview — presents the KPIs, an interactive plot for mon- 
itoring students’ performance, and textual alerts & recom- 
mendations for the instructors. 


Quick Analysis — presents an interactive plot that monitors 
the progress of students in real time, and bar charts that 
count the incorrect responses, and hints used for each KC. 


Scorecard — displays a histogram of the score distribution, 
and a dynamic table with students’ information and scores, 
both updated in real-time. 
Figure 3: Creating a real-time demo of REACT 


AI — provides insights of the cluster analysis and textual 
template-based recommendations. Figure 2 shows a screenshot 
of this tab. An instructor may use this tab to see on 
the dendrogram (top left) how different groups (i.e., clus- 
ters) of students are formed, based on their performance in 
in-class activities. Textual explanations about each of these 
groups are provided on the top right. Finally, to enhance 
interpretability, each of the different clusters is also visually 
explored at the bottom of the tab. The counts of incor- 
rect answers and the counts of hints used are displayed per 
Knowledge Component (KC) for each group of students. 


Public Health — provides context based on the COVID-19 
outbreak. It displays infection rates in the surrounding 
counties and counts of students who may live in high risk ar- 
eas, to inform educators, who may opt to transition online. 


4.2 Demo 

Holstein et al. note the importance of using real-world 
datasets to understand the behavior of LA tools. We use 
the 2009-2010 Skill-builder ASSISTments data set [16]. The 
raw data consists of more than 100,000 rows representing 
details of 4217 students and 111 Knowledge Components 
(KCs). To achieve the objective of this demonstration, we 
randomly selected a sample of 20 students. For privacy reasons, 
this data set includes only pseudo-ids. In real-world uses of 
REACT, authenticated instructors will be able to see students’ 
names, as memorising their ids would be troublesome. 
Our approach to creating a demonstration of REACT is shown 
in Figure 3 and can be summarized in the following steps: 


• Step 1 (Filter): We selected 20 students and two questions 
from five KCs from the topic of statistics (Mean, 
Circle Graph, Venn Diagram, Box and Whisker Plot, 
and Scatter Plot). This filtered data set is first stored 
in a spreadsheet on a local hard disk. 


• Step 2 (Stream): The filtered data from Step 1 are then 
streamed to a Google sheet that acts as a database for 
this demonstration. It is connected to REACT using 
the googlesheets4 package. 


• Step 3 (Use): REACT receives live updates from the 
streaming data. These concern hints that each student 
uses during the simulated in-class activity, as well as if 
they provided a correct or incorrect response to each 
question, as time progresses. These data are processed 
on the fly and used to update the visualisations, alerts, 
and recommendations displayed on the user interface. 
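
The kind of on-the-fly processing described in Step 3 might look like the following Python sketch, which folds streamed response events into per-student, per-KC counters of incorrect responses and hints used. REACT itself is written in R with Shiny; the event fields, identifiers, and function name below are hypothetical and are not taken from the ASSISTments schema.

```python
from collections import defaultdict


def update_features(features, event):
    """Fold one streamed response event into the running counters."""
    key = (event["student"], event["kc"])
    features[key]["hints"] += event["hints_used"]
    if not event["correct"]:
        features[key]["incorrect"] += 1
    return features


# Counters keyed by (student pseudo-id, knowledge component).
features = defaultdict(lambda: {"incorrect": 0, "hints": 0})

# Hypothetical events arriving over time during an in-class activity.
stream = [
    {"student": "s01", "kc": "Mean", "correct": False, "hints_used": 2},
    {"student": "s01", "kc": "Mean", "correct": True, "hints_used": 1},
    {"student": "s02", "kc": "Venn Diagram", "correct": False, "hints_used": 0},
]
for event in stream:
    update_features(features, event)
```

Counters of this shape correspond to the input features used for clustering in this demonstration.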


A live version of REACT is deployed using the Shiny Server 
and it can be accessed using a web browser on any desktop, 
laptop, tablet, or smartphone. 


https://googlesheets4.tidyverse.org 

5. CONCLUSIONS AND FUTURE WORK 


We presented REACT, a data-driven, visual, decision-support 
tool that incorporates model-agnostic explanations. This 
paper provides details on demonstrating a use-case scenario 
by utilizing the ASSISTments dataset. Our next step is 
to evaluate our proposed tool with the help of domain ex- 
perts, using a combined approach of think-aloud testing and 
questionnaires. This approach will help us to understand 
the usability and user experience while interacting with RE- 
ACT. The results from this combined approach can help us 
to identify directions for improvement in the interface design 
and to propose the addition of new features. In the 
future, we aim to answer how the integration of AI and visualizations 
in real time can impact the instructors’ decision-making 
process, and to what extent they trust it. The 
answers to these questions will play a crucial role in mak- 
ing REACT a deployable tool that can enhance data-driven 
decision-making in education. 

ABSTRACT 


This research presents a process for simplifying video la- 
beling and feature generation when building classification 
systems from real classrooms. Using video from a single, 
wide-angle recording of a live classroom, we create a low- 
level feature set of posture primitives built on keypoints from 
OpenPose. We use that feature set to build a posture recog- 
nition model of “natural labels” built from a scripted posture 
video using the same classroom. This model provides auto- 
matic labels for the real classroom data. We then derive a set 
of interpretable descriptors to characterize student-specific 
posture pattern dynamics. We show that those descriptors 
are able to discriminate between subtle differences in learn- 
ing activities in a real college classroom. 


Keywords 
classroom analytics, posture analysis, student activity recog- 
nition 


1. INTRODUCTION 


The field of research into classroom sensing technologies and 
data mining is growing. One goal of this work is to provide 
automated feedback to instructors about anything from la- 
tent states of the students to overt actions by the teacher 
[15]. The motivation for this work is usually to empower 
teachers and scaffold instructional development without al- 
ways relying on human consultants [13]. 


The promise of this field is high, but so are the costs. Tech- 
nical staff and software development are all expensive. Addi- 
tionally, labeling video data in order to derive insights about 
student interactions is particularly time-consuming and dif- 
ficult. This study describes an attempt to reduce that cost. 
We used a freely available posture analysis tool (OpenPose) 
to produce keypoint data for human postures which we then 
used to build a generic set of labels for a class of students. 
Our goal was to simplify both the application and inter- 
pretability of data labels. 

2. RELATED WORK 


Emerging technologies for sensing pedagogical events in live 
classrooms include the detection of overt student behaviors 
(e.g., hand raising and gaze direction [1]), latent states (at- 
tention and engagement [16, 12]), and instructor actions 
(e.g., questions, activity sequences, gestures, and physical 
location in the room [3, 7, 11, 14]). Each approach has its 
own trade-offs in terms of reliability and effort required, but 
the models all require a dictionary of human-labeled body 
postures. Data annotation is time-consuming work requir- 
ing special expertise. For example, one must choose between 
coding in real-time [9, 10] or post-hoc [12, 16], and whether 
or not to use assisted label production [17]. 


Feature generation is a related but different concern. Ed- 
ucation researchers may want to build models on compre- 
hensible features, such as the words used during teachers’ 
questions [3, 14], or the gestures students and teachers ex- 
hibit during interactions [5, 4, 7, 12, 16]. While it is possible 
to use a “kitchen sink” approach to quickly assess the suc- 
cess of an algorithm and its inputs, education researchers 
may prefer to use features that can be observed and under- 
stood by the end-user. This way the instructors using their 
systems might be able to make changes based on the model 
output, e.g., [2]. 


3. MOTIVATION 


High-quality video cameras are ubiquitous. Researchers in 
education and machine learning can quickly generate large 
volumes of dynamic, rich data from the classroom. When 
turning video into data, education researchers traditionally 
code classroom videos using any number of methodologies 
[8]. Each approach for annotating and interpreting video 
data takes a significant amount of training and time. 


To this end we present the following case study in which we 
demonstrate a pipeline that requires only minimal resource 
investment on the part of the experimenter, including the 
time it would normally take to define, identify, and verify 
student gestures. We propose that these savings are possible 
without sacrificing the interpretability of a human-coded 
feature set. To test the pipeline, we designed an easy and 
accessible feature generation strategy which we then tested 
against the most difficult in-class dataset we could imagine. 


Figure 1 illustrates our workflow, broken into the following 
stages: (1) We collect a video recording (scripted posture 
data, section 4.2) with synchronized, scripted posture patterns 
commonly observed in classrooms; (2) We fit supervised 
machine learning models to automatically recognize 
the scripted posture patterns (section 5.1); (3) Using video 
from a real classroom (section 4.1), we demonstrate the utility 
of those auto-estimated posture patterns by applying the 
posture detection models to the real classroom dataset to 
discriminate between class conditions (section 5.2). 

Figure 1: Workflow 


4. DATA SOURCES & AUTO-LABELING 


Here we describe our data collection and analysis. We con- 
ducted the study at Carnegie Mellon University in the Spring 
semester of 2019. We generated model data from a group 
of volunteers, and target data from a class of real students. 
All students signed IRB-approved consent forms. 


4.1 Real Classroom Data 

In selecting a use-case for our approach we chose a class 
that embodied a traditional lecture-based class. We worked 
with a semester-long graduate level course on “Applied Data 
Science” (ADS) in Spring 2019. There were 22 enrolled stu- 
dents, all of whom could fit in a single frame of a wide-angle 
camera (Marshall CV505). The camera faced the rows of 
students, and the instructor was not in frame. We recorded 
the twice-weekly, 75-minute sessions throughout the entire 
semester. We collected 22 sessions for a total of about 30 hours 
of class time. 


The format of the class was almost completely dominated 
by professor lecture. Halfway through the semester the stu- 
dents were put into groups for their final projects. During 
that second half of the semester, student groups took turns 
giving short presentations throughout the second session of 
each week. We used this naturally occurring difference in 
class format to inspire our classification problem. We thus 
generated two main class conditions: those led by the pro- 
fessor (i.e. Professor-Lead), and those led by groups of stu- 
dents giving project presentations (i.e. Peer-Lead). In the 
Professor-Lead condition (16 sessions), students listened to 
the professor lecture and were permitted but never required 
to ask questions. In the Peer-Lead condition (6 sessions), 
students listened to groups of peers take turns giving a short 
presentation describing their progress on an ongoing class 
project. After each presentation, all students were allowed 
to ask questions, and a random selection of students were 
required to ask questions for participation points. 


Our goal was to model generic student posture patterns 
as descriptors to discriminate between Professor-Lead and 
Peer-Lead. From a naive perspective, the postures of the students 
in each condition were virtually indistinguishable. We 
chose this objectively difficult classification problem in order 
to stress-test our approach. Our proposition was that students 
in the two conditions might have different internal states 
related to their expectation of learning useful information 
(Professor-Lead) vs. their potential requirement to ask a 
question (Peer-Lead), but that we would not have predictions 
about which gestures might reveal those latent states. 
This is a “good enough” test of our goal of building a practical 
process that could eventually be of use 
to researchers who are likely to test less fuzzy classification 
problems. 

Figure 2: A snapshot of the scripted posture video in which 
volunteers were performing the scripted action Checking Phone 


4.2 Generic Student Posture Descriptors 

To address our goal of helping researchers create descrip- 
tors without deep, costly annotation, we designed an ap- 
proach that would create a catalogue of possible postures 
students exhibit in a typical class. We began by creating a 
7-minute video of 11 volunteers arranged in the same seats 
as the students from the ADS class. Using the same equip- 
ment as would be used in the real class, we led the volun- 
teers through a series of scripted movements. The “Scripted 
Posture Video” section of our workflow (Figures 1 and 2) 
comprises these data. The volunteers did not know what 
the prompts would be in advance, and their behaviors ap- 
peared natural. We guided them through 13 generic posture 
patterns: Checking Phone, Looking at Computer, Looking 
Down, Looking Up, Looking at front-left, Looking at front- 
right, Looking Left, Looking Right, Performing Q&A, Talk- 
ing to neighbor, Raising hand (left and right) and Writing. 
We chose this list as a comprehensive representation of ob- 
servable posture patterns in real classrooms when students 
listen to lectures. There are additional gestures we could 
have included, such as sleeping, eating, or drinking. However, 
these seemed outside our scope, which was to train on only the 
most frequent and probable behaviors rather than to 
include every conceivable movement that might exist. 


Our next step was to produce an underlying set of low-level 
features for defining these generic posture patterns. Table 1 
is a partial list of the 24 frame-by-frame features we created 
from OpenPose keypoints [6]. OpenPose is a freely available 
toolkit for identifying physical landmarks, or “keypoints,” 
on human figures in a picture, as shown in Figure 3 (left). 
Each keypoint is part of a 2-D array of real-valued numbers 
(Figure 3, right plot). In this analysis we only use upper 
body keypoints, including head, neck, and arms. 


https://github.com/CMU-Perceptual-Computing-Lab/openpose 

Feature Name          Description 
neck_nose             neck nose distance 
Lshoulder_nose        left shoulder nose distance 
Rshoulder_nose        right shoulder nose distance 
rHand_nose            right hand nose distance 
lHand_nose            left hand nose distance 
nose_neck_h           nose neck horizontal displacement 
nose_neck_v           nose neck vertical displacement 
nose_shoulder_angle   nose neck angle w.r.t shoulder 
rElbow_angle          right elbow angle 
lElbow_angle          left elbow angle 

Table 1: A partial list of low-level features used in the posture 
recognition machine learning model 


Figure 3: An example of OpenPose toolkit keypoints for a 
given frame of real classroom data (left); and the upper body 
keypoints used in our analysis (right) 


We designed the features from Table 1 based on our pro- 
fessional experience performing student observation and an 
analysis of how groups of keypoints move together to pro-
duce gross postures. For example, the features rHand_nose
and lHand_nose each measure the vertical distances between
the nose and the hand. These distances can indicate verti- 
cal hand movements, e.g., as seen in hand-raising. “Nose- 
neck” related features (e.g. nose_neck_h or nose_neck_v) can 
indicate left/right head movements. With these low-level 
features in hand, we then applied them to the scripted pos-
ture data and trained a random forest classifier for recognizing
posture patterns. We compiled a training set with each data 
point representing a person-frame pair and labeled each data 
point with labels naturally available from scripted posture 
data. We then fit several independent binary classifiers, each 
predicting the binary label of whether a given posture pat- 
tern occurred. 
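

As a rough illustration of how frame-level features like those in Table 1 might be computed, the following minimal sketch derives a few of them from 2-D keypoints. The keypoint names, the coordinate values, the use of the wrist as a stand-in for the hand, and the choice of Euclidean distance are all assumptions made for this example; they are not taken from the OpenPose output format or from our implementation.

```python
import math

# Hypothetical upper-body keypoints for one person in one frame, given as
# (x, y) pixel coordinates. Names and values are illustrative only.
keypoints = {
    "nose": (100.0, 50.0),
    "neck": (100.0, 90.0),
    "r_shoulder": (70.0, 95.0),
    "r_elbow": (60.0, 140.0),
    "r_wrist": (65.0, 60.0),
}


def distance(a, b):
    """Euclidean distance between two 2-D points."""
    return math.hypot(a[0] - b[0], a[1] - b[1])


def joint_angle(a, b, c):
    """Angle in degrees at point b, formed by the segments b->a and b->c."""
    v1 = (a[0] - b[0], a[1] - b[1])
    v2 = (c[0] - b[0], c[1] - b[1])
    norm = math.hypot(*v1) * math.hypot(*v2)
    if norm == 0:
        return 0.0  # degenerate case: two keypoints coincide
    cosine = (v1[0] * v2[0] + v1[1] * v2[1]) / norm
    # Clamp to [-1, 1] to guard against floating-point drift.
    return math.degrees(math.acos(max(-1.0, min(1.0, cosine))))


def frame_features(kp):
    """Compute a small subset of Table 1 style features for one person-frame."""
    nose, neck = kp["nose"], kp["neck"]
    return {
        "neck_nose": distance(neck, nose),
        "nose_neck_h": nose[0] - neck[0],  # horizontal displacement
        "nose_neck_v": nose[1] - neck[1],  # vertical displacement
        "rHand_nose": distance(kp["r_wrist"], nose),
        "rElbow_angle": joint_angle(kp["r_shoulder"], kp["r_elbow"], kp["r_wrist"]),
    }


features = frame_features(keypoints)
```

Each person-frame pair would yield one such feature dictionary, which could then serve as one row of the training set for the per-posture binary classifiers.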


Our hope was that by having 11 different people perform the 
scripted movements in their own unique fashion, the model 
would be exposed to a sufficient amount of variability—such 
as one would expect to see in the real world. We worked
from the assumption that this would at least reduce the need 
for building and applying a precise annotation manual. This 
allowed us to quickly compile a posture-recognition model. 


5. RESULTS 


In this section, we present the frame-by-frame posture recog- 
nition model (section 5.1) using our generic behavior labels 
from the scripted posture dataset (4.2). We then applied 
that model to the ADS dataset (4.1), automatically labeling 
the posture patterns frame-by-frame. We derived descriptors
from each 5-min segment of classroom video based on those
machine labels, and built a classifier to discriminate between
the two class conditions. The Peer-Lead segments of video
did not include periods of question-asking after student pre- 
sentations. This was meant to maximize surface similarity 
between the two conditions and provide a challenging test. 


5.1 Posture Patterns Recognition 

Table 2 summarizes the Area Under the Curve (AUC) scores
for binary classifiers predicting whether a given posture pat- 
tern occurred in the appropriate position. An AUC rating of 
0.50 is equivalent to chance, which means that check-phone, 
for example, does not have a reliable posture pattern as a 
composition of its keypoint structures. However, the model 
is able to identify head movement in left, right and up direc- 
tions somewhat more reliably than other types of subtle head 
movements, such as look-down, look-front-left or look-front- 
right. Hand-raising postures and writing are also found to be 
relatively easier to identify. Similar to check-phone, actions 
without clear movement patterns exhibit low performance, 
i.e., look-at-computer, Q&A, and talking-to-partner.


Posture Pattern     AUC  | Posture Pattern     AUC
check-phone         0.51 | look-up             0.80
look-at-computer    0.63 | look-left           0.84
look-down           0.53 | look-right          0.88
look-front-left     0.67 | writing             0.81
look-front-right    0.75 | raise-left-hand     0.81
QandA               0.63 | raise-right-hand    0.84
talk-to-partner     0.60 |


Table 2: Area Under Curve (AUC) scores from binary classi-
fiers each predicting whether or not a given posture pattern
has occurred, from a leave-one-person-out cross-validation ex-
periment.
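

For readers less familiar with the metric, the sketch below shows one standard way to compute an AUC: as the fraction of (positive, negative) pairs in which the positive example receives the higher score, with ties counted as one half. The function name and the toy labels and scores are hypothetical and are not drawn from our data.

```python
def auc(labels, scores):
    """Probability that a random positive outscores a random negative.

    Ties count as 0.5, so a classifier with constant scores gets 0.5 (chance).
    """
    positives = [s for y, s in zip(labels, scores) if y == 1]
    negatives = [s for y, s in zip(labels, scores) if y == 0]
    if not positives or not negatives:
        raise ValueError("AUC needs at least one positive and one negative example")
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(positives) * len(negatives))


# Toy example: three of the four positive/negative pairs are ordered correctly.
toy_auc = auc([1, 0, 1, 0], [0.9, 0.8, 0.3, 0.1])
```

Under this reading, a value near 0.50 means the scores order positive and negative frames no better than a coin flip.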


5.2 Discriminating Class Conditions 

In this section we report the results from testing the hy- 
pothesis that there are discernible differences in students’ 
posture patterns between the Professor-Lead and Peer-Lead 
class conditions. To answer this question, we formulated a 
machine learning classification task in which we used the 
descriptors derived from videos of the ADS class as input 
(right part of Figure 1). For output labels we used the class 
conditions. In creating the training dataset, we extracted a 
series of non-overlapping 5-minute segments from the real 
classroom videos and computed a list of statistics based 
on predicted student-by-student, frame-by-frame probabil- 
ities of posture patterns from the generic behavior model 
described in section 4.2. For each 5-minute video segment 
we derived five statistical values (mean, standard deviation, 
min, max, and median) summarizing the predicted probabil-
ities of each of the 13 posture patterns. As a result, we have
a training dataset with 65 features (13 posture patterns by
5 statistics) with each row representing a student-segment
pair. We used a random forest to fit the model. For compari-
son, we also derived an independent set of low-level features
using only the keypoint structures described in Table 1. 
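

The descriptor construction described above can be sketched as follows. This is a minimal illustration under stated assumptions: the function name, the input layout (a mapping from posture name to per-frame probabilities for one student in one segment), the toy values, and the use of the population standard deviation are choices made for the example, not details of our implementation.

```python
import statistics


def segment_descriptors(frame_probs):
    """Summarize per-frame posture probabilities for one student-segment pair.

    frame_probs maps a posture name to the list of predicted probabilities,
    one per frame, within a single 5-minute segment.
    """
    row = {}
    for posture, probs in frame_probs.items():
        row[f"{posture}_mean"] = statistics.fmean(probs)
        # Population SD is an assumption; a sample SD would work similarly.
        row[f"{posture}_sd"] = statistics.pstdev(probs)
        row[f"{posture}_min"] = min(probs)
        row[f"{posture}_max"] = max(probs)
        row[f"{posture}_median"] = statistics.median(probs)
    return row


# Toy input with two of the 13 posture patterns and four frames each.
toy_row = segment_descriptors({
    "look-left": [0.1, 0.2, 0.9, 0.8],
    "look-right": [0.3, 0.3, 0.3, 0.3],
})
```

With all 13 posture patterns as keys, each row would contain 13 x 5 = 65 descriptors, matching the feature count given above.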


We conducted two types of cross-validation experiments: 
random split and leave-one-session-out. In random split 
mode, the training and test datasets were constructed by 
random selection from the pool of 5-minute video segments, 
irrespective of the class sessions to which they belonged.
This design can yield relatively optimistic performance be- 
cause of the likelihood that the segments from the same ses- 
sion can appear both in training and testing. In the second 
experiment, the split is based on class session, which results 
in a more conservative measurement of discrimination. 
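

The difference between the two designs can be made concrete with a small sketch. The record layout (dictionaries with a "session" key), the function names, and the test fraction are hypothetical; the sketch only shows how the splits are formed, not how the classifier is trained.

```python
import random


def random_split(segments, test_fraction=0.25, seed=0):
    """Split segments at random, ignoring which session they came from."""
    shuffled = segments[:]
    random.Random(seed).shuffle(shuffled)
    n_test = max(1, round(len(shuffled) * test_fraction))
    return shuffled[n_test:], shuffled[:n_test]


def leave_one_session_out(segments):
    """Yield (held_out_session, train, test), holding out one whole session."""
    sessions = sorted({seg["session"] for seg in segments})
    for held_out in sessions:
        train = [seg for seg in segments if seg["session"] != held_out]
        test = [seg for seg in segments if seg["session"] == held_out]
        yield held_out, train, test


# Toy pool: three sessions with four 5-minute segments each.
segments = [{"session": s, "segment": i} for s in ("A", "B", "C") for i in range(4)]
train, test = random_split(segments)
folds = list(leave_one_session_out(segments))
```

In the random split, segments from one session may land on both sides of the split, whereas each leave-one-session-out fold keeps the held-out session entirely out of training.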


Table 3 shows the AUCs for model discrimination between
Professor-Lead and Peer-Lead using two different sets of input
variables and under two different experimental conditions.
The AUC for each experimental design is above the chance
level of 0.5. Specifically, we note that AUC decreases when
using model-based descriptors compared to using a low-level,
less interpretable feature set derived directly from the key-
points. The drop in AUC from random split to leave-one-
session-out cross-validation suggests that certain predictive
features are session-specific and therefore make it difficult
to predict labels for an unseen session. In future work, it will
be of interest to identify those session-specific features and
investigate their roles in predicting class conditions.


Input Variables                    Random Split | Leave-One-Session-Out
Posture-Model-based descriptors    0.72         | 0.64
Keypoint-based features            0.82         | 0.68


Table 3: Area Under Curve (AUC) scores for experiments 
to discriminate between Professor-Lead and Peer-Lead 
class conditions, comparing random split and leave-one- 
session-out cross-validation designs. 


In order to understand the features that contribute to dis- 
criminating between the class conditions, we reviewed the 
feature importance from the random forest model that used 
interpretable posture-model-based descriptors as well as the 
low-level keypoint features. Figure 4 shows a selection of 
important and unimportant features from each approach. 
As noted in the upper portion of Figure 4, the most impor- 
tant input variables in the posture-based model are those 
describing the variation of left and right head movements. 
Other posture patterns, such as looking to the front, rais- 
ing hands, and looking down did not play an important role 
in the model. The bottom portion of the figure shows that 
some of the important features were the distances between 
students’ eyes and the angles between their nose and shoul- 
ders. Some of the less important keypoint structures in- 
cluded the relative angles of the left shoulder and elbow, as 
well as the distance between the left shoulder and the nose. 


6. DISCUSSION 


In this project we explored methods for extracting posture- 
related descriptors from videos of students in a real class- 
room. We derived the posture labeling model from a video 
of volunteers following scripted prompts. We extracted key- 
point data from the videos using OpenPose, a freely avail- 
able general purpose posture keypoints detection tool. We 
then showed that this method of automatic labeling could 
distinguish between two highly similar class conditions. 


Large body movements such as hand raising and left/right 
head shifting were the easiest for the model to detect, and 
the most important descriptors in the posture model. In 
terms of using labels that are easy to interpret, these types 


[Figure 4 chart: importance by metric (MAX, MEAN, MEDIAN, MIN, SD) for the posture descriptors down, front, hand, left, and right, and importance of the low-level features D_eye, nose_shoulder_angle, nose_shoulder_proj, nose_neck_h, lShoulder_angle, lShoulder_nose, and lElbow_angle]


Figure 4: A selection of posture-based descriptors (white) and 
low-level features (gray) and their importance in a Random 
Forest model for discriminating between class conditions. 


of movements seem like a promising start. Without trying to 
interpret those movements at this time, they were at least 
important to the posture model. It may be the case that
simply informing an instructor about these movements could 
be a productive starting point for reflection. 


Given that there were a number of features that did not 
contribute to the models, and that the raw keypoint model 
performed better than the derived model, we note that there 
is a trade-off between the accuracy of this approach on the
one hand, and its interpretability and transferability on the 
other. When we look at the variance in Figure 4, we see indi- 
cations that the importance of some features (and the lack of 
importance of others) is more interpretable than the power 
of different keypoint angles and vectors. These higher level 
features say something about what students do differently 
in different scenarios. Our point here is not to deduce what 
those meanings are, but to show some of the student behav- 
iors that are worth noticing. In terms of transferability, the 
fact that we built these labels from a 7-minute session of 
non-student volunteers shows that this approach may have 
some potential as a one-to-many label generation method, 
at least when the volunteers use the same classroom as the 
target students. 


Finally, we propose that the pipeline we explored in this
project, from feature generation to auto-labeling and from
data preprocessing to feature extraction, can be gener-
alized to other teaching and learning scenarios in physical
classrooms. For researchers in this space, i.e., developing 
classroom-based technologies for sensing behavior and pro- 
viding automated feedback, our study may help simplify and 
accelerate their work by simplifying annotation and antici- 
pating features that the end-user can understand. 


ABSTRACT


Cross-validation is a widespread approach to understanding how well
a prediction model performs on unseen data. While this is the
state of the art, machine learning is often used for educational
purposes in educational data mining. Whether a system is
applicable and generalizable in practical settings is judged by the
cross-validation accuracy. One major problem is that the quality of
annotated data often suffers because different raters score the same
tasks differently, even if they were trained beforehand. In this paper, we
ran an experiment in which 1,200 texts of three difficulty levels in an
open writing task for language learning were scored by two tutors
independently to obtain the inter-rater reliability score, which measures
the similarity of their grades. We used the existing scorings of
other tutors of the system to train a random forest regressor for
predicting scorings based on the texts. We found that the
accuracy has a strong relationship to the inter-rater reliability score
and propose a new measurement that combines both metrics for
scenarios where data was annotated by tutors whose scorings could,
in principle, be diverse.


Keywords 


Tutoring systems, scorings, data labeling, inter-rater reliability, 
cross-validation 


1. INTRODUCTION 


As long as tutor scorings are used as a basis to train machine 
learning systems, there is a bias of subjectivity. Research has shown 
that the agreement among scores given by tutors often varies [1]. 
Depending on the task and scale, tutors reach different inter-rater 
reliability scores. Practical settings have shown that even when
teachers were trained for grading, there is a gap. Thus, formal
exams are often graded twice, and if there is a large gap, a
third grader needs to be consulted. For the field of machine
learning, we need thousands of scored tasks, e.g. for automated
essay grading. From a practical point of view, it is understandable that
scorings cannot be done by the same tutor all the time. Tutors’ time
is a limited resource and thus tasks need to be scored by
different experts. If we consider machine learning approaches,
there are many examples of prediction tasks where researchers try
to imitate teacher scorings based on different features. As the
reduction of a text or task to features removes information that
could be important for a good evaluation, automatic scorings
cannot be perfect. Because the data gathered by tutors contains
scorings that are not always equal even for the same texts or tasks, we think
that it is not fair to compare prediction accuracies in education in
general if we use tutor-labeled datasets.


In machine learning, the proper way to decide whether a system
generalizes well is to do cross-validation [2]. To do so, the data is
split into several pieces. The model is trained on all the data
except for one piece. This piece is used to evaluate the model, as
we know its features and the true label. Based on the features, the
system creates a prediction using the trained model. The predicted
label can be compared with the known one. Repeating this for every
piece, using the leave-one-out method (or an alternative), yields an
averaged accuracy. The main advantage of this method is that it creates
predictions on previously unseen data. Thus the evaluation shows
whether a model generalizes well. Observing this value in detail,
we often notice that the accuracies are between 0.6 and 0.8, e.g. 0.6
for 8 classes and 0.78 for 4 classes in [3] or 0.7 for 4 classes in
[4]. From the perspective of machine learning, these are bad values
as it means that 2 to 4 of 10 predictions are wrong.
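

The procedure described above can be sketched in a few lines. This is a generic k-fold illustration, not the evaluation code used in this paper: the function names, the interleaved fold assignment, and the majority-class stand-in model are assumptions chosen to keep the example self-contained.

```python
from collections import Counter


def k_fold_accuracy(features, labels, k, fit, predict):
    """Average accuracy over k folds; each fold is held out exactly once."""
    n = len(labels)
    if k < 2 or k > n:
        raise ValueError("k must be between 2 and the number of examples")
    fold_accuracies = []
    for fold in range(k):
        test_idx = list(range(fold, n, k))  # every k-th example forms a fold
        held_out = set(test_idx)
        train_idx = [i for i in range(n) if i not in held_out]
        model = fit([features[i] for i in train_idx],
                    [labels[i] for i in train_idx])
        correct = sum(predict(model, features[i]) == labels[i] for i in test_idx)
        fold_accuracies.append(correct / len(test_idx))
    return sum(fold_accuracies) / k


def fit_majority(train_features, train_labels):
    """Stand-in 'model': remember the most frequent training label."""
    return Counter(train_labels).most_common(1)[0][0]


def predict_majority(model, feature):
    """Always predict the remembered majority label."""
    return model


# Toy data: 20 texts, three quarters with grade 2 and one quarter with grade 3.
toy_labels = [2, 2, 2, 3] * 5
toy_features = [[i] for i in range(20)]
toy_accuracy = k_fold_accuracy(toy_features, toy_labels, 5,
                               fit_majority, predict_majority)
```

Any real classifier can be plugged in through the `fit` and `predict` arguments; the baseline here only shows how the held-out predictions are compared with the known labels.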


To have a good and fair measurement for comparison, it is necessary
to take the inter-rater reliability of human raters into account because,
in general, the prediction cannot be better than the agreement among
the raters whose ratings were used for training the machine. The inter-rater
reliability is a score of consistency among raters. According to
McGraw & Wong [5], the minimum value should be 0.6 as the cut- 
off for acceptability. Wang & Michelle [6] did a comparative study 
to compare human essay scoring and reached an inter-rater 
reliability score (IR score) of 0.62, using the Intraclass Correlation 
Coefficient. Williamson proposes that an IR score lower than 0.7 is 
not applicable [7]. 


The accuracy of predictions is often measured as the comparison of 
the prediction of the machine and the rater annotations. But the 
machine itself was trained based on the raters' scores, which could
differ among raters [8] [1]. It is not surprising that the predicted 
scorings by machines correlate with the human rater scorings as 
they are the training base [6]. In contrast, Williamson has shown 
statistically significant differences between human and machine 
rating scores [7]. The question remains: comparing all the systems, 
what is the best and most applicable one? Using the accuracy only 
fails as the major problem is the quality of the training data — and 
not the resulting accuracy in cross-validation. 


In this paper, we propose an extension of the cross-validation to 
have a fair measurement for comparing educational predictions, 
where training data was gathered from tutors. We focus on 
language learning and examine two research questions: 


RQ1: What is the correlation between the inter-rater reliability score in
essay scoring for language learning and the prediction
accuracy?


RQ2: Combining the cross-validation with the inter-rater reliability 
score, what is a fair interpretable measurement taking both metrics 
into account? 


2. METHODOLOGY 


To address RQ1, two tutors were asked to score open text
submissions for three tasks. The tasks had different difficulty levels:
easy (1), medium (2), and difficult (3). For every task, we had 400
user submissions, 1,200 in total. Both tutors got access to the tasks
and received 10 typical scorings as pre-training. Then, all
submissions were scored by both tutors independently of each
other, using scores of 1 (very good) to 4 (bad/not acceptable). The
scoring procedure took 1 week for each tutor.


Then we prepared a random forest regressor [9] as a classifier to
train a prediction model for essay scoring based on at least 1,200
scorings for each task that already existed in the learning
system, independently of the scorings from the previous step. These
scorings were created by different tutors, and each text was scored
only once. We did not train on the data of the previous step,
keeping the newly labeled dataset separate for the comparison. From
practical settings we know that intermediate grades are quite
subjective; thus we concentrate on grades 2 and 3 only, which
represent “good” and “satisfactory” and are used as labels for the
classification problem. The accuracy for prediction in cross-
validation (CV) was gathered for each task separately.


Within the next step, we compared the similarity among both tutors 
of the first step with the accuracy of the second step to examine a 
possible relation. Finally, we propose a combination of both 
metrics that allows a fair comparison of the prediction accuracy with
the IR Scores to address RQ2. 


3. RESULTS 


Figure 1 shows all IR scores and the prediction accuracy in a 10- 
fold CV. We can see that there is a good correlation between these 
metrics (correlation 0.88). We used the same approach for all tasks, 
but the IR-scores vary from 0.45 to 0.74, and the accuracies in 10- 
fold CV range from 0.47 to 0.64. The results show that the 
accuracy and the IR score both vary depending on the task. However,
there is a strong positive relationship between the maximal
achieved prediction accuracy and the IR score.


Figure 1. Inter-rater reliability score of two tutors for
three tasks, separated by increasing difficulty level and 
prediction accuracies in cross-validation. 


4. NEW MEASUREMENT 


The main idea is to combine the classical cross-validation with the 
inter-rater reliability score. The CV addresses the accuracy of a 
trained prediction model. As there are multiple versions of the CV, 
e.g. leave-one-out or leave-p-out (where p is the number of samples left out),
we use CV as a general concept and do not limit our approach to a 
specific version. 


The similarity of tutor scorings can be measured by using a 
correlation coefficient. We chose the Pearson correlation 
coefficient (PCC) for applying to a sample [10]. It is not outlier 
resistant [11], but in the area of learning, large gaps can in principle
occur in ratings, e.g. the score from one rater is “very good / 1” and 
from another, it is “very bad / 4”. This will impact the resulting 
correlation coefficient. For our new measurement, this is important 
as this gap influences the training data as well and thus, it influences 
the prediction accuracy negatively due to a large bias. 


We propose a combination of both metrics, namely CVpcc, defined 
by the following formula: 


$$CV_{pcc} = 1 - |CV - PCC^{+}|$$


under the constraint $0 \le CV, PCC^{+} \le 1$. The CV accuracy is
defined as a number between 0 and 1 [2] and $PCC^{+} = |PCC|$ as we
only consider the similarities, not whether the PCC is positive or
negative. The CVpcc is a new value where
$CV_{pcc} \in [0,1]$, similar to the CV. In the following paragraph, we
show that CVpcc cannot be smaller than 0 and never more than 1.
Let $CV$ and $PCC^{+}$ be defined as above. Then we examine whether
$\exists\, CV, PCC^{+}: 1 - |CV - PCC^{+}| < 0$ or $1 - |CV - PCC^{+}| > 1$.

$$1 - |CV - PCC^{+}| < 0 \iff 1 < |CV - PCC^{+}|$$

With $PCC^{+} \ge 0$ we set $PCC^{+} = 0$ to maximize the value for
$|CV - PCC^{+}|$. As $0 \le CV \le 1$, the maximum value for CV is 1.
This follows: $1 < |1 - 0| \Rightarrow 1 < 1$, which is a contradiction.

$$1 - |CV - PCC^{+}| > 1 \;\Rightarrow\; 1 > 1 + |CV - PCC^{+}| \;\Rightarrow\; 0 > |CV - PCC^{+}|$$


The absolute value satisfies $\forall x \in \mathbb{R}: |x| \ge 0$ [12]. Thus, with
$x = CV - PCC^{+}$: $0 > |x| \iff |x| < 0$. According to the definition
of the absolute value, no such $x$ exists in $\mathbb{R}$. Finally, we showed
the second contradiction and can conclude that $CV_{pcc} \in [0,1]$. $\square$
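

To make the definition concrete, the following minimal sketch computes a Pearson correlation between two raters and combines it with a CV accuracy as defined above. The function names, the toy grade lists, and the numeric inputs are illustrative assumptions and do not correspond to the values measured in our experiment.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient of two equally long score lists."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two lists of equal length with at least 2 items")
    mean_x, mean_y = sum(xs) / len(xs), sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for constant scores")
    # Clamp to [-1, 1] to guard against floating-point drift.
    return max(-1.0, min(1.0, cov / math.sqrt(var_x * var_y)))


def cv_pcc(cv_accuracy, pcc):
    """Combined measure: 1 - |CV - PCC+|, where PCC+ = |PCC|."""
    if not 0.0 <= cv_accuracy <= 1.0:
        raise ValueError("CV accuracy must lie in [0, 1]")
    if not -1.0 <= pcc <= 1.0:
        raise ValueError("PCC must lie in [-1, 1]")
    return 1.0 - abs(cv_accuracy - abs(pcc))


# Hypothetical grades (1 = very good ... 4 = bad) given by two tutors.
tutor_a = [1, 2, 2, 3, 4, 3]
tutor_b = [1, 2, 3, 3, 4, 2]
agreement = pearson(tutor_a, tutor_b)

# Hypothetical CV accuracy of 0.60 combined with a PCC of 0.70.
combined = cv_pcc(0.60, 0.70)
```

With these illustrative inputs the combined value is about 0.90; the closer the CV accuracy is to the rater agreement, the closer the result is to 1.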


In Figure 1 we can see that the CV accuracy, as well as the IR 
scores, range from 0.44 to 0.74. If we just compare the CV 
accuracy, we can conclude that there is a high fluctuation. Using 
the new CVpcc, the scores range from 0.89 to 0.96. Here, the 
fluctuation is much lower and we now can compare this value with 
other tools and different datasets. 


5. DISCUSSION 


In general, we know that having a low inter-rater reliability score is 
an indicator of a bad quality of training data. Although we used the 
same amount of data to train the classifier for each task we can 
observe that it is not fair to compare the achieved accuracy in 
prediction only. As there is a strong positive relationship between 
the accuracy and the inter-rater reliability score we propose to 
combine both metrics when comparing the result with other 
datasets. Otherwise, as our results show, the accuracy
differs across tasks, and results depend on the task selection.


In an optimal setting, where all scorings are the same for equal texts
across different raters (PCC⁺ = 1), it follows that CVpcc = CV.
At the other extreme, where the PCC⁺ and the CV
values are both very low, we can still achieve a high CVpcc, as the
accuracy will be low if labels are diverse for equal feature values.
The larger the gap between PCC⁺ and CV, the lower CVpcc
will be, which means that the relation between both metrics is weak.
With that information we address RQ2: the CVpcc is an indicator of
whether the model needs improvement or whether the accuracy
cannot become better because the training base has a low quality due to
diverse labeling based on different quality expectations of tutors.
This interpretation of the value can be helpful to optimize the 
model. As we use cross-validation as a general metric, our approach 
is not limited to specific classification methods. We used the 
random forest regressor, but we can use other classification-based 
methods like neural networks, support vector machines, or others 
as long as we get access to the CV score. 


We need to emphasize that our method requires a further labeling 
step to get the inter-rater reliability score across at least two tutors, 
where each text needs to be labeled twice. This increases the 
labeling costs. To reduce the amount of work, we could in principle
have a new tutor label only a subset of the already labeled texts
to understand the data quality. If a low value is detected,
we know that the resulting accuracy will differ from experiments
with other datasets due to the low agreement. We can argue that
knowing the problem of diverse scorings is a good foundation for
optimizing further scorings through better pre-training of raters. But in
practice, thousands of labels often already exist, based on data that
was collected over the last years. Thus, optimization is possible only
for future data collection. If we want to use existing
datasets, we propose to use the CVpcc for a fair comparison in 
relation to other datasets. 


In our experiments, we used two separate datasets, one that contains 
the scorings of the two tutors and one much larger set, where more 
texts were scored by other tutors. The first was used to get the 
PCC⁺ score and the other to train the classifier based on the
maximum achievable CV score. To benefit from the extra labeling, 
we could enhance the training dataset by the data where the two 
tutors had equal scorings for the same texts. 


Our proposed metric is limited to datasets that were annotated 
manually. If we have labels that are automatically processed (e.g. 
the achieved scores in interactive tasks in an online course or 
whether a student drops out), normally we do not have a diverse 
annotated dataset. Thus we recommend using the CVpcc in all
scenarios where tutors are involved and where diverse annotations 
(e.g. in scorings) play a role. This is early-stage research, limited to 
three difficulty levels of specific open-writing tasks. To generalize 
our findings, the next step is to compare more tasks and the 
resulting CVpcc. Besides, further studies in other learning domains 
are required to verify the found relations of the metrics. Our first 
findings are promising. 


6. CONCLUSION 


In this study, we examined the relation of the inter-rater reliability 
of tutor scorings and the accuracy that can be achieved to predict 
two concrete ratings. In our setting of language learning, we 
focused on three open writing tasks of different difficulty levels, 
whose prediction accuracies differ. Based on our results, we
observe that there is a strong relationship between both scores, even 
though both metrics were derived using datasets from multiple 
raters. Thus we can see that datasets, labeled by tutors, can differ. 
This affects the data quality and the maximum achievable accuracy
in prediction. To use data that tutors may have annotated inconsistently and
to compare the prediction results, we propose a new method of
combining both metrics to allow fair comparison across different 
datasets. This new metric can help scientists in educational data 
mining to compare results of different tutor-based labeled datasets 
and it helps to understand whether a model or the dataset needs 
improvement. 


ABSTRACT


Automatic Question Generation seeks to generate questions 
about a given text for educational purposes such as testing 
students’ comprehension processes while reading. This pa- 
per focuses on the task of predicting the next sentence as a 
way to exercise and assess a crucial skill that comprehension 
questions often fail to test, namely relating sentences to the 
context preceding them. We train a BERT-based model of 
text coherence to estimate the probability that a given sen- 
tence will come next in a story. It achieves 68.4% AUC on 
a held-out test set, significantly above chance. We define an 
easiness score as the difference between the estimated prob- 
abilities of the next sentence and (the likelier of) two dis- 
tractors, namely the two subsequent sentences. We evaluate 
our model on data from Project LISTEN’s Reading Tutor 
by correlating the easiness scores of 1,023 questions against 
the percentage answered correctly by 274 children. A strong 
correlation would make it possible to filter such questions by 
difficulty for children at a specified reading level. Unfortu- 
nately, the easiness scores of the questions did not correlate 
with the correctness of children’s answers to them. 


Keywords 

Automatic question generation, difficulty prediction, next- 
sentence prediction, reading comprehension assessment, nat- 
ural language processing, BERT 


1. INTRODUCTION 


A crucial skill in reading comprehension is inter-sentential 
processing — integrating meaning across sentences. It in- 
volves analysis of cohesive relationships such as coreference, 
indirect reference, and ellipsis [3]. Inter-sentential processing 
is hard for young readers partly because it requires assim- 
ilation from short-term memory to mid-term memory [12]. 
Unfortunately, reading comprehension questions often fail to 
assess inter-sentential information integration [1, 13, 14]. 


Next-sentence prediction questions are a natural way to test 


Context: Everyone knows that the elephant has a very long nose. But a long time
ago, the elephant's nose was short and fat. Like a shoe in the middle of its face.

Does this sentence come next?
- She had a question for every animal.

Which sentence comes next?
- She was curious about everything.
+ One day a baby elephant was born.
- She had a question for every animal.


Figure 1: Two forms of next-sentence prediction questions. 
Answers in green are correct and answers in red are incorrect. 


inter-sentential processing and are easy to generate. They 
are also easy to score, because by definition the correct an- 
swer is the next sentence. One form of such questions is 
true/false, i.e., “Does this sentence come next?” Another 
form is multiple choice, i.e., “Which sentence comes next?” 
This form has a higher cognitive load because it requires con- 
sidering multiple sentences, but may be easier than judging 
a single candidate sentence by itself. Figure 1 shows both. 


Although easy to generate and score, next-sentence predic- 
tion questions can be hard to answer correctly. For exam- 
ple, one study [2] randomly inserted “Which sentence comes 
next?” questions in children’s stories, with the next three 
sentences of the story in random order as the choices. Chil- 
dren answered only 41% of these questions correctly, barely 
above chance and frustratingly low. 


Good questions should be challenging but not frustratingly 
hard. Therefore, difficulty control is important in automatic 
question generation. However, despite the rapid develop- 
ment of question generation, little work has analyzed the 
difficulty of automatically generated questions [9], especially 
for reading comprehension [6, 7, 16], and none of it addresses 
next-sentence prediction questions. 


This paper addresses the difficulty of such questions, and is 
organized as follows. Section 2 describes how we trained a 
coherence model to estimate the probability that a sentence 
comes next given the preceding context, and how we used 
it to score question easiness. Section 3 evaluates this model 
on a corpus of children’s stories. Section 4 correlates the 
easiness scores of the questions against the percentage of 
children who answered the questions correctly. Section 5 
concludes. 


2. COHERENCE ESTIMATION 


To estimate the coherence between a given context and sen- 
tence, we fine-tuned a BERT-based binary classification model. 


[Figure 2 diagram, bottom to top: input "Context + [SEP] + Candidate Sentence"; BERT; FFNN + Sigmoid; coherence score (e.g., 0.95); output IsNext 95% / NotNext 5%]


Figure 2: Architecture of the BERT-based model for coher- 
ence estimation. 


BERT [5], a widely used Transformer-based language model, 
has achieved state-of-the-art performance on a large suite of 
natural language processing tasks. The blue box in Figure 2 
shows the architecture of the pre-trained BERT model. To 
do classification, it appends a 2-layer feed-forward neural 
network (FFNN) to the BERT model, followed by a sigmoid 
function to scale the FFNN’s output between 0 and 1. 


BERT was pre-trained on BooksCorpus (800M words) [17] 
and English Wikipedia (2,500M words) with two objectives. 
First, randomly masking various words in a text and pre- 
dicting the masked words from the surrounding text forced 
BERT to embed each word based on the surrounding words. 
Second, predicting whether one sentence follows another sen- 
tence in the original text forced BERT to learn inter-sentential 
coherence. Thus these two objectives prompted BERT to 
learn both intra- and inter-sentential semantic structure. 


The effect of the next-sentence prediction task in pre-training 
has recently been questioned [4, 8, 15]. Some researchers be- 
lieve that BERT actually learns inter-sentential topic simi- 
larity rather than coherence, because its negative instances 
are sentences sampled randomly from the entire text corpus, 
which are likely to be topically unrelated to the context. 


We now describe how we adapted the BERT-based model 
to estimate inter-sentential coherence in children’s stories. 


Input: We fine-tuned the pre-trained BERT-based model on 
input token sequences of the following form: 


• a special token [CLS] used for classification tasks


• three sentences of context, which we assume suffice to
capture the semantically relevant content. Any more
might include irrelevant information or exceed BERT’s
input length limit of 512 word pieces (i.e., roots and
morphemes).


• a special separator token [SEP]


• a candidate next sentence; for positive instances, the
sentence immediately following the context.


Selection of negative instances: We wanted the task to test 
children’s judgment of inter-sentential coherence, not merely 
topical relevance. Therefore, rather than sample negative in- 
stances randomly from the entire corpus, we selected them 
from the same story, specifically the 2 sentences immedi- 
ately following the correct sentence, which are likelier to be 
topically relevant to the local context than sentences from 
later in the story. Using the 3 sentences following the con- 
text as the multiple choice candidates also matched the task 
performed by the children in our evaluation dataset, to be 
described in Section 4. 


Human experts could presumably pick contexts and distrac- 
tors more judiciously to test children’s judgements of inter- 
sentential coherence. However, such manual selection is nei- 
ther economical nor scalable. One goal of this work was to 
identify requirements for choosing better contexts and dis- 
tractors so as to improve automated selection. 


Positive-negative ratio of training instances: BERT was pre- 
trained on equal numbers of positive and negative instances. 
In contrast, the 3 candidate sentences after the 3-sentence 
context included one positive instance and two negative in- 
stances. 


Training labels: To fine-tune BERT and train the FFNN, 
we set the output of the combined model to 1 for positive 
instances and 0 for negative instances. 


Easiness scores: To measure each candidate sentence’s coher- 
ence with the given context, we used the probability output 
by the sigmoid function. Given this measure of coherence, 
we used a simple heuristic to rate the easiness e of answering 
a 3-choice question: 


$$e = c_{pos} - \max(c_{neg_1}, c_{neg_2}) \qquad (1)$$


Here $c_{pos}$ is the coherence of the correct answer, and $c_{neg_i}$
is the coherence of distractor $neg_i$. This formula assumes
that the difficulty of the question depends on whichever dis- 
tractor is more coherent with the context. (As a reviewer 
suggested, we also tried the log ratio of the two coherence 
scores instead of their difference, but it performed the same 
in the evaluation reported in Section 4.) 
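
As a minimal sketch of the easiness heuristic in Equation (1), the function below takes the three coherence scores of a question and returns the difference between the correct answer's score and that of the stronger distractor. The function name and argument names are illustrative, not taken from the authors' code; the sample values are the coherence scores shown for the Figure 5 example in the Appendix.

```python
def easiness(c_pos: float, c_neg1: float, c_neg2: float) -> float:
    """Equation (1): coherence of the correct answer minus the
    coherence of whichever distractor is more coherent."""
    return c_pos - max(c_neg1, c_neg2)


# Coherence scores from the Figure 5 example (correct answer 0.337,
# distractors 0.101 and 0.599).
score = easiness(0.337, 0.101, 0.599)
print(round(score, 3))  # -0.262: a distractor outscored the correct answer
```

A negative value simply means that the model found one distractor more coherent with the context than the true next sentence.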


Figures 4 and 5 show example questions with easiness scores 
of 0.244 and -0.262, respectively (see Appendix). A negative 
easiness score occurs when a distractor has greater coherence 
than the correct answer. 


3. EVALUATION OF COHERENCE MODEL 


We now evaluate how accurately our coherence model classified
the 3 sentences following a 3-sentence context as IsNext
or NotNext.


3.1 Text Dataset 


We constructed a dataset for fine-tuning and evaluating our 
coherence model from a corpus of English-language chil- 
dren’s stories from two sources: 




Table 1: Examples of Cases Removed by Data Cleaning


Type | Context | Correct Answer | Choices
two identical choices | ...Did the frog slip? | Yes. | <Yes.> <Yes.> <The frog swam fast.>
one choice appearing in the context | ... Yes, | Yes. | <Yes.> <The frog swam fast.> <It went past Pat.>
very short context | Pop can twist and bend. Pop slips! Pop stops. | Pop sits. | <Pop sits.> <Pat slaps Pop’s hand.> <Pop must rub his feet!>
unfinished sentence in the context / content about phonics instruction | real meat peak | What sound do the letters e a make in the words real, meat, and peak? | <What sound do the letters e a make in the words real, meat, and peak?> <near> <leap>


• 337 stories from Project LISTEN’s Reading Tutor [2],
totalling 39K words with a vocabulary of 8K distinct
words, at grade levels K-7.


• 354 stories from www.africanstorybook.org totalling
91K words with a vocabulary size of 11K, with page
lengths ranging from one word to multiple paragraphs.


For fine-tuning and evaluation, we split the 337 LISTEN 
stories into three subsets, with 60% for training, 20% for 
hyper-parameter tuning, and 20% for testing, so as to ensure 
that stories in the test set were not seen during training. We 
used the African Storybook stories to augment the training 
set. 


For every story in the corpus, we used a 6-sentence slid- 
ing window to generate next-sentence prediction items of 
the form ([3-sentence context; IsNext sentence; Not- 
Next sentence; NotNext sentence]), with the correct (Is- 
Next) sentence and two (NotNext) distractors to be pre- 
sented in random order. 
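
The sliding-window construction described above can be sketched as follows. This is an illustration under stated assumptions, not the authors' code: the function name `make_items`, the dictionary keys, and the stride of one sentence are all hypothetical (the text does not state the stride), and sentence segmentation is assumed to have been done already.

```python
import random
from typing import Dict, List


def make_items(sentences: List[str], context_len: int = 3,
               n_choices: int = 3, seed: int = 0) -> List[Dict]:
    """Slide a (context_len + n_choices)-sentence window over one story.

    The first candidate after the context is the IsNext sentence; the
    remaining candidates are NotNext distractors. Stride of 1 is an
    assumption made for this sketch.
    """
    rng = random.Random(seed)
    window = context_len + n_choices
    items = []
    for start in range(len(sentences) - window + 1):
        context = sentences[start:start + context_len]
        candidates = sentences[start + context_len:start + window]
        choices = candidates[:]
        rng.shuffle(choices)  # choices are presented in random order
        items.append({
            "context": context,
            "correct": candidates[0],
            "distractors": candidates[1:],
            "choices": choices,
        })
    return items


story = [f"s{i}." for i in range(8)]  # placeholder sentences
items = make_items(story)
print(len(items))           # 3 windows fit in an 8-sentence story
print(items[0]["correct"])  # s3.
```

Stories shorter than six sentences yield no items under this scheme.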


To clean the data, we filtered out several cases (illustrated 
in Table 1): 


• Cases with two identical choices or a choice appearing
in the context: typically caused by repeated sentences
in a conversation.


• Cases with context or a candidate sentence exceeding
125 words: might cause the input sequence to exceed
BERT’s input length limit of 512 word pieces.


• Cases with very short context: typically caused by
short sentences in a conversation that provide too little
information to predict which sentence belongs next.


• Cases with an unfinished sentence in the context: for
some poems or phonics instructions, sentences were
not segmented according to sentence separators.


• Cases about pronunciation or spelling: not relevant
to semantic coherence.


• Cases with the same context followed by different
sentences: may confuse the model during training.


The resulting dataset consisted of 10,761 instances
for training, a development set of 1,716 instances for
hyper-parameter tuning, and a test set of 2,340 instances for
evaluation.


3.2 Training 

To fine-tune our coherence model, we used BERT-base [5] as
the backbone, and the AdamW optimizer [10] with an initial
learning rate of 1e-3 and a ReduceLROnPlateau scheduler¹.
We used a ReLU [11] activation in the hidden layer
of the FFNN, and set the dropout probability of this hidden
layer to 0.5. We trained the model with a standard binary
cross-entropy loss function weighted by the positive-negative
sample ratio of 1:2.
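
To make the class weighting concrete, here is a small pure-Python sketch of a weighted binary cross-entropy. It assumes that "weighted by the positive-negative sample ratio of 1:2" means the loss on positive (IsNext) instances is multiplied by 2 to offset their being half as frequent; the text does not spell this out, so `pos_weight=2.0` is an assumption. The weighting scheme is similar in spirit to the `pos_weight` argument of PyTorch's `BCEWithLogitsLoss`, but this sketch operates on probabilities rather than logits.

```python
import math
from typing import Sequence


def weighted_bce(probs: Sequence[float], labels: Sequence[int],
                 pos_weight: float = 2.0, eps: float = 1e-7) -> float:
    """Mean binary cross-entropy with positive instances up-weighted.

    probs  : sigmoid outputs (coherence scores) in [0, 1]
    labels : 1 for IsNext, 0 for NotNext
    """
    total = 0.0
    for p, y in zip(probs, labels):
        p = min(max(p, eps), 1.0 - eps)  # avoid log(0)
        total -= pos_weight * y * math.log(p) + (1 - y) * math.log(1.0 - p)
    return total / len(probs)


# One positive and two negatives, as in a single 3-choice item.
print(weighted_bce([0.5, 0.5, 0.5], [1, 0, 0]))
```

With this weighting, the single positive in each item contributes as much to the loss as the two negatives combined when all three scores are equally uncertain.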


In contrast to pre-training BERT’s hundreds of millions of 
parameters from scratch, fine-tuning the BERT-based coher- 
ence model was inexpensive. It took only about 5 minutes 
on a single Tesla-V100 GPU to optimize the parameters on 
the training set. 


3.3 Evaluation Results 

Table 2 evaluates the coherence model on the development 
and test sets using various metrics: accuracy, weighted-average
precision, recall and F1-score, and area under the
ROC curve (AUC). To evaluate metrics other than AUC,
we set the classification threshold to 0.5 and compared the
predicted label with the ground truth label. In other words,
we classified an instance as IsNext if the output probability
(coherence score) exceeded this threshold, otherwise as
NotNext. AUC measures the entire area beneath the ROC
curve, which plots true positive rate vs. false positive rate at 
different classification thresholds. AUC evaluates the overall 
performance of a classification model by aggregating across 
all possible classification thresholds. 
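
The two kinds of metric can be illustrated with a short sketch. Thresholded accuracy depends on the 0.5 cutoff, whereas AUC can be computed without any threshold as the probability that a randomly chosen positive instance receives a higher score than a randomly chosen negative one (ties counted as one half). The function names are illustrative; the sample scores are the three coherence values of the Figure 5 example, used here only as toy input.

```python
from typing import List, Sequence


def classify(scores: Sequence[float], threshold: float = 0.5) -> List[int]:
    """Label an instance IsNext (1) if its score exceeds the threshold."""
    return [1 if s > threshold else 0 for s in scores]


def accuracy(pred: Sequence[int], labels: Sequence[int]) -> float:
    return sum(p == y for p, y in zip(pred, labels)) / len(labels)


def auc(scores: Sequence[float], labels: Sequence[int]) -> float:
    """Area under the ROC curve via pairwise comparison of
    positive and negative scores (rank-based formulation)."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


scores = [0.337, 0.101, 0.599]  # correct answer, distractor 1, distractor 2
labels = [1, 0, 0]
print(classify(scores))         # [0, 0, 1]
print(auc(scores, labels))      # 0.5
```

The pairwise formulation is quadratic in the number of instances, which is fine for illustration but would normally be replaced by a sort-based computation on larger sets.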


Table 2: Evaluation of the Coherence Model

Dataset | Accuracy | Precision | Recall | F1-score | AUC
Dev     | 0.608    | 0.663     | 0.608  | 0.620    | 0.662
Test    | 0.609    | 0.679     | 0.609  | 0.619    | 0.684


4. EVALUATION ON CHILDREN’S DATA 


We evaluated our easiness scores by correlating them against 
274 children’s performance on next-sentence prediction ques- 
tions. These questions were inserted randomly by the spring 
2003 version of Project LISTEN’s Reading Tutor into 179 
English-language stories ranging from grades 3-7. None of 
these stories were in the dataset used to train the coher- 
ence measure used to score easiness. The questions asked 
“Which will come next?” and presented the next three story 


¹https://pytorch.org/docs/stable/optim.html#torch.optim.lr_scheduler.ReduceLROnPlateau


Figure 3: Percentage correct binned by easiness score.


sentences in random order. After data cleaning, we had 1,023
distinct questions with 1,626 responses, of which 45.7% were 
correct. 


344 of these questions had choices with differences in capital- 
ization, as illustrated in Figure 6 (see Appendix). Children 
might conceivably have used these differences as a clue to 
eliminate incorrect choices. However, their 622 responses to 
choices capitalized differently had virtually the same (in fact 
slightly lower) percentage correct (45.5%) as their 1004 re- 
sponses to choices capitalized the same (45.9%). Evidently 
children did not make use of this clue. Accordingly, we did 
not exclude these 622 responses from our dataset. 


The questions averaged only 1.59 responses each, far too 
few to reliably estimate the percentage correct for individ- 
ual questions. Instead, we split questions by easiness scores 
into N bins with equal numbers of questions. For N=10, % 
correct ranged from 41.2% to 52.5%. Figure 3 shows a bar 
chart with a bar for each of the 10 bins, its average easiness 
score to its left, and its % correct as its width. The % cor- 
rect was similar across all 10 bins and unrelated to easiness 
score. We tried various values of N, ranging from 3 to 128 
questions. For each value of N, we correlated the average 
easiness score of the questions in each bin against their per- 
centage of correct responses. The correlations got weaker as 
N increased, and were not statistically significant. 
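
The binning analysis can be sketched as below. This is a hedged illustration rather than the authors' procedure: it assumes that each bin's percentage correct pools all responses to the questions in that bin, and that the correlation is a Pearson correlation; the function names and the toy data are hypothetical.

```python
from typing import List, Sequence, Tuple


def bin_by_easiness(questions: List[Tuple[float, int, int]],
                    n_bins: int) -> List[Tuple[float, float]]:
    """questions: (easiness, n_correct, n_responses) per question.

    Sorts by easiness, splits into n_bins bins of (nearly) equal
    numbers of questions, and returns (mean easiness, % correct)
    for each bin.
    """
    if not 0 < n_bins <= len(questions):
        raise ValueError("n_bins must be between 1 and the number of questions")
    ordered = sorted(questions, key=lambda q: q[0])
    bins = []
    for b in range(n_bins):
        lo = b * len(ordered) // n_bins
        hi = (b + 1) * len(ordered) // n_bins
        chunk = ordered[lo:hi]
        mean_e = sum(q[0] for q in chunk) / len(chunk)
        pct = 100.0 * sum(q[1] for q in chunk) / sum(q[2] for q in chunk)
        bins.append((mean_e, pct))
    return bins


def pearson(xs: Sequence[float], ys: Sequence[float]) -> float:
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    return cov / (vx * vy) ** 0.5


# Toy data, not from the study.
toy = [(-0.2, 0, 1), (0.0, 1, 2), (0.1, 1, 1), (0.3, 2, 2)]
bins = bin_by_easiness(toy, n_bins=2)
r = pearson([e for e, _ in bins], [p for _, p in bins])
```

Pooling responses within a bin sidesteps the problem noted above that individual questions have too few responses to estimate their percentage correct reliably.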


To explore why, we regressed response correctness against 
several features of questions, namely the length and contex- 
tual coherence of the correct answer and the two distractors, 
the length (in characters) of the context, the position of the 
question in the story (the number of sentences preceding it), 
and the grade level of the story. We normalized the value of 
each feature x as (x - x_min)/(x_max - x_min). We performed
logistic regression with the normalized feature values
for each question as numerical inputs and the correctness of 
the child’s response as binary output. None of the regression 
coefficients differed significantly from zero. However, their 
general pattern makes qualitative sense. The contextual co- 
herence of the correct answer was the strongest positive pre- 
dictor, which makes sense because it measures how well the 
answer fit the context. The coherence of the harder dis- 
tractor was the strongest negative predictor, which makes 
sense because it measures how well that distractor fit the 
context. The length of the correct answer and the number 
of preceding sentences in the story were positive predictors, 
which makes sense because they measure the amount of
information provided for selecting the correct answer. Context
length and the grade level of the story were negative predic- 
tors, which makes sense because reading longer sentences 
and higher level stories was harder (though better readers 
read harder stories). 
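
The feature normalization used before the regression, (x - x_min)/(x_max - x_min), can be written as a short helper. The function name is illustrative, and mapping a constant feature to zeros is a choice made for this sketch (the formula is undefined when x_max equals x_min).

```python
from typing import List, Sequence


def min_max_normalize(values: Sequence[float]) -> List[float]:
    """Rescale a feature to [0, 1] via (x - x_min) / (x_max - x_min)."""
    lo, hi = min(values), max(values)
    if hi == lo:
        # Constant feature: the formula would divide by zero, so
        # this sketch returns zeros (an assumption, not from the paper).
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]


# Hypothetical context lengths in characters.
print(min_max_normalize([10, 20, 40]))
```

Putting all features on the same [0, 1] scale makes the magnitudes of the regression coefficients roughly comparable across features.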


5. CONCLUSIONS 


This paper addresses two hypotheses regarding the use of 
next-sentence prediction questions in assessing children’s inter- 
sentential processing during reading comprehension. 


Hypothesis 1: An automated measure of text coherence can 
predict which of the next 3 sentences will come first. To test 
hypothesis 1, we trained a BERT-based model of a sentence’s 
coherence with the preceding context to predict whether it 
comes next. It achieved 61% accuracy on a held-out test set. 


Hypothesis 2: An easiness metric based on this measure can 
predict children’s accuracy in selecting the next sentence. To 
test hypothesis 2, we scored the easiness of the 3-way choice 
as the coherence of the correct next sentence minus the co- 
herence of the strongest competitor. We then related this 
score to children’s performance on 1,023 such questions pre- 
sented by Project LISTEN’s Reading Tutor to the children 
while they were using it. There was virtually no correla- 
tion. Children answered approximately 45% of the questions 
correctly regardless of their easiness scores or whether the 
BERT-based model answered them correctly. 


5.1 Limitations and Future Work 

If hypothesis 2 were true, we could use a BERT-based co- 
herence model to estimate the difficulty of deciding whether 
a given sentence will come next in a story context. We could 
then control question difficulty by using this estimate to help 
decide which sentence prediction questions to ask. Unfortu- 
nately, our results did not support hypothesis 2, which raises 
the issue of why they did not. The predictor coefficients in 
our regression analysis to explore this issue made qualitative 
sense but were not statistically significant. 


Perhaps children’s performance was affected by the added 
memory load of considering three sentences as choices. Fu- 
ture work could kid-test the simpler question “Is this next?”. 


Another possibility is that our coherence model was too im- 
poverished to reflect children’s inter-sentential processing. 
A richer model could capture other aspects such as causal 
relations, world knowledge, and inference important in story 
understanding. Or perhaps our BERT model merely needed 
better adaptation to the domain of children’s stories. 


An IRT model predicts probability of correctness based on 
student proficiency minus question difficulty. We did not 
take direct account of children’s differing proficiency, but 
the Reading Tutor gave children stories at their own read- 
ing level, accounting for their proficiency indirectly. Future 
analyses may need to account for proficiency explicitly.


APPENDIX


Context: George's favorite subject was math. George learned to
be a surveyor of land when he grew up. He joined the army and
was a leader during the American Revolution.


Choices | Coherence | Easiness
Correct Answer: He later became the first President of the United States. | [illegible] | 0.244
Distractor 1: George Washington is called the "Father of our Country." | [illegible] |
Distractor 2: We celebrate his birthday on President's Day in February. | [illegible] |


Figure 4: A question with easiness score of 0.244.


Context: Both Brad and Sally pointed their flashlights into the
dark. All they saw were some spider webs and a dead end. The
cave was empty.


Choices | Coherence | Easiness
Correct Answer: Brad felt sad. | 0.337 | -0.262
Distractor 1: He had hoped they would find a big pirate ship or something neat. | 0.101 |
Distractor 2: Sally looked around the walls of the cave. | 0.599 |


Figure 5: A question with easiness score of -0.262.


Context: When all the straw was spun away, and all the bobbins
were full of gold. As soon as the sun rose the King came and
when he perceived the gold he was astonished and delighted.


Choices | Coherence | Easiness
Correct Answer: But his heart only lusted more than ever after the precious metal. | [illegible] | 0.447
Distractor 1: He had the miller's daughter put into another room full of straw, | 0.246 |
Distractor 2: much bigger than the first, and bade her, if she valued her life, | [illegible] |


Figure 6: A question with choices capitalized differently.


ABSTRACT


The aim of the present study was to examine sex-related behavioral
differences, which have rarely been studied, in online mathematics
classes in China. Epistemic Network Analysis (ENA) was used
to explore the connections among students’ classroom
behaviors and the differences in connection patterns between boys
and girls. The class monitoring videos of a sample of 64 students (32
male, 32 female) were coded for microscopic categories of in-class
behaviors, and all the codes were organized in the format of an
adjacency matrix. The ENA model showed significant differences: girls
were more likely to engage in social activities in class, while boys
exhibited more disruptive behaviors. There was also a relatively
stronger connection between disruptive behaviors and call-out
behaviors, and a slightly stronger connection between off-task
behaviors and disruptive behaviors, and between disruptive
behaviors and direct-no volunteer interactions for boys, compared
with girls. This study provides insight into how the connections among
different categories of classroom behaviors vary by gender,
and suggests a future direction: examining the relationship between
different behavioral connection patterns and students’ math
achievement in online math classes.


Keywords 


sex-related behavioral difference, math achievement, teacher- 
student interaction, Epistemic Network Analysis (ENA), online 
math class 


1. INTRODUCTION 
1.1 Background 


Previous studies examining sex-related differences in mathematics 
performance have reached inconsistent conclusions: while some 
reported a male advantage in math achievement, other studies 
only found sex-related differences in certain age groups and 
certain areas of mathematics ability, or even a female advantage 
in math exams [5][9][25][15][33]. 


On the other hand, although various studies have been conducted
to identify the specific contexts and factors that correlate with
the difference in mathematics achievement of female and male
students [7][11][14][21], the specific classroom behaviors and 


Yufei Gu and Kun Xu “Sex-Related Behavioral Differences in Online 
Math Classes: An Epistemic Network Analysis”. 2021. In: Proceed- 
ings of The 14th International Conference on Educational Data Min- 
ing (EDM21). International Educational Data Mining Society, 726-730. 
https://educationaldatamining.org/edm2021/ 

EDM ’21 June 29 - July 02 2021, Paris, France 


level of engagement, which have been linked by previous research 
to varying levels of mathematics achievement, have rarely been 
studied [10][26][17][22]. In a study conducted by Hart [13], boys 
were found to be more involved in public interactions in class 
with their teachers than girls, and the study indicated significant 
main effects of gender of students on two sub-categories of public 
teacher-student interaction: open volunteer interactions and
call-out interactions.


1.2 Online Math Classes in China 


During the past few years, China has witnessed an explosion of
online education platforms, which provide students with
easy access to high-quality learning materials regardless of their
geographical location. The growth of online education also
gives researchers opportunities to conduct observational
studies of teacher-student interactions without having to set up a
camera or be physically present in the classroom. Online
learning platforms thus allow
researchers to examine the behaviors of students of different sex
in online classes without influencing or interrupting how
teachers and students behave and interact in class. Accordingly, the
present study used classroom monitoring videos from Spark
EdTech, a Chinese K-12 online education platform that aims to
cultivate mathematical thinking among Mandarin-speaking
children, to examine whether sex-related behavioral differences
exist in online settings, adopting the coding rules used in Hart’s
framework [13][19][8].


1.3 Epistemic Network Analysis

Epistemic Network Analysis (ENA) is a quantitative ethnographic
technique designed to address questions in learning analytics and
model the structure of connections in a dataset. The major
assumptions of ENA are: 1) a set of meaningful features,
which are defined as codes, can be identified systematically in the
data; 2) the data have local structures, which are referred to as
conversations; and 3) the way in which codes are connected to
each other within the conversations is an important feature of the
data [27][28][29].


ENA models the connections between codes by quantifying the 
co-occurrence of codes within conversations, generating a 
weighted network of co-occurrences and visualizations for each 
unit of analysis in the data accordingly. Since all the networks of 
units are analyzed simultaneously, ENA could ideally produce a 
set of networks that can be compared visually and statistically 
alike. This method has been used not only to analyze learning
data, but also in other contexts where the structure of connections in
the data is meaningful, such as communication among health
care teams and gaze coordination during collaborative work [31][1].
Having recognized the unique power of ENA in analyzing
connections within the data, this study adopted ENA for exploring
the connections of students’ behaviors in mathematics classes, and
the differences between connection patterns for students of
different sex.


1.4 Goals 


The present study aims to: 1) examine the existence of sex-related 
behavioral differences in online math classes in China; 2) explore 
the connections of students’ in-class behaviors and engagement in 
classroom activities; 3) compare both visually and statistically the 
structure of connections of mathematics classroom behaviors for 
students of different sex. 


2. METHODS 
2.1 Participants 


In order to control for the influence of class content and teaching
style on students’ classroom behavior and engagement, 12 math
classes on the same topic, taught by 2 teachers (1 male, 1 female),
were randomly drawn from all Level 6 mathematical thinking
classes of Spark EdTech. Each class consisted of 5 or 6 students
of different sex, making up a sample of 64 students (32 male, 32
female). Teachers of both sexes were selected in order to
control for the potential interaction effect of students’ sex and
teacher’s sex on students’ behavior. The average class duration for
Teacher 1 was 49.13 (SD = 2.49) minutes, and the average class
duration for Teacher 2 was 45.43 (SD = 0.84) minutes. Since
students took placement exams that determined their
math ability before being assigned to different levels of classes, it
can be assumed that students at the same level had similar
mathematics ability. The Level 6 class was primarily designed for
third-grade students around eight years old. In this sample,
students had an average age of 7.46 (SD = 1.63) years.


2.2 Procedure 

Class monitoring videos were viewed and coded by an
experienced coder based on the definitions of different types of
classroom behaviors proposed in Hart’s study (1989). Each
student’s classroom behavior and interactions with the teacher were
viewed and coded individually, following an event sampling or
episodic approach, which has been widely used in the field of
developmental psychology [16][18]. Microscopic categories of
behaviors were coded in order to capture the initiations and
responses of the students and the teacher. When a behavior lasted
for more than 20 seconds, it was coded again to
indicate its continuity. All the codes
were organized in the format of an adjacency matrix required by
Epistemic Network Analysis (ENA); see Table 1 for
an example. ENA was then applied
to the data using the ENA Web Tool (Version 1.7.0) [20].

Student ID | Sex | Teacher ID | social activity | call out | open volunteer | off task | disruptive behavior
001        | 1   | teacher 0  | 0               | 1        | 0              | 0        | 0
001        | 1   | teacher 0  | 0               | 0        | 0              | 1        | 1
002        | 0   | teacher 1  | 1               | 0        | 0              | 0        | 0
002        | 0   | teacher 1  | 0               | 0        | 1              | 0        | 0


Table 1. Illustration of Coding Sheet


2.3 Measures 

Several subcategories of students’ public interaction with teachers 
were identified in Hart’s study [13]. Two sub-categories of public 
teacher-student interaction which were found to be significantly 
correlated with students’ sex were: open volunteer interaction, 
and call-out interaction. Meanwhile, since another category of
public teacher-child interaction, direct-no volunteer interaction, is
common in online classes, we also included it as a
type of behavior to be examined in our study.


An open volunteer interaction was coded when the student 
indicated in some way other than by calling out a desire to 
respond to a teacher question or to initiate a public interaction 
with the teacher. A call-out interaction was coded when a target 
student called out the answer to a teacher question before the 
teacher gave permission for that student to respond. A direct-no 
volunteer interaction was coded when the teacher asked a question 
and requested that a target student answer who had not indicated 
in some way a desire to answer the question. The students usually 
indicated a desire to respond by raising a hand or calling out. 


In addition, we combined two types of behaviors that indicated a 
lack of engagement in mathematics activities in class which were 
found to be differently correlated with mathematics achievements 
for boys and girls in a study conducted by Peterson and Fennema 
[24]. Off-task behaviors were defined in this study as behaviors 
that are irrelevant to class activities. Social activities were defined 
in this study as the engagement in an activity in which the content 
of the activity involved a social topic, socializing or discussion of 
personal information or problems. Another category of behaviors,
disruptive behaviors, in which boys and girls differ drastically in
classroom settings, was also included in our measures [4][6].
Disruptive behaviors were coded in this study when a student was 
engaged in behaviors that were likely to substantially or 
repeatedly interfere with the conduct and discipline of the class. 


3. DATA ANALYSIS AND RESULTS 
3.1 Definition of ENA Elements 


In the present study, the units of analysis were defined as all lines
of data associated with a single value of student sex, subsetted by
student ID. For instance, one unit included all the lines that
represented the occurrence of each category of behaviors for one
single student.


In our ENA model, the following codes, which corresponded to 
the aforementioned five categories of classroom behaviors, were 
included: social_activity, direct_no_volunteer, off_task_behavior, 
open_volunteer and disruptive_behavior. 


Conversations were defined as all lines of data related to a single 
value of Teacher Name. For instance, one conversation consisted 
of all the lines associated with one of the two teachers. 


3.2 Procedure of ENA 


The ENA algorithm adopts a moving stanza window to generate a 
network model for each line in the data, showing how codes in
the current line are connected to codes that appear within the 
recent temporal context [30], defined as 4 lines (each line plus the 
3 previous lines) within a given conversation. The corresponding 
networks are aggregated for all lines for each unit of analysis in 
the model. In this model, we aggregated the resulting networks 
using a binary summation where the networks for a given line 
reflect the presence or absence of the co-occurrence of each pair 
of codes.


The networks for all units of analysis in the present model were
normalized before being subjected to a dimensional reduction, in 
order to account for the different amounts of coded lines of 
different units of analysis in the data. In terms of dimensional 
reduction, a singular value decomposition was utilized, which 
produces orthogonal dimensions that maximize the variance 
explained by each dimension [2][28][31]. 
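
The accumulation and normalization steps described in this section can be sketched in a few lines. This is a simplified illustration and not the ENA Web Tool's implementation: the function names and dictionary keys are hypothetical, a connection is counted when one code appears in the current line and the other appears anywhere in the 4-line window (which includes the current line), and normalization is assumed to mean scaling each unit's vector of connection counts to unit length, which the text does not specify. The dimensional reduction step is omitted.

```python
import math
from collections import defaultdict
from itertools import combinations
from typing import Dict, List, Tuple

Pair = Tuple[str, str]


def accumulate_networks(lines: List[dict], codes: List[str],
                        window: int = 4) -> Dict[str, Dict[Pair, int]]:
    """lines: coded rows in temporal order, each with 'unit',
    'conversation', and 0/1 entries for the codes (missing = 0)."""
    networks = defaultdict(lambda: defaultdict(int))
    history = defaultdict(list)  # conversation -> lines seen so far
    for line in lines:
        conv = history[line["conversation"]]
        conv.append(line)
        stanza = conv[-window:]  # current line plus up to 3 previous lines
        current = {c for c in codes if line.get(c, 0)}
        in_window = {c for c in codes
                     if any(prev.get(c, 0) for prev in stanza)}
        for a, b in combinations(codes, 2):
            # Binary summation: each pair adds at most 1 per line.
            if (a in current and b in in_window) or \
               (b in current and a in in_window):
                networks[line["unit"]][(a, b)] += 1
    return {unit: dict(net) for unit, net in networks.items()}


def normalize(net: Dict[Pair, int]) -> Dict[Pair, float]:
    """Scale a unit's connection counts to unit length (assumed scheme)."""
    length = math.sqrt(sum(w * w for w in net.values()))
    if length == 0:
        return dict(net)
    return {pair: w / length for pair, w in net.items()}


codes = ["open_volunteer", "off_task_behavior", "disruptive_behavior"]
lines = [
    {"unit": "001", "conversation": "teacher 0", "open_volunteer": 1},
    {"unit": "001", "conversation": "teacher 0",
     "off_task_behavior": 1, "disruptive_behavior": 1},
]
nets = accumulate_networks(lines, codes)
unit_net = normalize(nets["001"])
```

Scaling to unit length means that two students with the same pattern of connections but different numbers of coded lines end up with the same normalized network, which is the purpose of normalization stated above.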


3.3 ENA Model 


Networks were visualized using network graphs where nodes 
correspond to the codes, and edges reflect the relative frequency 
of co-occurrence, or strength of connection, between two codes. 
The result is two coordinated representations for each unit of 
analysis: 1) a plotted point graph, which represents the location of 
that unit’s network in the low-dimensional projected space, and 2) 
a weighted network graph. The positions of the nodes in the 
network graph are fixed and determined by an optimization 
routine minimizing the difference between the plotted points and 
their corresponding network centroids. Because of this co- 
registration of network graphs and projected space, the positions 
of the network graph nodes—and the connections they define— 
can be used to interpret the dimensions of the projected space and 
explain the positions of plotted points in the space. Our model had 
co-registration correlations of 0.92 (Pearson) and 0.92 (Spearman) 
for the first dimension and co-registration correlations of 0.97 
(Pearson) and 0.97 (Spearman) for the second. These measures 
indicate that there is a strong goodness of fit between the 
visualization and the original model according to the rule-of- 
thumb by Shaffer, Collier & Ruis [28]. 


Mean networks for boys’ and girls’ behaviors in online math 
classes were constructed by averaging the connection weights 
across individual networks, and were compared using network 
difference graphs. These graphs are calculated by subtracting the 
weight of each connection in one network from the corresponding 
connections in another (See Figure 3 for a comparison ENA 
Model for student behaviors in online math classes by sex). 
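
The construction of mean networks and a network difference graph amounts to averaging and then subtracting connection weights, as in the sketch below. The function names, code abbreviations, and weights are hypothetical and chosen only to show the arithmetic; they are not values from this study.

```python
from typing import Dict, List, Tuple

Pair = Tuple[str, str]


def mean_network(networks: List[Dict[Pair, float]]) -> Dict[Pair, float]:
    """Average each connection weight across individual networks
    (a missing connection counts as weight 0)."""
    pairs = {p for net in networks for p in net}
    return {p: sum(net.get(p, 0.0) for net in networks) / len(networks)
            for p in pairs}


def difference_network(net_a: Dict[Pair, float],
                       net_b: Dict[Pair, float]) -> Dict[Pair, float]:
    """Subtract net_b's weights from net_a's, connection by connection."""
    pairs = set(net_a) | set(net_b)
    return {p: net_a.get(p, 0.0) - net_b.get(p, 0.0) for p in pairs}


# Hypothetical normalized networks for two boys and one girl.
boys = [{("disruptive", "call_out"): 0.4},
        {("disruptive", "call_out"): 0.2, ("social", "open_volunteer"): 0.2}]
girls = [{("social", "open_volunteer"): 0.5}]

diff = difference_network(mean_network(boys), mean_network(girls))
# Positive weights: connection stronger for boys; negative: stronger for girls.
```

In a difference graph, the sign of each edge indicates which group has the stronger connection and the magnitude indicates by how much.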


According to Figure 3, the network centroids for boys and for 
girls differ along the x-axis. There is a relatively stronger 
connection between disruptive behavior and call out behaviors in 
online math classes for boys compared with girls. In addition, 
there is a slightly stronger connection between off-task behaviors 
and disruptive behaviors, and between disruptive behaviors and 
direct-no volunteer interactions for boys, compared with girls. 


4. CONCLUSIONS AND IMPLICATIONS 


The present study examined the differences in the structure of 
connections of classroom behaviors for boys and girls in online 
mathematics classes in China using Epistemic Network Analysis 
(ENA). The results indicated a significant difference along the x- 
axis of the model, suggesting that in our sample, girls were more 
engaged in social activities and open volunteer interactions with 
the teacher, while boys exhibited more disruptive behaviors 
during the class. The differences between boys and girls in off-task 
behavior and call-out interactions were not significant. Such a finding 
is largely consistent with the study conducted by Hart [13], except 
that we did not find a significant effect of sex on call-out 
interactions. Such an inconsistency might be due to the unique 
characteristics of online classroom settings, where social presence 
tends to have a stronger influence on girls’ level of satisfaction 
in class, and girls are thus as active as, or even more active than, 
boys in online discussion [32][23]. 
Figure 3. Comparison ENA Model for Student Behavior in 
Online Math Classes by Sex 


*Note: Blue represents boys, red represents girls 


Thus, in order to elevate their level of satisfaction in online math 
classes, girls might be more motivated to engage in interactions 
with their teachers by calling out their answers to teachers’ 
questions than they would normally do in traditional classrooms. 
Another possible explanation for this phenomenon is the reward 
system designed by Spark EdTech, a leading online education 
technology company in China, for its online mathematical 
thinking classes, where students could receive “little stars” from 
their teachers by answering questions and participating in classes. 
Such a reward system could encourage students to participate in 
classes, and to become the first to respond to the questions by 
calling out the answers. 


Another conclusion we could reach according to the comparison 
graph is that there is a relatively stronger connection between 
disruptive behavior and call-out behavior for boys compared with 
girls. Such a connection indicates that while boys call out their 
answers to teachers’ questions more often, they are more likely to 
also become disruptive in class by interrupting the teacher’s lecture or 
interactions between other students and the teacher. This suggests 
that boys are more expressive and active in classes, yet such 
behaviors could become disruptive if they cannot regulate their 
level of activity in class or if they disregard class discipline. 


In addition, there is a slightly stronger connection between off- 
task behavior and disruptive behaviors, and between disruptive 
behaviors and direct-no volunteer interactions. Such connections 
indicate that boys are more likely to violate discipline in class and 
be disengaged in class activities at the same time, and teachers are 
more likely to call on disruptive boys to answer questions, 
compared with girls. Such findings are consistent with the 
findings of research conducted in face-to-face classroom settings 
that teachers tended to attend more to boys because they were more 
likely to exhibit disruptive behaviors in class [8][6]. All the 
findings mentioned above point to the need to recognize the 
differences in boys’ and girls’ behavioral patterns in online math 
classes, and to develop more gender-sensitive guidance for online 
teaching, as several researchers have recently pointed out [3]. 


One major limitation of this study is the relatively small sample 
size. However, since each student’s behaviors during a full-length 
class were coded, we still obtained statistically significant results. 
At the same time, due to time constraints, all the videos were 
coded by only one coder, which might lead to biased data. The 
fact that the coder had more than 1000 hours of video coding 
experience and had maintained an inter-rater reliability of more 
than 0.9 might, to some extent, mitigate this limitation. 
Another limitation of the study arises from the nature of class 
monitoring videos, which might not always fully capture 
students’ behaviors in class. On rare occasions, when students 
wrote in a notebook or on scratch paper, the coder was unable to 
distinguish whether they were taking notes or engaging in off-task 
behaviors such as sketching. These ambiguous behaviors were not 
coded, which might lead to a slight underrepresentation of 
students’ off-task behaviors. 


Overall, the present study is mostly consistent with existing 
research on sex differences in students’ behaviors in traditional 
in-person classrooms. The novel finding of no significant 
sex-related difference in call-out behaviors could be attributed to the 
uniqueness of online class settings and the reward system adopted 
by Spark EdTech. To our knowledge, this study is the first of its 
kind to employ Epistemic Network Analysis (ENA) to examine 
the structure of connections of students’ classroom behaviors in 
online math classes in China and conduct a comparison of such 
connection patterns between boys and girls. Thus, the findings of 
this study could serve as the first step to examine the relationship 
between different behavioral patterns in online math classes and 
math achievement, and to develop gender-sensitive guidance for 
online teaching. 
ABSTRACT 


We introduce a new readability tutoring system, Read & 
Improve, a freely available online resource aimed at sup- 
porting learners of English and English Language Teaching 
(ELT) professionals by improving English learners’ reading 
proficiency. Using a combination of machine learning ap- 
proaches and natural language processing techniques, Read 
& Improve detects learning needs of every student and makes 
sure no learner is left behind by identifying reading content 
at an appropriate level of readability and helping learners 
acquire new words through accessible dictionary definitions 
and content exploration functionality.¹ 


Keywords 
Distance Learning, Student Assessment, Natural Language 
Processing 


1. INTRODUCTION 


Reading is one of the fundamental language skills. Develop- 
ing this skill is an essential part of language acquisition, both 
for native speakers and second language learners [9] [23]. At 
the same time, developing reading ability takes a considerable 
amount of time and, like any learning process, it is 
interrupted if readers lose motivation [25]. A limited range of 
engaging reading content and reading material at the wrong 
level of readability are among the major contributors to 
decreased motivation in readers [17]. In addition to language 
learners themselves, English Language Teaching (ELT) pro- 
fessionals face similar problems, as finding engaging reading 
content at the right level of readability is a challenging and 
a time-consuming task. In this paper, we present Read and 
Improve (R&I), a freely available, open-access educational 
system that is aimed at both language learners and teachers.² 


To ensure that the reading content provided to a learner is at 
an appropriate level of readability, R&I uses machine learning 
methods (see Section 2.3) to automatically label texts 
with readability levels corresponding to the Common European 
Framework of Reference for Languages (CEFR) [6]. 
The CEFR is an international standard that describes language 
ability on a six-point scale from A1 for beginners 
up to C2 for an advanced level of language proficiency. 


To ensure that the reading content presented to a learner is 
engaging, R&I employs news articles that are sourced from 
news websites in real time. To source news content, R&I 
monitors both RSS feeds from news websites and the publicly 
available Common Crawl News (CC-NEWS) Dataset.³ A 
fully automated Indexing Pipeline (RIIP, herein) processes 
news articles and automatically labels the readability of each 
article’s text. News articles are generally available for learners 
on R&I within 10 minutes of publication on an RSS 
news feed, and within 3-6 hours of publication if 
sourced from CC-NEWS. Compared to other domains, news 
articles have the additional benefit of being generally free of 
grammatical and spelling errors, which allows us to achieve 
more reliable linguistic analysis and to provide learners with 
high quality reading content. R&I’s user interface (UI) en- 
ables learners to not only read the latest news articles but 
also to perform keyword search to find articles on topics that 
they are interested in at their desired CEFR level(s). 


A number of applications for various groups of readers, in- 
cluding native and non-native speakers, readers with cogni- 
tive impairments, and children, to name just a few, have 
been developed in recent years. In contrast to the pre- 
vious work [17], our platform is aimed specifically 
at developing reading ability in non-native speakers of En- 
glish. Our approach bears similarities to the Read-X and 
REAP systems, while also being actively developed and 
supported as an open-access educational platform available 
online. R&I is markedly different from other available appli- 
cations, as in addition to providing text search functionality 
(as in [5]) and vocabulary acquisition help (as in 4), it sup- 
ports comprehension testing and personalisation. 


² https://readandimprove.englishlanguageitutoring. 


³ http://commoncrawl.org/2016/10/ 
The rest of this paper is structured as follows: Section 2 
provides an overview of the system’s architecture, Section 3 
describes the current UI functionality, and finally Section 4 
concludes the paper and describes future work. 


2. SYSTEM ARCHITECTURE 

Figure 1 illustrates the system architecture of R&I. We do 
not describe the full details of system components here, as 
this is outside the scope of the paper. Instead, we provide a 
general overview of the components and their use of natural 
language processing (NLP). 


2.1 API 

The API connects to an information retrieval index (‘IR En- 
gine’), a database (‘DB Engine’), and several APIs to pro- 
vide the data and search functionality required by the UI. 
The IR Engine employs Elasticsearch⁴ (ES) and includes 
several distinct indices that facilitate search over news arti- 
cles and other data. 


2.2 RIIP 

RIIP is responsible for processing articles into the ES article 
index. In order to prevent duplicate processing, the pipeline 
modules first check whether the output file(s) already exist 
in the ‘Data Lake’, a single store of all data processed. The 
API monitors the set of URLs listed in RSS feed(s) and the 
set of CC-NEWS files for new items, and if found, these are 
sent to RIIP for processing. Therefore, ingestion of new 
articles through the system requires no manual effort, and 
up-to-date news content is continuously processed and made 
available to learners via the UI. 
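
The duplicate-processing check can be illustrated with a minimal sketch. It assumes, purely for illustration, that the Data Lake is a directory tree and that each module's output file is named after a hash of the article URL; the paper does not specify the storage layout, and `output_path` and `run_module` are hypothetical names.

```python
import hashlib
import tempfile
from pathlib import Path


def output_path(data_lake: Path, url: str, module: str) -> Path:
    """Derive a stable output file name from the article URL (assumed scheme)."""
    digest = hashlib.sha256(url.encode("utf-8")).hexdigest()
    return data_lake / module / f"{digest}.json"


def run_module(data_lake: Path, url: str, module: str, process) -> bool:
    """Run `process` only if its output is not already in the data lake.

    Returns True when work was done and False when the step was skipped.
    """
    target = output_path(data_lake, url, module)
    if target.exists():
        return False  # already processed: skip to avoid duplicate work
    target.parent.mkdir(parents=True, exist_ok=True)
    target.write_text(process(url), encoding="utf-8")
    return True


with tempfile.TemporaryDirectory() as tmp:
    lake = Path(tmp)
    url = "https://example.com/news/1"  # placeholder URL
    first = run_module(lake, url, "extractor", lambda u: '{"text": "..."}')
    second = run_module(lake, url, "extractor", lambda u: '{"text": "..."}')
```

Here the first call writes the output file and the second call finds it and skips the work, which is the behavior the existence check is meant to guarantee.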


RIIP modules include: the Extractor, which extracts text and 
other information from news articles (i.e. HTML); RASP,⁵ 
which parses the text to provide linguistic information; 
the LevelMarker module, which labels the text for readability 
(on the CEFR scale); and finally the ES module, which indexes 
text and other linguistic information. 


2.3 LevelMarker Module 

For RIIP’s LevelMarker module we follow Briscoe et al. [3], 
and define the task of learning readability levels as a discrim- 
inative preference ranking task. We employ their machine 
learning (ML) software and use linguistic features outlined 
by Xia et al. that represent a text’s readability. 


2.3.1 Data 

We have crawled three publicly available news websites to 
create datasets: Breaking News English (BNE)⁶ (2771 
articles), News in Levels (NIL)⁷ (6373 articles) and Tween 
Tribune (TT)⁸ (7768 articles). These websites have news 
articles labelled in terms of their readability; however, each 
website’s readability levels are based on different scales, as 
shown in Table 1.⁹ Each of these datasets is considered to 


⁴ https://www.elastic.co/products/elasticsearch 
⁵ https://ilexir.co.uk/rasp/index.html 
⁶ https://breakingnewsenglish.com 
⁷ https://www.newsinlevels.com 
⁸ https://www.tweentribune.com 
⁹ BNE to CEFR level map provided by the website: 
https://breakingnewsenglish.com/news_levels.html 

Table 1: Dataset levels and distributions. 


(a) BNE 
BNE level | CEFR level | Count 
0 | A2 | 386 
1 | A2 | 386 
2 | A2 | 386 
3 | A2-B1 | 418 
4 | B1-B2 | 392 
5 | B2 | 392 
6 | C1-C2 | 412 

(b) NIL 
NIL level | Count 
1 | 2126 
2 | 2124 
3 | 2123 

(c) CER 
Exam | CEFR level | Count 
KET | A2 | 64 
PET | B1 | 60 
FCE | B2 | 71 
CAE | C1 | 67 
CPE | C2 | 69 

(d) TT 
TT level | Count 
Grade K-4 (0) | 1965 
Grade 5-6 (1) | 2029 
Grade 7-8 (2) | 1771 
Grade 9-12 (3) | 2003 


Table 2: 5-fold cross-validation tests for each dataset. 


Source | Pearson’s | Spearman’s | Kendall’s 
BNE | 0.8338 | 0.8368 | 0.6873 
NIL | 0.9217 | 0.9164 | 0.7880 
TT | 0.9055 | 0.9250 | 0.8071 
CER | 0.9155 | 0.9185 | 0.8015 


be parallel as they contain multiple versions of the same ar- 
ticles simplified across different levels. While the BNE and 
NIL datasets are designed for L2 English learners, the TT is 
designed to help L1 learners (early and school-aged readers). 


2.3.2 Evaluation 

RIIP employs a model trained on the full BNE dataset as 
this dataset can be reliably mapped to the CEFR scale 
(Table 1). Based on this mapping we determined the ranges of 
ML scores that corresponded to each CEFR level (using the 
observed score range from the training data). We tested our model 
on the Cambridge English Readability (CER) dataset,¹⁰ a 
publicly available dataset of 331 texts spanning CEFR levels 
A2 to C2 [18]. On this test set, our model achieves 
correlation coefficients of 0.83 (Pearson’s), 0.85 (Spearman’s) 
and 0.71 (Kendall’s). We also ran 5-fold cross-validation for each 
dataset¹¹ and present the results in Table 2. 
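
As a rough illustration of how observed score ranges can be turned into level labels, the sketch below computes the minimum and maximum model score seen for each CEFR level in some training data and assigns a new score to the closest range. The scores are invented, and the nearest-range rule for scores that fall outside every observed range is an assumption; the paper does not say how gaps or overlaps between ranges are handled.

```python
from collections import defaultdict


def level_ranges(training):
    """Compute the observed (min, max) model score for each CEFR level."""
    scores = defaultdict(list)
    for score, level in training:
        scores[level].append(score)
    return {level: (min(v), max(v)) for level, v in scores.items()}


def assign_level(score, ranges):
    """Return the level whose observed score range is closest to `score`."""
    def distance(bounds):
        low, high = bounds
        if low <= score <= high:
            return 0.0  # score lies inside this level's observed range
        return min(abs(score - low), abs(score - high))

    return min(ranges, key=lambda level: distance(ranges[level]))


# Hypothetical (model score, CEFR level) pairs standing in for training data.
training = [
    (-2.1, "A2"), (-1.4, "A2"),
    (-0.6, "B1"), (0.1, "B1"),
    (0.9, "B2"), (1.5, "B2"),
]
ranges = level_ranges(training)
label = assign_level(0.3, ranges)
```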


2.4 ES index 

In addition to the article index, we create ‘WordInfo’ and ‘CALD’ 
indexes. The CALD indexing system processes definitions 
from the Cambridge Advanced Learner’s Dictionary (CALD) 
to populate the CALD index. The LexDoop system employs 
Hadoop¹² to process the Data Lake files (currently 
around 1 million articles) to produce raw frequency counts of 
linguistic properties for every word lemma.¹³ Following this 
step, these lemma statistics are collated and added to the 
‘WordInfo’ index. 
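
The counting step can be pictured as a small map-and-reduce computation. The sketch below is a single-process stand-in, not the Hadoop job itself: it assumes each article has already been reduced to (lemma, part-of-speech) pairs, and the function names and sample tokens are hypothetical.

```python
from collections import Counter, defaultdict


def map_article(tagged_tokens):
    """Map step: count (lemma, PoS) pairs within one article."""
    return Counter(tagged_tokens)


def reduce_counts(partials):
    """Reduce step: merge per-article counts into per-lemma PoS counts."""
    per_lemma = defaultdict(Counter)
    for partial in partials:
        for (lemma, pos), n in partial.items():
            per_lemma[lemma][pos] += n
    return per_lemma


# Two toy "articles", each a list of (lemma, PoS) pairs.
articles = [
    [("present", "verb"), ("news", "noun"), ("present", "verb")],
    [("present", "noun"), ("present", "adjective"), ("news", "noun")],
]
word_info = reduce_counts(map_article(a) for a in articles)
```

Per-lemma counts of this kind are what a relative-frequency display of part-of-speech categories for a lemma would be computed from.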


¹⁰ https://ilexir.co.uk/datasets/index.html 
¹¹ We split the data randomly into training and test sets, 
ensuring an even distribution of class labels. 
¹² Apache Hadoop: https://hadoop.apache.org/ 
¹³ LexDoop is also used to process CC-NEWS files in parallel. 
[Figure 1 diagram: Data Lake; LexDoop system (generates frequency counts of 
linguistic features per lemma over Data Lake files, and extracts articles from 
new CC-NEWS files); RIIP (RIIP files are stored in the Data Lake for R&D and to 
prevent duplicate processing); API Data Manager (tracks and sends new data to 
RIIP, directly or via LexDoop); WordInfo indexing system; CALD indexing system; 
DB Engine; ELiT APIs.] 

Figure 1: Overview of R&I architecture. R&I is hosted within, and relies upon, cloud computing services from Amazon Web 
Services (AWS). Components that use cloud AWS services are shown with grey backgrounds. 


2.5 Sanitisation 

To make sure the content provided on the platform is acceptable 
for a wide range of readers across various ages and cultures, 
we apply a content “sanitisation” strategy, whereby 
we automatically filter out news articles that contain words 
pertaining to topics that might be considered offensive 
in some cultures or inappropriate for younger readers. The 
list of around 1600 such taboo words was curated using 
lists of taboo words from social media. Sanitisation is run 
within RIIP and the API and, in case the sanitisation sys- 
tem makes an error, the UI enables admin users to mark 
articles as ‘unsafe’ (or vice versa). 
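
A word-list filter with an admin override might look like the following minimal sketch. The tokenisation, the single placeholder taboo word and the override dictionary are all assumptions made for illustration; the actual list of around 1600 words and the matching rules are not given in the paper.

```python
import re


def tokens(text):
    """Lower-case the text and split it into a set of word tokens."""
    return set(re.findall(r"[a-z']+", text.lower()))


def is_safe(article_id, text, taboo_words, admin_overrides):
    """An admin decision wins; otherwise any taboo-word match marks the
    article as unsafe."""
    if article_id in admin_overrides:
        return admin_overrides[article_id]
    return tokens(text).isdisjoint(taboo_words)


taboo = {"casino"}  # hypothetical placeholder, not the real word list
blocked = is_safe("a1", "The new casino opens today", taboo, {})
allowed = is_safe("a2", "The new museum opens today", taboo, {})
# An admin has reviewed article a1 and marked it as safe.
overridden = is_safe("a1", "The new casino opens today", taboo, {"a1": True})
```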


3. READING ON THE PLATFORM 


We define the R&I functionality in terms of four major aspects, 
which cover the tutoring system’s ability to provide 
learners and teachers with engaging reading content at the 
appropriate level of readability (§3.1); help learners develop 
their vocabulary in English (§3.2); run comprehension tests 
(§3.3); and allow learners to revisit texts they read, words 
they clicked on and tests they submitted (§3.4). 


3.1 Finding engaging reading material at an appropriate level 

The first step for learners accessing R&I is to define their 
language proficiency level. Learners can log in to R&I using 
their account credentials from Write & Improve,¹⁴ a freely 
available system linked to the reading platform, that is able 
to assess and provide feedback on a learner’s writing profi- 
ciency. Once logged in, R&I defaults reading proficiency to 
current writing proficiency, but a learner can change their 
CEFR reading level. 


Figure 2 contains a screenshot of the search page’s results 
showing the latest news articles at the learner’s CEFR level 
(currently B1). The search page provides learners with 
snippet(s) of the article text, and they can click on any of the 
titles listed on this page in order to load the article view page 
where they can read the article itself. In addition, search by 
keywords is enabled on R&I to allow learners to find articles 
not only at their level of readability, but also on the topics 
of their interest. 


¹⁴ https://writeandimprove.com/ 
¹⁵ R&I employs Write & Improve APIs developed by ELiT: 
https://englishlanguageitutoring.com/ 


[Figure 2 screenshot: search page with a keyword search box ("Search news 
articles that suit your level of English"), a CEFR level filter (All, A2, B1, 
B2, C1, C2, C2+), and a list of the latest news stories found at CEFR reading 
level B1, each shown with its title, source, date and a text snippet.] 


Figure 2: Screenshot: search results. 


3.2 Developing one’s vocabulary 

Vocabulary is so important in language learning that 
language learning itself is sometimes equated 
with knowing vocabulary [12]. To help learners 
[Figure 3 screenshots: (a) Part of speech (PoS) statistics for the clicked word 
‘presents’, a form of ‘present’, which the automated system labels as probably 
a verb; (b) Word cloud of co-occurring lemmas; (c) CALD word sense ‘INTRODUCE’ 
for the verb present, with example sentences.] 


Figure 3: Screenshots of the sections of the ‘Word Information’ and ‘English Dictionary’ panels on the UI. Here, the user 
clicked on the word presents, used as a verb. The pie chart in (a) illustrates the relative frequency of all PoS categories for the 
lemma present across all articles. The word clouds in (b) contain the 50 lemmas most frequently co-occurring with present as 
a verb in grammatical relations (where font size reflects relative frequency), and (c) shows the dictionary definition. 


[Figure 4 screenshot: Comprehension Test panel showing the Writing Level score, 
the Summary Relevance star rating, a Try Again button, and a graph of the 
learner’s last 20 summary checks (writing score and relevance).] 


Figure 4: Screenshot: Comprehension Test panel. Learners 
are able to click on the graph to view previous summaries, 
which they can refine and re-submit. 


with vocabulary acquisition and development, R&I allows 
them to select any words they do not recognise or wish 
to learn more about within the article view page. When 
a learner clicks on an unknown word, R&I’s UI launches 
two side panels for Word Information and English Dictionary 
(shown in Figure 3) to display information available 
for the word in the ‘WordInfo’ and ‘CALD’ indexes, 
respectively (§2.4). 


Several searches can be performed by clicking on links within 
the Word Information panel and words within the co-occurrence 
word cloud. These links to search results shown in R&I’s 
search page enable learners to perform advanced, linguistically 
motivated searches intuitively and learn how vocabulary 
is used in context. 


3.3 Running comprehension tests 

R&I allows users to submit a summary of the article as a 
comprehension test in the Comprehension Test panel on the 
article view page (Figure 4). R&I automatically scores these 
summaries and returns two scores. The first is a writing score, 
determined by a mature feature-based automated essay scoring 
(AES) model [20] and graded on the CEFR scale via the Write & 
Improve API. The second is a relevance score, based on the 
maximum sentence-level cosine similarity value, which is then 
converted to a score in the range 0-5 using the lexical overlap 
between the article and the summary [7]. The relevance score 
shows whether the learner captured the main salient topics in 
the article. 
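
To make the similarity measure concrete, the sketch below computes the highest cosine similarity between a summary and any single sentence of an article, using plain bag-of-words counts. Both the text representation and the choice to compare the whole summary against each article sentence are assumptions; the paper does not specify them, and the conversion to a 0-5 score is not reproduced here.

```python
import math
import re
from collections import Counter


def bag_of_words(text):
    """Count lower-cased word tokens in the text."""
    return Counter(re.findall(r"[a-z]+", text.lower()))


def cosine(a, b):
    """Cosine similarity between two word-count vectors."""
    dot = sum(a[w] * b[w] for w in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(
        sum(v * v for v in b.values())
    )
    return dot / norm if norm else 0.0


def max_sentence_similarity(article, summary):
    """Highest cosine similarity between the summary and any article sentence."""
    sentences = [s for s in re.split(r"(?<=[.!?])\s+", article) if s.strip()]
    summary_vec = bag_of_words(summary)
    return max(
        (cosine(bag_of_words(s), summary_vec) for s in sentences), default=0.0
    )


article = "The dog bit an agent. The president sent the dogs home."
on_topic = max_sentence_similarity(article, "The president sent the dogs home.")
off_topic = max_sentence_similarity(article, "Balloons everywhere")
```

A summary that repeats an article sentence scores close to 1, while a summary sharing no words with the article scores 0.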


3.4 Accessing reading history 

The full history of a learner’s interaction with the R&I platform, 
including texts, vocabulary items and submitted summaries, 
is available to the learner on the personal My Reading pages. 


4. CONCLUSIONS AND FUTURE WORK 


In this paper, we presented Read & Improve, a freely avail- 
able, open-access reading tutoring system that is aimed at 
language learners and teachers. Currently, it is a prototype 
system, and hence most of its components will benefit from 
further research on the platform. For instance, we are plan- 
ning to improve our Indexing Pipeline using quality human 
annotated training data and user analytics that we are col- 
lecting via the R&I platform. 


R&I records learners’ actions on the UI, which, in turn, will 
provide valuable data for use in further research and development. 
For example, the comprehension test data collected by the 
platform has been employed to develop a new automated 
comprehension test (summary assessment) marking system 
suitable for use in R&I. Further, each learner’s data may be 
useful in directly improving their learning experiences. For 
example, analysis of an individual learner’s history could be 
used to tailor custom content and testing. This symbiotic 
relationship, developed in an ecosystem of freely available 
educational systems benefiting from cutting-edge research, will 
ultimately produce a state-of-the-art ELT resource. 
ABSTRACT 


We present Generate, an AI-human hybrid system to help education 
content creators interactively generate assessment content in an 
efficient and scalable manner. Our system integrates advanced 
natural language generation (NLG) approaches with subject 
matter expertise of assessment developers to efficiently generate a 
large number of highly customized and valid assessment items. 
We utilize the powerful Transformer architecture which is capable 
of leveraging substantive pretraining on several generic text 
corpora in order to produce sophisticated, context-dependent text 
as the basis for item creation. We present early results from 
experimental studies demonstrating the efficiency of our 
approach. 


Keywords 


NLP, Transformer Networks, Domain Knowledge Modeling 


1. INTRODUCTION 


The COVID-19 pandemic has accelerated the push towards 
remote delivery of formative and summative assessments and with 
it have arisen heightened security concerns of item pool exposure. 
Moreover, there is growing adoption of highly personalized and 
adaptive learning and assessment experiences [25] that require 
regularly replenished assessment item pools. These twin factors 
among others are placing ever growing demands on traditional 
processes of creating assessment content that are based in large 
part on manual labor, highly dependent on subject matter 
expertise and challenging to scale up. Furthermore, the manual 
generation of content and assessment items heightens the risk of 
incomplete, duplicate and/or redundant content. We believe 
advances in AI, particularly natural language generation (NLG) 
can help mitigate this bottleneck and open new possibilities for 
personalized learning experiences. 


Classical natural language processing (NLP) work in this area 
dates back to John Wolfe’s seminal work [17] that demonstrated 
the feasibility of automatically generating natural language 
questions. In recent years there has been a revival in interest, 
spurred in part by advances in dialogue systems such as Amazon 
Alexa. While traditional approaches to NLP-based assessment 
item generation involve a pipeline of modules such as content 
selection, template design and item realization [18], these have 
been criticized for being rigid and too reliant on arbitrary heuristic 
rules and having limited novelty and psychometric variability 
[19]. There is growing interest in developing end-to-end deep 
neural network based approaches that do not require customized, 
hand crafted rules and are better equipped to generalize across 
content areas [20]. A key element of such approaches is 
leveraging large text content databases and well annotated 
datasets such as BookCorpus [21], SQuAD [22] and Wikipedia. 
For further details on related work, readers are directed to the 
survey of state of the art by Kurdi et al. [26]. 


In this paper we present Generate, an NLG system that efficiently 
creates lexically and semantically appropriate item content in 
real time, dramatically speeding up assessment item authoring, 
freeing item writers and subject matter experts (SMEs) from 
unnecessary work, and enabling personalized learning and 
assessment experiences. At the core of Generate’s content 
generation capabilities is the Transformer architecture [4] that 
leverages substantive pretraining on several generic text corpora 
to produce sophisticated, context-dependent text as the basis for 
item creation. From a small number of representative items as 
training samples to learn lexical and semantic structure, Generate 
is able to produce a wide variety of draft item content. Users 
utilize an intuitive graphical interface that allows selection of item 
stems, keys (correct answers) and distractors from a number of 
generated options. 


In the following sections we provide technical details of our 
system starting with a brief review of Transformers, system 
implementation and architecture. Following that we present 
analysis from experimental studies and share thoughts on future 
directions. 


2. TECHNICAL APPROACH AND SYSTEM DETAILS 


2.1 Transformers and NLG 


In order to capture the subtlety and breadth of lexical patterns 
necessary to faithfully generate novel assessment content, we 
opted to base our NLG engine on the Transformer architecture, 
which is capable of leveraging substantive pretraining on several 
generic text corpora in order to produce highly sophisticated, 
context-dependent token embeddings for a variety of NLP tasks. 
First proposed in 2017 by Vaswani et al. [4], the Transformer 
architecture has since revolutionized NLP research, with 
state-of-the-art performance on benchmarks like the broad GLUE 
suite of NLP tasks [5] being set by Transformer-based models 
such as Google’s BERT [6] and OpenAI’s GPT series [7, 8, 9]. 


The central idea of the Transformer architecture is to do away 
with sequential processing of text altogether, as was done 
traditionally with deep-learning architectures like LSTMs [10] 
and GRUs [11], and instead process the tokens (words, subwords, 
and punctuation) of text simultaneously using an operation called 
attention. The variant of attention used in the original formulation 
of the Transformer architecture, scaled dot-product attention, is 
defined as follows: 


$$\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\!\left(\frac{QK^{T}}{\sqrt{d}}\right) V$$ 
The matrices Q and K are called the queries and keys, 
respectively, and each have column-dimension d, while the matrix 
V, called the values, has column-dimension d’. We consider the 
case when Q, K, and V are all the same matrix X, and the resulting 
operation is known as self-attention. Each row of X corresponds to 
a context-independent dense embedding of a single token with a 
small positional encoding vector added so that the model can take 
into account the position of the token in the input text. Thus, 
self-attention recomputes every token as a linear combination of 
every other token, where the weights in the linear combination 
depend on a scaled dot-product similarity (the $QK^{T}/\sqrt{d}$ term). In 
order to allow the Transformer to learn several different patterns 
of lexical interaction, several matrices of weights are used to 
compute multi-head self-attention: 


$$\mathrm{MultiHead}(X, X, X) = \mathrm{Concat}(\mathrm{head}_1, \mathrm{head}_2, \ldots, \mathrm{head}_h)\, W^{O},$$ 


where 

$$\mathrm{head}_i = \mathrm{Attention}(X W_i^{Q}, X W_i^{K}, X W_i^{V}),$$ 


and all of the W matrices consist of learnable weights. After 
multihead attention is computed, the results are aggregated and 
resized using a simple single-hidden-layer feedforward neural 
network, which has its own learnable weights. This combination 
of multihead attention followed by a feedforward neural network 
constitutes the fundamental building block of the Transformer 
architecture: the Transformer block. A Transformer model, then, 
is built by chaining together several Transformer blocks, each 
potentially with its own set of weights. 
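
For readers who prefer code to notation, here is a minimal pure-Python sketch of a single self-attention head. It uses plain lists instead of a tensor library, omits the positional encoding, the concatenation of several heads and the feedforward layer, and uses identity projection matrices as illustrative placeholders for learned weights.

```python
import math


def matmul(a, b):
    """Multiply two matrices stored as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def transpose(m):
    return [list(col) for col in zip(*m)]


def softmax(row):
    """Numerically stable softmax over one row of scores."""
    peak = max(row)
    exps = [math.exp(v - peak) for v in row]
    total = sum(exps)
    return [e / total for e in exps]


def attention(q, k, v):
    """Scaled dot-product attention: softmax(Q K^T / sqrt(d)) V."""
    d = len(q[0])  # column-dimension of the queries and keys
    scores = matmul(q, transpose(k))
    weights = [softmax([s / math.sqrt(d) for s in row]) for row in scores]
    return matmul(weights, v), weights


def self_attention_head(x, w_q, w_k, w_v):
    """One head: project X three ways, then attend."""
    return attention(matmul(x, w_q), matmul(x, w_k), matmul(x, w_v))


# Three toy token embeddings of dimension 2.
x = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
identity = [[1.0, 0.0], [0.0, 1.0]]
out, weights = self_attention_head(x, identity, identity, identity)
```

Each row of `weights` sums to 1, so each output row is a weighted average of the value vectors, matching the description of self-attention as a linear combination of token representations.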


For its NLG engine Generate utilizes a Transformer model 
pre-trained for the task of next-token prediction. The Transformer 
architecture processes the conditioning input text in order to 
produce a probability distribution over all tokens in the 
vocabulary. We sample from the vocabulary according to this 
distribution, and then proceed auto-regressively: we process the 
newly sampled token and use it to produce a new probability 
distribution and sample a new token. We continue in this way 
until a maximum token limit is met, or until a stop sequence is 
produced (such as a newline character ‘\n’). 
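
The sampling loop just described can be sketched as follows. The `toy_model` function is a hypothetical stand-in for the pre-trained Transformer: it returns a hand-written probability distribution over a tiny vocabulary, whereas a real model would compute the distribution from all tokens seen so far.

```python
import random


def generate(next_token_probs, prompt, max_tokens=20, stop="\n", seed=0):
    """Sample tokens one at a time until a stop token or the token limit."""
    rng = random.Random(seed)
    tokens = list(prompt)
    output = []
    while len(output) < max_tokens:
        probs = next_token_probs(tokens)  # distribution over the vocabulary
        choices, weights = zip(*probs.items())
        token = rng.choices(choices, weights=weights, k=1)[0]
        if token == stop:
            break  # stop sequence produced
        output.append(token)
        tokens.append(token)  # feed the sampled token back in
    return output


def toy_model(tokens):
    """Hypothetical next-token distribution, conditioned on the last token."""
    if tokens[-1] == "Which":
        return {"action": 1.0}
    if tokens[-1] == "action":
        return {"first": 0.5, "\n": 0.5}
    return {"\n": 1.0}


result = generate(toy_model, ["Which"])
capped = generate(lambda tokens: {"a": 1.0}, ["x"], max_tokens=3)
```

The second call shows the token limit acting as the fallback stopping rule when the model never emits the stop sequence.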


2.2 How Generate Works 


As illustrated in Figure 1, Generate has five main system 
architectural components. The first is a React JavaScript-based 
graphical user interface. Through the interface, users can select 
pre-uploaded AI models, generate an item, visualize and edit the 
item, and visualize the metrics generated by the AI, allowing the 
user to complete a full item generation flow, from creation to 
validation. The user interface is linked to the second component, 
the Auth0 authentication platform, a third-party service 
specialized in secure authentication and authorization workflows. 
Once the user is authenticated, the GUI will connect with the third 
component, Hasura [2]. Hasura serves mainly as a GraphQL API 
to connect the GUI with the database and the serverless services. 
The fourth component is the item generation services (SQS Queue 
and Lambda Worker), which are responsible for interfacing with 
the NLG engine API with all the advantages of a serverless 
architecture [3]. The NLG engine API forms the last core 
component and is responsible for generating content based on the 
model provided. 


Users begin their interaction with Generate by providing 
specifications of desired content, including: a content map of item 
types and topics to be generated; user-specific writing guidelines 
so that domain semantics and formatting can be tailored to best 
practices; and admissible ranges for lexical metrics such as 
type-token ratio, Flesch Reading Ease, and the Coleman-Liau 
Index. These specifications constitute what is called a project. 
Along with these specifications, users must also supply a set of 
representative items, usually between 100 and 200 in total, although 
the number depends on the complexity of the content map specified 
in the project and the number of different types of items to be 
generated. At a baseline, all that is needed is the raw item, though 
users may supply their own item metadata to help improve the 
performance of our AI engine. Features such as item topic 
categorization, key and distractor labels, item cognitive type 
(recall, application, etc.), and difficulty metrics like p-value and 
point biserial [12] may all be used to further hone our AI models’ 
performance. 


Figure 1: Generate system architecture is designed to be 
modular with distributed services hosted on AWS. Components shown: 
Generate GUI, Auth0, Generation Worker, Hasura, and AI Engine API. 


Figure 2: Generate item authoring interface. Users can select 
from a number of item generation models and create items 
on-the-fly with the click of a button. 


After the user supplies content specifications and representative 
items, the k-means clustering algorithm [13] is applied to produce 
several groups of 8-10 representative items each. The clustering 
algorithm is predicated on a combination of user-supplied 
metadata, as well as numeric item features produced by the 
Transformer-based Universal Sentence Encoder (USE) model 
[14]. The goal of this clustering procedure is to produce groups of 
items which are semantically and stylistically homogeneous, 
which in turn improves the reliability of our AI engine to produce 
items which are coherent, semantically and factually relevant to 
the content domain, and stylistically appropriate according to the 
user’s writing guidelines. Each group of representative items 
corresponds to a different string of conditioning input text for our 
AI engine, which we refer to as a model. Each model produces a 
different “flavor” of content. By building several models, we 
ensure that a user’s content specification demands are met, and a 
wide diversity of items is produced while doing so. 
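
As a rough illustration of the clustering step, the sketch below runs plain k-means (Lloyd's algorithm) over item feature vectors. It rests on several assumptions: the two-dimensional `points` are invented stand-ins for the combined metadata and sentence-encoder features, the deterministic initialization is chosen only to keep the example reproducible, and nothing here enforces the 8-10 items per group mentioned above (in practice one might pick k so that groups average that size).

```python
import math


def kmeans(points, k, iterations=50):
    """Plain k-means; returns one cluster index per point."""
    # Deterministic initialization for this sketch: the first k points.
    centroids = [list(p) for p in points[:k]]
    labels = None
    for _ in range(iterations):
        # Assignment step: each point joins its nearest centroid.
        new_labels = [
            min(range(k), key=lambda c: math.dist(p, centroids[c]))
            for p in points
        ]
        if new_labels == labels:   # assignments stable: converged
            break
        labels = new_labels
        # Update step: move each centroid to the mean of its members.
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    return labels


if __name__ == "__main__":
    # Hypothetical item feature vectors forming two obvious groups.
    points = [(0, 0), (0.1, 0), (5, 5), (5.1, 5), (0, 0.1), (5, 5.1)]
    print(kmeans(points, k=2))
```

Each resulting group would then supply the representative items for one string of conditioning text, that is, one model in the paper's terminology.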


Given a model, raw content is generated by our NLG engine and 
then undergoes several automated quality checks before being 
presented to the user. First, the raw content must pass a parse 
check to ensure that desired item formatting has been captured. 
Next, we perform an overlap check to ensure that no part of the 
generated content overlaps too heavily with the representative 
items, or with any other part of the generated content (e.g. to 
prevent duplicate options in a multiple choice item). For multiple 
choice content, users are able to specify a range of options to 
generate and so we also check that a sufficient number of unique 
options are produced. Finally, previously specified lexical metrics 
are computed, and we check that the generated item lies in the 
user-specified admissible ranges for these metrics. 
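
Two of these automated checks can be sketched in a few lines. The example below is an assumption-laden illustration rather than Generate's code: it covers only the unique-option check and the lexical metric range check, tokenizes with a simple regular expression, and computes type-token ratio and the Coleman-Liau index (using the standard formula 0.0588 L - 0.296 S - 15.8, where L and S are letters and sentences per 100 words). Flesch Reading Ease is omitted because it requires syllable counting. The function and parameter names are hypothetical.

```python
import re


def type_token_ratio(text):
    """Number of distinct words divided by total number of words."""
    words = re.findall(r"[A-Za-z']+", text.lower())
    return len(set(words)) / len(words) if words else 0.0


def coleman_liau_index(text):
    """Coleman-Liau index: 0.0588*L - 0.296*S - 15.8."""
    words = re.findall(r"[A-Za-z']+", text)
    if not words:
        return 0.0
    letters = sum(ch.isalpha() for ch in text)
    sentences = max(1, len(re.findall(r"[.!?]+", text)))
    L = 100.0 * letters / len(words)     # letters per 100 words
    S = 100.0 * sentences / len(words)   # sentences per 100 words
    return 0.0588 * L - 0.296 * S - 15.8


def passes_checks(stem, options, ranges, min_unique_options):
    """Return True if the item has enough unique options and in-range metrics."""
    unique = {o.strip().lower() for o in options}
    if len(unique) < min_unique_options:
        return False
    metrics = {"ttr": type_token_ratio(stem), "cli": coleman_liau_index(stem)}
    return all(lo <= metrics[name] <= hi for name, (lo, hi) in ranges.items())


if __name__ == "__main__":
    stem = "The nurse checks the airway first."
    ranges = {"ttr": (0.5, 1.0), "cli": (0.0, 12.0)}  # hypothetical user ranges
    print(passes_checks(stem, ["Pain", "Airway", "Pulse", "Mood"], ranges, 4))
```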


As shown in figure 2, Generate offers a graphical user interface 
where item writers interact with our NLG engine directly to 
produce content. With this interface, item writers can select one of 
several item generation models and generate items on-the-fly with 
the click of a button. The item writer can then refine and annotate 
generated items before saving a finalized version. Generate’s 
content generation interface allows item writers to save 
an intermediate version of a promising item and regenerate 
unwanted parts. For example, with multiple choice content, one 
can select a key and one distractor from the list of available 
options, and then regenerate the remaining options to produce a 
fresh list to choose from. In this way, item writers can use 
Generate to help them rapidly ideate additional options, leading to 
significant speedups in the item-writing process. 


Users can review generated content at any time using Generate’s 
content dashboard. This dashboard allows users to review 
project-level information such as lexical metric distributions, as 
well as review individual items and their SME annotations. Once 
a selection of items has been made, users can download their 
content either as raw text or in QTI format. 


2.3 SME Usability Experiment Results 


For the content domain of nursing professional licensure, we 
performed two experiments with a subject matter expert 
(SME)/item writer in the domain. In the first experiment, the SME 
was asked to perform a quality review of a set of 40 items created 
entirely by Generate’s NLG engine, spanning four topic areas: 
biotechnology, medical assisting, nursing assisting, and practical 
nursing (see figure 3 for an item from this set). As a baseline for 
comparison, we mixed in a “calibration set” of 40 representative 
items produced by a separate human item writer spanning these 
same four topic areas. The SME was not told which items were 
from the calibration set and which were created by Generate. To 
perform the quality review, the SME was asked to check factual 
accuracy and topic relevance, make any necessary edits, estimate 
the difficulty of the item on a subjective easy/medium/hard scale, 
give any general comments and feedback, and assign a subjective 
overall quality rating on a 1-7 Likert scale (with 1 being poor and 
7 being excellent). The median quality rating for Generate items 
was 5.5, compared to a median of 6 on the calibration set, 
with 70% of Generate items rated 5 or higher. There was also 
considerable overlap in the difficulty distributions, as shown in 
table 1. It took the SME an average of 3.75 minutes/item to 
perform this quality review. Compared to the SME’s estimated 
20-30 minutes/item to write an item manually, this is roughly a 
five- to eightfold reduction in time per item, a clear improvement 
in SME item writing throughput. 


An 80-year-old female falls out of her wheelchair and on to the floor in the clinic waiting area. She is 
on oxygen and has an indwelling catheter. The nurse begins to set up equipment and ask questions 
to assess the client's complaints. When assessing this client, the nurse is primarily concerned with: 

(A) Pain level and location 
(B) Breathing and respiratory status 
(C) Capillary refill and sensation in all extremities 
(D) Mental status 


Figure 3: Sample item created by Generate. A user/SME is 
able to select the key/correct answer and make any edits 
required. 


In the second experiment, the same SME was asked to interact 
directly with the Generate content generation interface to produce 
50 more items in the domain of nursing professional licensure. We 
gave the SME five models ranging over a single nursing topic and 
requested that they produce ten items for each model. We captured 
data on generation time as well as item survival rate. For each 
item, the SME used Generate’s content generation interface to 
first generate a multiple choice item with eight possible options. 
The SME was then asked to select the best combination of key 
and three distractors from these available eight options, and then 
perform necessary fact checking and editing. Using the Generate 
system, it took the SME an average of 2.7 minutes/item, including 
latency necessary for the system to generate the raw item. 
According to SME testimony, a similar exercise with a 
conventional item writing approach would have taken roughly 30 
minutes/item, not including slowdowns due to SME fatigue and 
burnout. We are currently working on a number of follow-on 
experiments with item writers in a variety of domains including 
K-12 education, higher education, and professional licensure. 


Table 1: Comparison of difficulty distributions for the Generate 
and calibration item sets 


3. DISCUSSION AND FUTURE DIRECTIONS 


Our early investigations with item writers indicate a significant 
increase in assessment authoring throughput, which, if borne out in 
future and ongoing studies, would mitigate bottlenecks in 
producing easily accessible, high-quality assessment items. We 
believe this would help enable innovations in formative 
assessment, personalized learning, and building customized and 
efficacious classroom activities. 


In addition to assessment content generation, we plan to 
add the following functionality to Generate over time. 


3.1 Item Difficulty Estimation 


Content creators must ensure that assessments adhere to a desired 
difficulty distribution, where we take difficulty to be measured by 
p-value (the proportion of examinees that answered a question 
correctly). Current methods for estimating p-values involve 
manual field-testing of provisional items, which is 
time-intensive and risks item exposure, reducing the lifespan of 
the item. While previous work in automated difficulty estimation 
has employed techniques based on first-order logic [15] as well as a 
machine learning-based word embedding approach [16], we are 
exploring a blended approach which leverages structured item 
metadata with Transformer-based processing of unstructured item 
text. In this way, users can quickly recycle items which do not 
adhere to required difficulty specifications, thereby increasing the 
survival rate of items produced by Generate. 
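
For reference, the classical field-test statistics that such a model would aim to predict can be computed directly from response data. The sketch below is illustrative only: it assumes scored 0/1 responses and per-examinee total scores, and the function names are hypothetical. The p-value is the proportion correct as defined above; the point biserial uses the standard formula (M1 - M0) / s * sqrt(p * q) with the population standard deviation s.

```python
import math


def item_p_value(responses):
    """Proportion of examinees who answered the item correctly (0/1 data)."""
    return sum(responses) / len(responses)


def point_biserial(responses, total_scores):
    """Point-biserial correlation between item correctness and total score."""
    n = len(responses)
    p = item_p_value(responses)
    q = 1.0 - p
    mean_all = sum(total_scores) / n
    sd = math.sqrt(sum((x - mean_all) ** 2 for x in total_scores) / n)
    if p in (0.0, 1.0) or sd == 0.0:
        return 0.0  # degenerate case: no variation to correlate
    correct = [s for r, s in zip(responses, total_scores) if r == 1]
    wrong = [s for r, s in zip(responses, total_scores) if r == 0]
    m1 = sum(correct) / len(correct)   # mean total score, item correct
    m0 = sum(wrong) / len(wrong)       # mean total score, item incorrect
    return (m1 - m0) / sd * math.sqrt(p * q)


if __name__ == "__main__":
    responses = [1, 1, 0, 0]        # invented scored responses to one item
    totals = [4, 3, 2, 1]           # invented total test scores
    print(item_p_value(responses), point_biserial(responses, totals))
```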


3.2 Automated Content Tagging 


Tagging educational content with the most relevant learning and 
assessment standards such as CCSS [23], NGSS [24], etc. is one 
of the most critical elements in creating highly efficacious 
content. This enables the tracking of student skill gaps, 
recommendation of remediative learning resources and mastery of 
discipline topics, skills and cross cutting capabilities. We are 
currently developing a text content classification approach that 
can be used to delineate skills, learning objectives and core 
disciplinary ideas in the generated assessment items. 


4. CONCLUSION 


In this paper we have introduced Generate, a system that utilizes 
an NLG approach to significantly increase the productivity of 
assessment content creators. Generate is built on a language 
modeling architecture that understands the deep semantic and 
lexical structure of assessment content, which allows us to handle a 
variety of assessment domains and item types. Our system’s 
content dashboard integrates elegantly with existing item writer 
workflows for item review, editing and approval. To the best of 
our knowledge, Generate is the first NLG content authoring 
system designed for use in education, and we believe it can enable 
innovations in personalized learning, formative assessment and 
efficacious classroom activities. 


ABSTRACT 


We present Catalog, an educational content classification and 
alignment system that tags learning and assessment content in a 
semantically meaningful and accurate manner. Unlike other 
approaches that rely on keywords or search terms and crosswalks 
between knowledge taxonomies, Catalog utilizes powerful NLP, 
specifically language models based on the Transformer 
architecture, to encode content in a context attentive fashion. This 
allows us to capture deep conceptual and contextual relations in 
content to classify it against a wide variety of educational 
standards and taxonomies. We present results from empirical 
studies demonstrating efficacy of our approach in classifying 
learning content to the Next Generation Science Standards 
(NGSS). 


Keywords 


Content tagging/classification, NLP, Transformer Networks 


1. INTRODUCTION 


Tagging educational content with the most relevant learning and 
assessment standards and education search terms is one of the 
most critical elements in creating highly efficacious content. This 
enables the tracking of student skill gaps, recommendation of 
remediative learning content and mastery of discipline topics, 
skills and cross cutting capabilities. With the ever growing 
volume of digital learning content and educational standards [12] 
the demands on tagging content are not being met by current 
solutions. 


Current processes to tag content typically start with raw untagged 
content that has to be manually reviewed, understood and 
analyzed by subject matter experts (SMEs) and then classified 
against a particular education standard, e.g. the NGSS [10], 
resulting in the first set of foundational standards tags. Typically, 
these standards are hierarchical and utilize a taxonomic 
knowledge representation to capture the knowledge structure 
including core disciplinary knowledge, skills and/or cross cutting 
capabilities. Given the foundational tags, one can transfer onto any 
number of desired taxonomies, for instance the Common Core 
State Standards [11], using taxonomy crosswalks [13]. Crosswalks 
are essentially mappings from one standard’s taxonomy to another 
that have for the most part been developed by SMEs and are often 
proprietary, which limits their applicability. 


While in theory this process seems to offer a relatively scalable 
solution to the content tagging problem, in practice it is inefficient 
and has significant limitations. Firstly, the initial step of creating 
the foundational tags is manually executed and highly subjective, 
making it expensive and error prone. But even when that is done 
well the taxonomy crosswalks do not offer a perfect solution, 
because these crosswalks are not one-to-one mappings between 
the tags of one taxonomy and the other. Due to the hierarchical 
nature of the standards taxonomies and how they are designed and 
crafted by SMEs, oftentimes there are vast differences in the 
levels of knowledge abstraction, resulting in many-to-many 
mappings for the crosswalks connecting them. The end result is 
that for a given unit of content, even when a foundational tag and 
an associated crosswalk are available, SMEs still have 
to make the final adjudication of the most appropriate tag in the 
target standard’s taxonomy. 


To address these challenges we have developed Catalog, an 
automated content classification system that leverages recent 
advances in NLP, specifically the Transformer architecture. This 
allows us to analyze educational content with richer context-aware 
text embeddings and pre-trained language models. We have 
evaluated the accuracy of our approach with promising results on 
an OpenStax Biology textbook [14] with ground truth NGSS tags 
(labeled by human experts). We believe Catalog can significantly 
help streamline and accelerate manual workflows around content 
tagging and curation. It is applicable to both existing and 
new content: enriching existing content tags for more targeted 
search, discovery and recommendation, as well as maintaining 
content alignments as educational standards evolve. 


2. TECHNICAL APPROACH AND SYSTEM DETAILS 


2.1 Transformers and Text Embeddings 


At the core of Catalog’s content classification tagging system is 
the Transformer architecture, first proposed by Vaswani et al. in 
2017 [2]. Catalog utilizes a series of pre-trained Transformer 
models [5, 6] to encode text-based content into vectorized features, 
which are then used to estimate the probability that the 
content is related to a textual description of the target taxonomy. 
Further details of this approach are presented in the following 
subsections. Here we present a brief overview of the Transformer 
architecture. 


By eschewing the sequentially-processed nature of previous 
deep-learning NLP architectures (like LSTMs [3] and GRUs [4]) 
in favor of multi-head attention, the Transformer architecture is 
highly parallelizable and scalable, allowing for richer 
context-aware text embeddings and a substantial pre-training 
capacity which allows for a transfer learning approach to NLP 
tasks. Since its inception, research into the Transformer 
architecture has exploded, with variants such as Google’s BERT 
[5] and OpenAI’s GPT series [6, 7, 8] topping several NLP 
benchmarks, such as the multitask GLUE suite [9]. 


As indicated above, Transformers are a deep-learning architecture 
based on the attention mechanism. The original formulation of the 
Transformer architecture used a variant known as scaled 
dot-product attention, defined as 


$$\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\!\left(\frac{QK^\top}{\sqrt{d}}\right)V,$$ 


where the matrices $Q$ and $K$ are called the queries and keys, 
respectively, and each have column-dimension $d$, while the matrix 
$V$ is called the values and has column-dimension $d'$. When the 
queries, keys, and values are all equal to some matrix $X$, the 
resulting operation is called self-attention. The rows of this matrix 
$X$ correspond to context-independent feature vectors of the tokens 
of the input text, each with a small positional encoding vector 
added so that the model is aware of each token’s position within 
the input sequence. Self-attention can be thought of as an 
operation which recomputes each token as a linear combination of 
the other tokens, where the weights of the linear combination 
correspond to a scaled dot-product similarity score (the 
$QK^\top/\sqrt{d}$ term). In this way, (potentially long-range) 
interactions between tokens are captured. To allow the Transformer 
to learn different patterns of interaction, several matrices of 
learnable weights are used to compute multi-head self-attention: 


$$\mathrm{MultiHead}(X, X, X) = \mathrm{Concat}(\mathrm{head}_1, \mathrm{head}_2, \ldots, \mathrm{head}_h)\,W^O,$$ 


where 

$$\mathrm{head}_i = \mathrm{Attention}(XW_i^Q, XW_i^K, XW_i^V),$$ 


and all of the W matrices consist of learnable weights. After 
multi-head self-attention is computed, the resulting feature vector 
is fed to a single-hidden-layer feedforward neural network for 
aggregation and resizing. These two consecutive operations, 
multi-head self-attention followed by the feedforward neural 
network, constitute the core of a Transformer block. A 
Transformer model, then, is built by chaining several Transformer 
blocks together, each potentially with its own set of weight 
matrices. 
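
To make the definitions above concrete, here is a small pure-Python sketch of scaled dot-product attention and multi-head self-attention. It is a didactic illustration under stated assumptions, not the code used in Catalog: matrices are plain nested lists, the weight matrices in the usage example are identity matrices chosen only for readability, and positional encodings, the feedforward layer, and any normalization are left out.

```python
import math


def matmul(A, B):
    """Matrix product of two nested-list matrices."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)] for row in A]


def transpose(A):
    return [list(col) for col in zip(*A)]


def softmax(row):
    m = max(row)                       # subtract max for numerical stability
    exps = [math.exp(x - m) for x in row]
    total = sum(exps)
    return [e / total for e in exps]


def attention(Q, K, V):
    """softmax(Q K^T / sqrt(d)) V, with d the column dimension of Q and K."""
    d = len(Q[0])
    scores = matmul(Q, transpose(K))
    weights = [softmax([s / math.sqrt(d) for s in row]) for row in scores]
    return matmul(weights, V)


def multi_head_self_attention(X, heads, W_O):
    """heads is a list of (W_Q, W_K, W_V) triples, one per attention head."""
    outputs = [
        attention(matmul(X, W_Q), matmul(X, W_K), matmul(X, W_V))
        for W_Q, W_K, W_V in heads
    ]
    # Concatenate the head outputs row by row, then project with W_O.
    concat = [sum((out[i] for out in outputs), []) for i in range(len(X))]
    return matmul(concat, W_O)


if __name__ == "__main__":
    I2 = [[1.0, 0.0], [0.0, 1.0]]
    X = [[1.0, 0.0], [0.0, 1.0]]       # two tokens, two features each
    print(multi_head_self_attention(X, [(I2, I2, I2)], I2))
```

With identity weights and a single head, each output row is simply the attention weight vector for that token, so every row sums to one and the largest weight falls on the token itself.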


2.2 How Catalog Works 


Catalog’s AI-powered content tagging system utilizes a 
Transformer-based semantic matching engine to rank taxonomic 
categories by their semantic similarity to given educational 
content. The semantic matching algorithm works as follows. We 
are given a collection of textual descriptions of taxonomic 
categories (e.g. NGSS [10]), which we refer to as “documents,” 
and the raw text of educational content, referred to as the “query” 
that needs to be classified. For each document, we produce a 
string of input text by combining it with the query along with a 
small amount of connective text. Using a Transformer model 
pre-trained for next token prediction, we then process the input 
string to convert the query tokens into feature vectors. These 
feature vectors are then further processed to produce probabilities 
for each query token, conditioned on the document text. 
Additionally, we process the query text by itself in order to 
determine unconditioned probabilities for the query tokens. 
Finally, a match score is produced for each document by 
comparing the conditioned vs. unconditioned query token 
probabilities and then aggregating these into a single real-valued 
score. Documents are then ranked according to these scores, with 
a higher score indicating a higher match similarity. 
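
The text does not specify how the conditioned and unconditioned probabilities are compared or aggregated, so the following sketch makes an explicit assumption: it scores each document by the mean log-ratio of conditioned to unconditioned query-token probabilities. The probabilities in the usage example are invented, and the function names are hypothetical; in a real system they would come from the pre-trained language model.

```python
import math


def match_score(conditioned_probs, unconditioned_probs):
    """Assumed aggregation: mean of log p(token | document) - log p(token)."""
    ratios = [
        math.log(pc) - math.log(pu)
        for pc, pu in zip(conditioned_probs, unconditioned_probs)
    ]
    return sum(ratios) / len(ratios)


def rank_documents(conditioned_by_doc, unconditioned_probs):
    """Return document names sorted from best to worst match."""
    scores = {
        doc: match_score(probs, unconditioned_probs)
        for doc, probs in conditioned_by_doc.items()
    }
    return sorted(scores, key=scores.get, reverse=True)


if __name__ == "__main__":
    unconditioned = [0.1, 0.2]          # invented per-token probabilities
    conditioned = {
        "doc_a": [0.4, 0.4],            # document makes the query more likely
        "doc_b": [0.1, 0.1],            # document does not help
    }
    print(rank_documents(conditioned, unconditioned))
```

Under this assumed scoring, a positive score means the document text raised the model's probability of the query tokens on average, which is one plausible reading of "higher match similarity".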


Figure 1: Catalog system architecture is designed to be modular with 
distributed services hosted on AWS. 


Figure 2: Sample page from the OpenStax Biology 2e textbook used in 
our experiments. 


In terms of system architecture and implementation, Catalog has two 
core components, as shown in figure 1. The first is a 
Lambda API Endpoint that leverages the serverless architecture 
and serves mainly as an interface between the user and the 
Transformer-based semantic query process, the “AI Engine”. It 
authenticates users’ requests, manages requests and accesses the 
system database. The second major component, the “AI Engine”, 
manages content processing and returns the match scores from 
classification. 


Figure 3: Sample from NGSS High School Life Science Biology 
performance expectation (PE) standards. Image shows 4 of the 24 
unique PE standards used in our experiment. 


2.3 Experimental Results 


We tested the accuracy and performance of our approach on a 
learning content dataset extracted from the OpenStax Biology 2e 
high school textbook [14]. The dataset consists of approximately 
500 pages of content spanning 98 chapter/subchapter sections that 
ranged from 410 to 545 words each. Each of the book’s 98 
sections is annotated with NGSS High School Life Sciences 
(HS-LS) performance expectation (PE) tags [10], which are also 
provided by OpenStax [14] and served as the ground truth labels in 
our experiment. There are a total of 24 unique Biology NGSS PE 
standards applicable to our dataset, essentially rendering this a 
24-class classification problem. Figure 2 shows a sample from the 
OpenStax textbook and in figure 3 we include sample PEs from 
the 24 NGSS standards used in our experiment. We note that 
these ground truth labels are not necessarily unique: each section 
is associated with one to three NGSS tags. 


Topic documents for the 24 PE standards were assembled from 
the Topic Arrangements of the NGSS, which include descriptions of 
PEs, Science and Engineering Practices, Disciplinary Core Ideas, 
and Crosscutting Concepts. Because our model predicts NGSS tags 
for a given OpenStax section by ranking them, we assess 
performance by computing the top-n overall accuracy, that is, the 
proportion of sections which have at least one ground truth 
label among their top-n ranked predictions (note that for n = 1, 
this is just the traditional overall accuracy measure). For 
comparison, we had an SME perform this classification exercise 
manually, i.e., provide up to three suggested NGSS PE tags for each 
of the 98 book sections in our dataset. This SME is a high-school 
science teacher in a New York City school district and is highly 
experienced with the NGSS standards. 
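
The top-n accuracy measure described above is straightforward to compute. The sketch below is a minimal illustration with invented tag names; `ranked_predictions` holds one ranked tag list per section and `ground_truth` holds the set of correct tags for each section.

```python
def top_n_accuracy(ranked_predictions, ground_truth, n):
    """Fraction of sections with at least one true tag in the top n predictions."""
    hits = sum(
        1
        for ranked, truth in zip(ranked_predictions, ground_truth)
        if set(ranked[:n]) & set(truth)
    )
    return hits / len(ground_truth)


if __name__ == "__main__":
    # Invented example: three sections, three candidate tags.
    ranked = [["a", "b", "c"], ["b", "a", "c"], ["c", "b", "a"]]
    truth = [{"a"}, {"a", "c"}, {"a"}]
    for n in (1, 2, 3):
        print(n, top_n_accuracy(ranked, truth, n))
```

Because a section may carry several ground truth tags, a prediction counts as correct as soon as any one of them appears in the top n.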


Before examining the results of this experiment, we note that one 
NGSS standard, HS-LS1-2, was severely overrepresented in our 
dataset, accounting for nearly 42% of all ground truth tags, more 
than 5 times the next-most-represented tag. To account for this in 
our accuracy computations, we took 1000 random subsamples of the 
sections carrying this tag, and then averaged the top-n accuracy over 
these subsamples. Figure 4 shows the resulting NGSS tag 
distribution of such a subsample. 
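
One way to implement this averaging is sketched below. It is an assumption-based illustration: sections are represented as dictionaries with a `tags` field, `metric` is any function that scores a list of sections (for example a top-n accuracy), and the names and data layout are hypothetical rather than taken from the authors' code.

```python
import random


def subsampled_average(items, majority_tag, keep, metric, n_subsamples=1000, seed=0):
    """Average a metric over random subsamples of the overrepresented class.

    Every item without majority_tag is always kept; only `keep` of the items
    carrying majority_tag are drawn at random in each subsample.
    """
    rng = random.Random(seed)
    majority = [it for it in items if majority_tag in it["tags"]]
    rest = [it for it in items if majority_tag not in it["tags"]]
    total = 0.0
    for _ in range(n_subsamples):
        sample = rest + rng.sample(majority, keep)
        total += metric(sample)
    return total / n_subsamples


if __name__ == "__main__":
    # Invented data: five sections with the dominant tag, two without.
    items = [{"tags": {"major"}, "correct": True} for _ in range(5)]
    items += [{"tags": {"minor"}, "correct": False} for _ in range(2)]
    fraction_correct = lambda sample: sum(it["correct"] for it in sample) / len(sample)
    print(subsampled_average(items, "major", keep=2, metric=fraction_correct))
```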


Figure 5 shows the top-n accuracy averaged over the 1000 
subsamples as a function of n. When compared to ground truth, 
the semantic query model achieved 51%, 73%, and 77% top-1, 
top-2, and top-3 overall accuracy, respectively, among the 24 
NGSS PE standards. In contrast, the SME achieved 48%, 68%, 
and 70% top-1, top-2, and top-3 overall accuracy, respectively. It 
should also be noted that it took the SME 520 minutes to complete 
the manual classification of the dataset, whereas our system 
completed processing in approximately 2 minutes. 


Figure 4: Distribution of NGSS tags across subsample of data. In all, 
55 OpenStax sections are associated with tag HS-LS1-2, whereas each 
subsample randomly selects only 11 of these to be commensurate with 
the next most represented tag, HS-LS2-5. 


Figure 5: Top-n Accuracy vs. n for 98 items of section text from the 
OpenStax Biology 2e textbook, tagged against the NGSS High School 
Life Sciences performance expectation standards (as above, n is the 
number of top predictions within which at least one ground truth 
label must fall for the prediction to be counted as correct). 


3. CONCLUSION 


In this paper we have introduced Catalog, an NLP-based content 
classification system that utilizes recent advances in transfer 
learning approaches to deeply and accurately tag educational 
content against popularly used learning standards. Unlike other 
approaches that rely on keywords or search terms and crosswalks 
between knowledge taxonomies, Catalog is built on a language 
modeling architecture that understands the deep semantic 
structure and relationships between concepts, topics, learning 
objectives and other attributes of content. We have presented early 
results from empirical studies demonstrating the efficacy of our 
approach in classifying learning content to the Next Generation 
Science Standards (NGSS). 


ABSTRACT 


As the use of social media increases in daily life, it has also 
increased for institutions in the field of education. While there 
may be benefits for schools to use this media outlet, the privacy of 
students within those schools may be at risk when their names and 
photos are shared on such a publicly accessible domain. In this 
study, we analyzed the extent to which students’ privacy is 
protected by qualitatively coding a random sample of 100 
Facebook posts made by U.S. school districts from a population of 
over 9.3 million photo posts that we collected. Using inferential 
techniques, we found that students are somewhat protected 
compared to teachers and community members, with only 2.67% 
of students’ detected faces able to be identified by name. The 
same measure for staff and community members was 4.6% and 
16%, respectively. These numbers at first appear small, but if 
applied to the entire population, this could potentially leave 
between 153,218 and 1,153,844 students identifiable to anyone on 
the internet. We discuss the severity and scale of these privacy 
threats and make recommendations for research on student 
privacy in social media and other informal education-related 
contexts. 


Keywords 
Privacy, Social Media, Facebook, Educational Institutions, Facial 
Recognition 


1. INTRODUCTION & PRIOR RESEARCH 


As the number of people using social media has increased, the 
risks to the privacy of social media users have also increased [23], 
and this is particularly true as social media use expands into 
areas of our lives that it did not previously occupy. Education is 
one such domain in which social media use is now widespread [2, 
10, 11, 13, 21, 22]—and is one domain for which the privacy risks 
from social media use, in general, may be compounded because of 
the centrality of a particularly vulnerable population, minors at 
school. 


For students in any given school district, the use of their name or 
face for social media may present notable privacy concerns. As 
many social media posts are made publicly available, they may be 
accessed by unexpected sets of individuals, even by those without 


There is past research on the intersection of privacy and social 
media. For example, Fiesler and Proferes [6] examined what 
participants in social media studies thought of their data being 
used by others—particularly, by academic researchers. Only 
around one-quarter of participants in their survey study reported 
being comfortable with their data being used without being 
informed of such use. 


Related lines of research explored the intersection of privacy and 
social media data for students. For instance, Ifenthaler and 
Schumacher [9] surveyed students about what they thought of 
their data being used in learning analytics systems. They found 
that while students expressed comfort with sharing some types of 
data (e.g., data on their course enrollments, which fewer than 
20% of students reported reluctance to share), for 
others, students were much less comfortable. Notably, 
highly personal data, such as medical records, data on one’s 
personal income, and externally-produced data, including social 
media, were among those that students were the least willing to 
share. Fewer than 10% of students reported being willing for 
externally-produced data to be used within learning analytics 
systems. Other scholars have shown that pre-service teachers are 
highly uncomfortable with how social media companies use 
students’ social media data, with more than two-thirds of teachers 
expressing discomfort with such uses [16]. 


While past research has explored the willingness of social media 
participants and students to share their data for research, a 
different—institutional rather than personal—context for social 
media use presents potentially notable privacy risks. Namely, past 
research has shown that both post-secondary [13] and K-12 
educational institutions use social media extensively, particularly 
Twitter and Facebook [10, 11]. However, to this point, no research 
has investigated privacy in the context of social media use by 
K-12 educational institutions. 


This topic—K-12 institutions’ use of social media from a privacy 
perspective—is relevant and timely for a number of reasons. 
Recent research has shown that institutions are very active on both 
Twitter and Facebook, being associated with more than 300,000 
posts/month from the accounts of K-12 districts and schools [11]. 
As a consequence, there could be hundreds of thousands of 
students with their identities being posted in a highly public, 
searchable, persistent record, and in a way that could be misused 
in the future. In addition, these posts may contain information that 
would typically not be shared publicly and widely, but which may be 
shared because of limited understanding of how widely such posts 
(on public pages) can be viewed. The audiences of institutions are 
likely much greater than those of individual educators, meaning any 
potential privacy concerns may be much larger than those associated 
with independent users’ accounts. Raising awareness of this issue 
may prompt some reflection on the part of those sharing this 
information. 


This study involves an initial investigation into the extent to 
which students’ privacy is protected through analysis of Facebook 
posts made by public schools and school districts. In doing so, we 
ask a question about the nature of Facebook. Facebook claims that 
its site is “quick and easy,” [5] but the expediency and facility 
with which K-12 administrators and educators may use the 
platform may mean that it is also easy for school districts and 
schools to violate the privacy of students—with potentially 
difficult-to-anticipate negative ramifications at present and in the 
future. In particular, we aim to explore the degree to which the 
privacy of students might be compromised through public 
Facebook posts guided by the following research questions: 

1. To what extent can students be identified by name and 
photo on public Facebook pages of schools and school 
districts? 

2. How does the identifiability of students compare to that 
of staff and community members? 


2. METHOD 
2.1 Sample 


We used a public data mining methodology, one that draws from 
educational data mining techniques [1, 7], but which is 
distinguished by the use of (largely unstructured) publicly 
available data, such as data from websites and social media 
platforms [12]. Specifically, to obtain our sample of 100 schools’ 
and school districts’ Facebook posts, we used CrowdTangle, 
Facebook’s platform for providing academics and journalists 
access to data about public content on Facebook, including the 
content of posts and links to associated media as well as their 
timestamps and number of comments and likes (and other 
interactions) [4]. This content includes historical data from public 
Facebook pages with more than 50,000 likes and verified profiles. 
In addition, individuals with access to CrowdTangle can access 
public pages—but not individual users’ pages. 


We accessed all of the posts from K-12 institutions’ public 
Facebook pages in the United States, having obtained the URLs to 
15,728 educational institutions’ Facebook pages. We did so by 
using the statistical software R [20] to programmatically access 
(that is, to web-scrape) their homepages using data provided by the 
Common Core of Data [19], and recording all links to Facebook 
pages from those homepages. When schools linked to the same 
page as the district, we considered the page as a district page. The 
total study population included roughly 18 million posts shared 
from 2005-2020, with about 9.3 million of these posts including at 
least one photo. 


Carrying out a privacy-focused study ourselves, we took steps to 
protect the privacy of the individuals represented in our data. 
First, while we accessed and structured the data in a PostgreSQL 
database, we did not save the images themselves, instead using the 
Facebook posts and links therein to access the images through our 
web browser. More broadly, we determined early in our process 
that we were not prepared to analyze the photos 
algorithmically/automatically in a safe and ethical manner (e.g., 
using machine learning methods); we were concerned about 
uploading the images to a server, where they might be scanned 
and indexed. While we did not store the images in our database, 
we nevertheless took steps to protect this data, including 
permitting access only to authenticated members of the research 
team. 


From the population of approximately 18 million posts, we 
randomly sampled 100 posts with photos for this analysis. Our 
random sample of posts and related coding data were kept in a 
private Google Sheets file within a University Google Account (in 
part because Google is less likely, based upon past legislation, 
lawsuits, and company policies, to programmatically search the 
contents of educational accounts) to which only project 
contributors had access; this ensured that any data that could 
potentially be used to identify individuals was protected. 


2.2 Measures 

We analyzed the data qualitatively using a combination of two 
commonly used qualitative analysis techniques [8]: the use of a 
priori codes that we developed based upon prior research and our 
research questions, and an exploratory process that allowed us to 
elaborate on and substantiate those codes and to train as coders 
on the use of the coding frame. In particular, we analyzed the data 
in two ways, as we describe next. 


First, to determine whose privacy was at risk using our sample of 
100 posts, we accessed the images from each post through 
photo-specific URLs that are included along with information for 
each post in the data. Each image was accessed and analyzed 
individually. When there were more than ten images included in a 
post, we analyzed the first ten, reasoning that these first 10 were 
the most likely to be seen by viewers of the post. Each post of our 
sample was analyzed by two trained coders to evaluate the levels 
of identification for all names and faces included. Upon analysis 
of 15 posts, we drew three categories from similar research to 
distinguish individuals included in posts based on their role in the 
school or school district community [18]: 


• Students: Any minor assumed to be enrolled in a school 
and/or participating in a school hosted event or activity. 

• Staff: Any known employee of the school or school 
district; including but not limited to teachers, 
administrators, paraprofessionals, and communications 
directors. 

• Community Members: Any member of the school 
community who is not a verifiable student or staff 
member, including but not limited to parents, school 
board members, local business owners, and volunteers. 


Second, to determine how individuals’ privacy may be threatened, 
we developed a coding frame that we used to assess whether 
individuals’ names and/or photos of individuals were shared in 
posts, and whether it was possible to readily connect individuals’ 
names and photos of them. We will next describe our qualitative 
coding process for applying this coding frame. 


2.3 Qualitative Coding 


Coding proceeded by first determining the classification (student, 
staff, or community) of each individual detected by name or photo 
in a post, and then identifying the number of different first and 
last names included in the text of the post, as well as the number 
of individual faces shown in the posts’ images. In particular, the 
following four elements were recorded for each category of 
individuals: 
• Number of First and Last Names in Post 
First and last names were recorded separately within each 
category due to the fact that staff and community members are 
often mentioned using their professional prefix (Mr., Mrs., Dr., 
etc.) and only their last name. 
• Number of Faces in Images 
For identifying the presence of individuals’ faces, a detectable 
face was considered to be one for which three out of four of the 
following features were visible without enlarging the image: 1) 
eyes, 2) nose, 3) ears, and 4) mouth. Any faces appearing in more 
than one photo within the entire post were only counted once. 
• How Many Names and Faces Connected 
We looked in posts for specific indicators of an individual’s 
location in an image, including the order in which individuals 
appear in an image or labels on images. In general, identifiability 
criteria appeared as any text that explicitly stated which name 
matched with which face in which image. 

Our coding included an interrater reliability check for 
15 posts. Two coders coded these 15 posts individually using the 
coding process outlined above. Agreement percentages for 
detecting names, detecting faces, and identifying faces were 
100%, 77.77%, and 93.33% respectively. Total agreement 
between coders across all codes was 92.34%. 


2.4 Illustration of the Coding Process 


Image 1. Example Posts 


To illustrate the coding process, we provide two example posts 
and how we coded them above (Image 1). In the image for the 
first example, two student names, both first and last, are included 
in the text of the post, and multiple student faces are included in 
the three images of the post. The “third and fourth graders playing 
soccer” are not named individually and cannot be distinguished 
from each other. The two listed student names, which have been 
covered along with their faces for their protection, are identified 
by their locations in the images, thus making their faces 
identifiable by name as well. In the second image, the post 
included the name of a staff member, as well as detectable faces 
of two staff members. Without clarification, neither of these faces 
could be identified with the name mentioned. 


2.5 Inferential Analysis 

To analyze data to answer our first question, on how students can 
be identified by name and photo on public posts by K-12 
educational institutions, we evaluated the percentages of student 
faces that were able to be identified by name; for example, if, 
across the 100 posts, we detected 50 student faces in images, and 
one was identifiable by name, then the percentage of identifiable 
students would be 1% (rather than 2%, because we were 
interested in making inferences on the basis of the number of 
identifiable students per post). We refer to this value in our results 
as the percentage of identifiable faces per post. Then, based on the 
observed frequencies (from which we calculated these 
percentages), we calculated binomial 95% confidence intervals for 
the ratio of identifiable faces and categories of faces. We did this 
to present an initial set of estimates for how many faces in our 
population of 9.3 million photo posts may be identifiable. 
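
As an illustration of the interval estimate described above, the following is a minimal sketch of an exact (Clopper-Pearson) binomial confidence interval. The paper does not name the interval method it used, so the choice of the exact interval is an assumption, and the function names are hypothetical. The example call uses the hypothetical count from the text (one identifiable face across 100 posts).

```python
from math import comb


def binom_cdf(k, n, p):
    """P(X <= k) for X ~ Binomial(n, p)."""
    return sum(comb(n, i) * p**i * (1 - p) ** (n - i) for i in range(k + 1))


def _bisect_increasing(f, target, tol=1e-10):
    """Find p in [0, 1] with f(p) = target, for f increasing in p."""
    lo, hi = 0.0, 1.0
    while hi - lo > tol:
        mid = (lo + hi) / 2
        if f(mid) < target:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2


def exact_binomial_ci(k, n, alpha=0.05):
    """Two-sided exact (Clopper-Pearson) interval for k successes in n trials."""
    # Lower bound solves P(X >= k | p) = alpha / 2.
    lower = 0.0 if k == 0 else _bisect_increasing(
        lambda p: 1 - binom_cdf(k - 1, n, p), alpha / 2)
    # Upper bound solves P(X <= k | p) = alpha / 2,
    # which is the same as P(X > k | p) = 1 - alpha / 2.
    upper = 1.0 if k == n else _bisect_increasing(
        lambda p: 1 - binom_cdf(k, n, p), 1 - alpha / 2)
    return lower, upper


# Hypothetical example from the text: 1 identifiable face across 100 posts.
low, high = exact_binomial_ci(1, 100)
print(f"{100 * low:.2f}% to {100 * high:.2f}%")
```

Multiplying the two bounds by the number of photo posts in the population gives a plausible range for the number of identifiable faces, under the assumption that the sample is representative.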


To answer our second research question on relative differences in 
identifiability of individuals from different groups, we carried out 
the same analysis as above (for students) for staff and 
community members. Then, to compare the percentages of photos 
with identifiable individuals across categories, we calculated a 
different percentage than for RQ #1, one based not upon the 
number of posts (i.e., one identifiable face across 100 posts; 1%), 
but, rather, one based upon the total number of faces detected for 
people in each category. For instance, if there were 50 faces of 
students detected, and one was identifiable, then the percentage 
would be 2%; we refer to this in our results as the percentage of 
identifiable faces per category sum. This number—and comparing 
the confidence intervals between groups—would allow us to 
speak to whether individuals were differentially identifiable when 
photos of them were detected, even if there were, for example, far 
more photos of students than community members detected. 
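
The difference between the two percentages can be made concrete with the hypothetical counts used in the text (50 student faces detected, one of them identifiable, across 100 posts). This is only an arithmetic sketch; the variable names are illustrative.

```python
def percentage(count, total):
    """Return count / total expressed as a percentage."""
    return 100.0 * count / total


posts_sampled = 100
student_faces_detected = 50
identifiable_student_faces = 1

# RQ #1: identifiable faces relative to the number of posts sampled.
per_post = percentage(identifiable_student_faces, posts_sampled)

# RQ #2: identifiable faces relative to the faces detected in the category.
per_category_sum = percentage(identifiable_student_faces, student_faces_detected)

print(per_post, per_category_sum)  # 1.0 2.0
```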


3. RESULTS 


Our coding resulted in the detection (but not identification) of 299 
faces in the images from the 100 posts in our sample. Of these 299 
faces, only 13 (4.35% of all detected faces [2.33%, 7.32%]) were 
able to be identified with the individuals’ name from the text of 
the post. 


RQ #1. These 13 identifiable faces were identified within 12 
individual posts from schools or districts. Student faces comprised 
5 of those 12 and thus, for every 100 posts, we estimated that 
there were 5 identifiable student faces, representing the rate of a 
single identifiable student face for every twenty posts. Put another 
way, we estimated that 5% ([1.64%, 11.28%]) of these posts 
contained identifiable student faces. While this rate is relatively 
low, if used to make an inference about the population of photo 
posts we collected, this would suggest that between 153,218 and 
1,053,844 students could potentially be identified via their 
inclusion in school or school districts’ posts. 


RQ #2. For students, 187 faces were detected in photos and only 5 
of those 187 faces were able to be identified by their names, 
meaning that 2.67% ([0.87%, 6.13%]) of student faces were 
identifiable by name. Similar percentages are given below for 
each of the other categories. These numbers indicate that students 
and staff had a much smaller percentage of identifiable faces than 
that of community members. The rest of our results are shown in 
the table below (Table 1). 


Table 1. Identifiability Percentages by Category 

Category  | Faces Detected | Faces Identifiable | Percentage of Identifiable Faces per Post | Percentage of Identifiable Faces per Category Sum 
Student   | 187            | 5                  | 5% [1.64%, 11.28%]                        | 2.67% [0.87%, 6.13%] 
Staff     | 87             | 4                  | 4% [1.10%, 9.92%]                         | 4.6% [1.27%, 11.36%] 
Community | 25             | 4                  | 4% [1.10%, 9.92%]                         | 16% [4.54%, 36.08%] 


4. DISCUSSION 
4.1 Key Findings 


Upon the completion of coding our sample and numerical analysis 
for each category, we are able to make a few important claims 
about the protection of student privacy. First, students comprised a 
majority of the faces detected in images; however, compared to 
the large number of student faces, less than 3% of those faces 
were able to be identified by their names. While this low 
proportion may seem to indicate that students’ privacy is 
well-protected, the massive scope of this data (more than nine 
million public posts by schools or school districts) nevertheless 
means that, extrapolated to the entire data set, many students are 
at risk of being identified by both face and name by anyone with 
internet access. In short, K-12 institutions’ uses of social media 
could introduce very widespread threats to students’ privacy. 


How serious are these privacy threats? An identifiable photo 
presents a relatively low risk compared to, for example, one’s 
address or grade-related information being shared. However, the 
risk of doing so is not zero: These posts could be used to identify 
information about individuals, and when accessed, could 
potentially be used to predict their personal characteristics, even 
those that require making strong inferences, such as those about 
individuals’ political identities [14]—and, potentially, other 
identities. Adding to the problem, we note that each of these posts 
not only associates a name with a photo, but also an identifiable 
photo to a particular location (a school or district) at a specific 
time. In summation, what seems like a low-risk form of 
identification, can reveal quite a bit of information on students, 
leaving their privacy vulnerable. 


In addition to the number of photos of students that were able to 
be identified, the level of protection attached to the privacy of 
students was intriguing when compared to that of staff and 
community members. More specifically, while students had the 
highest number of faces detected in images, their isolated level of 
identifiability was the lowest of all three categories. We can also 
note that students and staff members together have drastically 
lower isolated levels of identifiability than community members: 
community members were generally easier to match to their names 
than individuals in our other categories. 


Taken together, these findings speak to concerns about privacy on 
social media, revealing that not only individuals’ actions and posts 
(e.g. [6, 9, 23]), but also those of educational institutions may 
pose risks for the privacy of a vulnerable societal group: minors at 
school. They suggest that the wide use of Facebook and ease of 
accessing posts coupled with identifiable posts of students may 
make this particular use of social media a key avenue through 
which students’ privacy is compromised. In this way, these 
findings add to prior research pointing out that young people may 
view privacy differently [17]. In addition, this research suggests to 
the educational data mining community that privacy risks to 
students may appear in unexpected contexts—and in contexts for 
which schools may, technically, not be violating the United States’ 
Family Educational Rights and Privacy Act (FERPA), but which 
may deserve greater scrutiny. 


4.2 Limitations and Recommendations 

This study represents an initial exploration of a topic that has been 
investigated extensively using other data sources and populations 
[6, 9]—and which could be investigated much further to better 
understand the nature of how students’ privacy may be threatened 
due to the increasingly widespread use of social media by K-12 
educational institutions. Due to the small size of our sample 
(compared to that of the population of photo posts), while we 
made some inferences from our sample to the population, these 
were associated with very wide ranges of plausible values: for 
example, we estimated that the number of identifiable students 
ranges from roughly 150,000 to more than one million, a range 
that makes it difficult to inform other researchers as well as 
administrators, educators, parents, and students about the scale of 
the threat to students’ privacy. In addition, there are certain 
statistical inferences that we are unable to make at this time: For 
instance, with a small number of posts from varying years, we 
must code a larger sample to be able to model change in privacy 
risk over time. 


It is important to consider the issue of parent consent in the 
context of student photos via public pages of schools and districts. 
While our sample data does not include specific information on 
each educational institution’s privacy policies, there has been 
past research performed regarding actions such as consent forms 
[3]. Students’ parents or legal guardians typically act as their 
agents of consent, which may appear to legitimize the publicizing 
of student faces. However, those making these crucial decisions 
may not have all of the necessary information to make these 
choices on behalf of their students. 


Future research may expand on the findings presented in this 
study by not only coding a greater number of posts, but also 
coding for different features of them. For instance, we noted that 
because many images in the latter part of 2020 included students 
wearing masks, there may surprisingly be a decrease in the 
number of identifiable faces during the COVID-19 pandemic. 
Future research that aims to mitigate risks may also note some of 
the features of posts which protect the privacy of students, and 
posts by schools or districts that achieve some of the benefits of 
educational institutions’ social media use. How accessible these 
posts are, both via the CrowdTangle [4] platform and via other 
means (authorized or unauthorized, e.g., through web-scraping), is 
another topic future scholarship can explore in greater depth, as 
the extent to which others can reproduce our analysis has a bearing 
on how extensive the threats to students’ privacy are. 
Limiting risks to students’ privacy may serve as a model to inform 
or prompt reflection on the part of the administrators and 
educators using their school’s or school districts’ Facebook 
account. Finally, future research might investigate what key 
stakeholders (students, parents, and teachers) think of the 
potential privacy risks around social media use. While past 
research has reported that teachers are uncomfortable with how 
social media platforms use student data [16], our results suggest 
that key individuals in schools may not draw connections between 
this lack of comfort and how their school or district uses social 
media, and survey research methods may complement our public 
data mining approach. 
ABSTRACT 


Collaboration is identified as a required and necessary skill 
for students to be successful in the fields of Science, Technol- 
ogy, Engineering and Mathematics (STEM). However, due 
to growing student population and limited teaching staff it is 
difficult for teachers to provide constructive feedback and in- 
still collaborative skills using instructional methods. Devel- 
opment of simple and easily explainable machine-learning- 
based automated systems can help address this problem. 
Improving upon our previous work, in this paper we propose 
using simple temporal-CNN deep-learning models to assess 
student group collaboration that take in temporal represen- 
tations of individual student roles as input. We check the ap- 
plicability of dynamically changing feature representations 
for student group collaboration assessment and how they 
impact the overall performance. We also use Grad-CAM 
visualizations to better understand and interpret the impor- 
tant temporal indices that led to the deep-learning model’s 
decision. 


Keywords 
K-12, Education, Collaboration Assessment, Explainable, 
Deep-Learning, CNN, Grad-CAM, Cross-modal Analysis. 


1. INTRODUCTION 


Collaboration is considered a crucial skill that needs to be 
inculcated in students early on for them to excel in STEM 
fields [24, 6]. Traditional instruction-based methods [14, 7] 
can often make it difficult for teachers to observe several 
student groups and identify specific behavioral cues that 
contribute to or detract from the collaboration effort [20, 15, 25]. 
This has resulted in a surge of interest in developing 
machine-learning-based automated systems to assess student group 
collaboration [17, 11, 12, 8, 1, 9, 26, 23, 21, 4, 27, 22]. 


In our earlier work we developed a multi-level, multi-modal 
conceptual model that serves as an assessment tool for indi- 
vidual student behavior and group-level collaboration qual- 
ity [2, 3]. Using the conceptual model as a reference, in a dif- 
ferent paper we developed simple MLP deep-learning mod- 
els that predict student group collaboration quality from 
histogram representations of individual student roles [22]. 
Please refer to the following papers for more information 
and for the illustration of the conceptual model [2, 3, 22]. 
Despite their simplicity and effectiveness, the MLP mod- 
els and histogram representations lack explainability and in- 
sight into the important student dynamics. To address this, 
in this paper we focus on using simple temporal-CNN deep 
learning models to check the scope of dynamically chang- 
ing temporal representations for student group collaboration 
assessment. We also use Grad-CAM visualizations to help 
identify important temporal instances of the task performed 
and how they contribute towards the model’s decision. 


Paper Outline: Section 2 provides necessary background on 
the different loss functions used, dataset description and the 
temporal features extracted. Section 3 describes the exper- 
iments and results. Section 4 concludes the paper. 


2. BACKGROUND 


2.1 Cross-Entropy Loss Functions 

Table 1: Coding rubric for Level A and Level B2. 

Level A                    | Level B2 
Effective [E]              | Group guide/Coordinator [GG] 
Satisfactory [S]           | Contributor (Active) [C] 
Progressing [P]            | Follower [F] 
Needs Improvement [NI]     | Conflict Resolver [CR] 
Working Independently [WI] | Conflict Instigator/Disagreeable [CI] 
                           | Off-task/Disinterested [OT] 
                           | Lone Solver [LS] 


Table 2: Inter-rater reliability (IRR) measurements. 

Level | Average Agreement | Cohen’s Kappa 
A     | 0.7046            | 0.4908 
B2    | 0.6741            | 0.5459 


The categorical-cross-entropy loss is the most commonly used 
loss function to train deep-learning models. For a classification 
problem with $C$ classes, let us denote the input variables 
as $\mathbf{x}$, ground-truth label vector as $\mathbf{y}$ and the 
predicted probability distribution as $\mathbf{p}$. Given a training 
sample $(\mathbf{x}, \mathbf{y})$, the categorical-cross-entropy (CE) 
loss is defined as 

$$\mathrm{CE}_{\mathbf{x}}(\mathbf{p}, \mathbf{y}) = -\sum_{i=1}^{C} y_i \log(p_i) \tag{1}$$

Here, $p_i$ denotes the predicted probability of the $i$-th class. 
Note, both $\mathbf{y}$ and $\mathbf{p}$ are of length $C$, with 
$\sum_i y_i = \sum_i p_i = 1$. 
From Equation 1, it’s clear that for imbalanced datasets the 
learnt weights of the model will be biased towards classes 
with the most number of samples in the training set. Ad- 
ditionally, if the label space exhibits an ordered structure, 
the categorical-cross-entropy loss will only focus on the pre- 
dicted probability of the ground-truth class while ignoring 
how far off the incorrectly predicted sample actually is. These 
limitations can be addressed to some extent by using the 
ordinal-cross-entropy (OCE) loss function [22], defined in 
Equation 2. 


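$$\mathrm{OCE}_{\mathbf{x}}(\mathbf{p}, \mathbf{y}) = -(1 + w) \sum_{i=1}^{C} y_i \log(p_i) \tag{2}$$

$$w = |\operatorname{argmax}(\mathbf{y}) - \operatorname{argmax}(\mathbf{p})|$$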


Here, (1+ w) represents the weighting variable, argmax re- 
turns the index of the maximum valued element and |.| re- 
turns the absolute value. When training the model, w = 0 
for correctly classified training samples, with the ordinal- 
cross-entropy loss behaving exactly like the categorical-cross- 
entropy loss. However, for misclassified samples the ordinal- 
cross-entropy loss will return a higher loss value. The in- 
crease in loss is proportional to how far away a sample is 
misclassified from its ground-truth class label. 
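
The two loss functions can be sketched for a single training sample as follows. This is a plain-Python illustration of Equations 1 and 2 rather than the training code used in the paper; the helper names and the example probability vectors are hypothetical.

```python
import math


def argmax(values):
    """Index of the largest element."""
    return max(range(len(values)), key=lambda i: values[i])


def cross_entropy(p, y):
    """Equation 1: -sum_i y_i * log(p_i) for one sample.

    Terms with y_i = 0 contribute nothing, so they are skipped,
    which also avoids evaluating log(0).
    """
    return -sum(y_i * math.log(p_i) for p_i, y_i in zip(p, y) if y_i > 0)


def ordinal_cross_entropy(p, y):
    """Equation 2: cross-entropy scaled by (1 + w), where w is the
    distance between the true and the predicted class indices."""
    w = abs(argmax(y) - argmax(p))
    return (1 + w) * cross_entropy(p, y)


# One-hot label for the third of five ordered classes.
y = [0, 0, 1, 0, 0]
p_correct = [0.05, 0.10, 0.70, 0.10, 0.05]  # predicted class matches: w = 0
p_far = [0.60, 0.10, 0.20, 0.05, 0.05]      # predicted class is two away: w = 2

print(cross_entropy(p_correct, y), ordinal_cross_entropy(p_correct, y))
print(cross_entropy(p_far, y), ordinal_cross_entropy(p_far, y))
```

For the correctly classified sample the two losses coincide; for the sample predicted two classes away, the ordinal loss is three times the categorical loss.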


2.2 Dataset Description 

We collected audio and video recordings from 15 student 
groups across five middle schools. Out of the 15 groups, 
13 groups had 4 students, 1 group had 3 students, and 1 
group had 5 students. The student volunteers completed 
a brief survey that collected their demographic information 
and other details, e.g., languages spoken, ethnicity and 
comfort levels with science concepts. Each group was tasked 
with completing 12 open-ended life science and physical 
science tasks, which required them to construct models of 
different science phenomena as a team. They were given one 
hour to complete as many tasks as possible, which resulted in 
15 hours of audio and video recordings. They were provided 
logistical and organizational instructions but received no help 
with group dynamics, group organization, or task completion. 


[Figure 1 image: grid titled "Level B2 Temporal Representation", with one column per minute (Minute 1 to Minute 24, the maximum task duration) and one row per student (up to Student-5).] 

Figure 1: Level B2 temporal representation for a group having 
only 4 students and finishing the assigned task in 4 minutes. 
Colored cells illustrate the different Level B2 codes as 
described in Table 1, and the gray cells represent empty or 
unassigned codes. 

Next, the data recordings were manually annotated by edu- 
cation researchers at SRI International. For the rest of the 
paper we will refer to them as coders/annotators. In our 
hierarchical conceptual model [2, 3], we refer to the collabo- 
ration quality annotations as Level A and individual student 
role annotations as Level B2. The coding rubric for these 
two levels is described in Table 1. Both levels were coded by 
three annotators. They had access to both audio and video 
recordings and used ELAN (an open-source annotation soft- 
ware) to annotate. A total of 117 tasks were coded by each 
annotator, with the duration of each task ranging from 5 to 
24 minutes. Moderate agreement was observed across the 
coders, as seen from the inter-rater reliability measurements 
in Table 2. 
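
For readers unfamiliar with the agreement statistic reported in Table 2, the following is a minimal sketch of Cohen's kappa for two raters. How the paper aggregates the statistic over its three annotators is not stated, so this two-rater form is an assumption, and the example labels are hypothetical.

```python
from collections import Counter


def cohens_kappa(rater_a, rater_b):
    """Cohen's kappa for two raters labelling the same items."""
    n = len(rater_a)
    # Observed agreement: fraction of items given the same label.
    observed = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    # Chance agreement: product of the raters' marginal label frequencies.
    freq_a, freq_b = Counter(rater_a), Counter(rater_b)
    expected = sum(freq_a[c] * freq_b[c] for c in freq_a) / (n * n)
    if expected == 1.0:
        return 1.0  # both raters used a single identical label throughout
    return (observed - expected) / (1 - expected)


# Hypothetical Level A codes from two annotators for four tasks.
coder_1 = ["S", "S", "P", "P"]
coder_2 = ["S", "S", "P", "S"]
print(cohens_kappa(coder_1, coder_2))
```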


Level A codes represent the target label categories for our 
classification problem. To determine the ground-truth Level 
A code for a task, we used the majority vote (code) across the 
three annotators. For cases where a majority was not possible, 
we used the Level A code ordering depicted in Table 1 and took 
the median of the three codes as the ground-truth. For example, 
if the three coders assigned Satisfactory, Progressing, and 
Needs Improvement to the same task, then Progressing would be 
used as the ground-truth label. Note that only 2 tasks lacked a 
majority Level A code. In total, we had only 351 data samples 
(117 tasks x 3 coders) with which to train the machine learning 
models. 
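
The ground-truth rule described above (majority vote, falling back to the median under the Level A ordering) can be sketched as follows. The function name is hypothetical, and the ordering list simply follows the Level A column of Table 1.

```python
from collections import Counter

# Level A codes in the order given in Table 1.
LEVEL_A_ORDER = [
    "Effective",
    "Satisfactory",
    "Progressing",
    "Needs Improvement",
    "Working Independently",
]


def ground_truth(codes):
    """Return the ground-truth Level A code for three annotator codes."""
    code, count = Counter(codes).most_common(1)[0]
    if count >= 2:
        return code  # at least two of the three annotators agree
    # No majority: take the middle code under the Level A ordering.
    ranks = sorted(LEVEL_A_ORDER.index(c) for c in codes)
    return LEVEL_A_ORDER[ranks[1]]


print(ground_truth(["Satisfactory", "Progressing", "Needs Improvement"]))
# prints "Progressing", as in the example from the text
```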


2.3 Temporal Representation 

In our dataset, the longest task was a little less than 24 
minutes, so the length of all tasks was set to 24 minutes. 
Level B2 was coded using fixed-length 1 minute segments, as 
illustrated in Figure 1. Because the segments have a fixed 
length, we assigned an integer value to each B2 code, i.e., 
the seven B2 codes were assigned values from 1 to 7. The 
value 0 was used to represent segments that were not assigned 
a code. For example, in Figure 1 we see a group of 4 
students completing a task in just 4 minutes, represented by 
the colored cells. The remaining 20 minutes and the 5th student 
are assigned the value zero, represented by the gray cells. 
Thus for each task, the Level B2 temporal features have a 
shape of 24 x 5, with 24 representing the number of minutes and 
5 representing the number of students in the group. 
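
A minimal sketch of this encoding is given below. The particular assignment of the integers 1 to 7 to the B2 codes is an assumption (the text only states that such values were assigned), and the function name and example group are hypothetical.

```python
MAX_MINUTES = 24   # maximum task duration, in 1 minute segments
MAX_STUDENTS = 5   # largest group size in the dataset

# Assumed mapping: codes numbered 1..7 in the order listed in Table 1.
B2_CODES = [
    "Group guide/Coordinator",
    "Contributor (Active)",
    "Follower",
    "Conflict Resolver",
    "Conflict Instigator/Disagreeable",
    "Off-task/Disinterested",
    "Lone Solver",
]
B2_VALUE = {code: i + 1 for i, code in enumerate(B2_CODES)}


def encode_task(per_student_codes):
    """Build the 24 x 5 grid of integer codes for one task.

    per_student_codes holds one list per student, each containing that
    student's B2 code for every coded minute. Unused minutes and absent
    students keep the value 0.
    """
    grid = [[0] * MAX_STUDENTS for _ in range(MAX_MINUTES)]
    for student, codes in enumerate(per_student_codes):
        for minute, code in enumerate(codes):
            grid[minute][student] = B2_VALUE[code]
    return grid


# Hypothetical group of 4 students who finish a task in 4 minutes.
group = [
    ["Group guide/Coordinator"] * 4,
    ["Contributor (Active)"] * 4,
    ["Follower"] * 4,
    ["Off-task/Disinterested"] * 4,
]
features = encode_task(group)
print(len(features), len(features[0]))  # 24 5
```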


Baseline Histogram Representation: We compare the perfor- 
mance of the temporal representations against simple his- 
togram representations [22]. The histogram representations 
were created by pooling over all the codes observed over the 
duration of the task and across all the students. Note, only 
one histogram was generated per task, per group. Once the 
histogram is generated we normalize it by dividing by the 
total number of codes in the histogram. Normalizing the 
histogram removes the temporal aspect of the task. For example, 
suppose group-1 took 10 minutes to solve a task and group-2 
took 30 minutes to solve the same task, but both groups 
were assigned the same Level A code despite group-1 finishing 
the task sooner. The raw histogram representations 
of these two groups would look different due to the difference 
in the number of segments coded. However, normalized 
histograms would make them more comparable. Although the 
normalized histogram representation is simple and effective, 
it fails to offer any insight or explainability. 
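
The effect of normalization can be sketched as follows, assuming each task is given as a grid of integer codes (0 for unassigned, 1 to 7 for the B2 codes). The function name and the two example grids are hypothetical.

```python
def normalized_histogram(grid, num_codes=7):
    """Pool all assigned codes over time and students, then normalize."""
    counts = [0] * num_codes
    for row in grid:
        for value in row:
            if value > 0:  # 0 marks an unassigned segment
                counts[value - 1] += 1
    total = sum(counts)
    if total == 0:
        return [0.0] * num_codes
    return [c / total for c in counts]


# Two hypothetical groups with the same mix of codes but different durations.
short_task = [[1, 1, 2, 2, 0] for _ in range(2)]  # 2 coded minutes
long_task = [[1, 1, 2, 2, 0] for _ in range(6)]   # 6 coded minutes

print(normalized_histogram(short_task))
print(normalized_histogram(long_task))
```

Both groups yield the same normalized histogram even though the raw counts differ by a factor of three, which is why the duration of the task is no longer visible to the classifier.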


3. EXPERIMENTS 


Network Architecture: For the temporal-CNN deep learning 
model we used the temporal ResNet architecture described 
in [28]. The ResNet architecture uses skip connections be- 
tween each residual block to help avoid the vanishing gra- 
dient problem. It has shown state-of-the-art performance 
in several computer vision applications [10]. Following [28], 
our ResNet model consists of three residual blocks stacked 
over one another, followed by a global-average-pooling layer 
and a softmax layer. The number of filters for each residual 
block was set to 64, 128, 128 respectively. The number of 
learnable parameters for the B2 temporal representations is 
506949. We compare the performance of the ResNet model 
to the MLP models described in our previous work. Inter- 
ested readers should refer to [22] for more information about 
the baseline MLP model that was used with the histogram 
representation. 


Training and Evaluation Protocol: All models were developed 
using Keras with the TensorFlow backend [5]. We used the 
Adam optimizer [13] and trained all models for 500 epochs. 
The batch-size was set to one-tenth of the number of training 
samples during any given training-test split. We optimized 
over the Patience and Minimum-Learning-Rate hyperparameters, 
which were set during the training process. 
We focused on these as they significantly influenced the 
model’s classification performance. The learning-rate was 
reduced by a factor of 0.5 if the loss did not change after a 
certain number of epochs, indicated by the Patience hyper- 
parameter. We saved the best model that gave us the lowest 
test-loss for each training-test split. We used a round-robin 
leave-one-group-out cross validation protocol. This means 
that for our dataset consisting of g student groups, for each 
training-test split we used data from g - 1 groups for training 
and the left-out group was used as the test set. This 
was repeated for all g groups and the average result was re- 
ported. For our experiments g = 14 though we have tempo- 
ral representations from 15 student groups. This is because 
all samples corresponding to the Effective class were found 
only in one group. Due to this reason and because of our 
cross-validation protocol we do not see any test samples for 
the Effective class. 
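
The leave-one-group-out protocol can be sketched as a generator over training and test indices. This is an illustration only; the function name and the example group labels are hypothetical, and the model training step is omitted.

```python
def leave_one_group_out(group_ids):
    """Yield (held_out_group, train_indices, test_indices) for each group.

    group_ids[i] is the student group that sample i belongs to.
    """
    for held_out in sorted(set(group_ids)):
        train = [i for i, g in enumerate(group_ids) if g != held_out]
        test = [i for i, g in enumerate(group_ids) if g == held_out]
        yield held_out, train, test


# Hypothetical dataset: six samples drawn from three student groups.
sample_groups = [1, 1, 2, 3, 3, 3]
for held_out, train_idx, test_idx in leave_one_group_out(sample_groups):
    print(held_out, train_idx, test_idx)
```

Because every sample from a group lands in the same split, a class that occurs in only one group can never appear in both the training and the test set of the same split.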


Table 3: Weighted precision, weighted recall and weighted 
F1-score (Mean±Std) for the best MLP and ResNet models 
under different settings. 

Feature      | Classifier                                             | Weighted Precision | Weighted Recall | Weighted F1-Score 
B2 Histogram | SVM                                                    | 84.45±13.43        | 73.19±16.65     | 76.92±15.39 
B2 Histogram | MLP - Cross-Entropy Loss                               | 83.72±16.50        | 86.42±10.44     | 84.40±13.85 
B2 Histogram | MLP - Cross-Entropy Loss + Class-Balancing             | 83.93±17.89        | 85.29±14.37     | 84.16±16.23 
B2 Histogram | MLP - Ordinal-Cross-Entropy Loss                       | 86.96±14.56        | 88.78±10.36     | 87.03±13.16 
B2 Histogram | MLP - Ordinal-Cross-Entropy Loss + Class-Balancing     | 86.73±14.43        | 88.20±9.66      | 86.60±12.54 
B2 Temporal  | ResNet - Cross-Entropy Loss                            | 84.75±13.21        | 83.10±11.92     | 82.72±12.74 
B2 Temporal  | ResNet - Cross-Entropy Loss + Class-Balancing          | 84.03±15.13        | 83.28±11.42     | 82.97±12.84 
B2 Temporal  | ResNet - Ordinal-Cross-Entropy Loss                    | 85.24±15.68        | 87.23±10.52     | 85.56±13.38 
B2 Temporal  | ResNet - Ordinal-Cross-Entropy Loss + Class-Balancing  | 84.34±15.75        | 87.88±11.22     | 85.68±13.58 


3.1 Temporal vs Histogram Representations 
Here, we compare the performance of the ResNet and MLP 
models. Using the weighted F1-score performance, Table 3 
summarizes the best performing ResNet and MLP models 
for the different feature-classifier variations. The table also 
provides the weighted precision and recall metrics. Bold val- 
ues in the table represent the best classifier across the differ- 
ent feature-classifier settings. The ordinal-cross-entropy loss 
with or without class-balancing shows the highest weighted 
F1-score performance for both feature types. Here, class- 
balancing refers to weighting each data sample by a weight 
that is inversely proportional to the number of data samples 
corresponding to that sample’s ground-truth label. 
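
As an illustration of this weighting, the sketch below assigns each sample a weight inversely proportional to the size of its class. The normalization constant (number of samples divided by number of classes) is an assumption borrowed from a common "balanced" heuristic; the text above only states that the weights are inversely proportional to class counts. The function name and toy labels are hypothetical.

```python
from collections import Counter


def class_balanced_weights(labels):
    """Return one weight per sample, inversely proportional to its class count.

    With the assumed normalization n_samples / (n_classes * count), the
    weights sum to n_samples, so the overall loss scale is unchanged.
    """
    counts = Counter(labels)
    n_samples = len(labels)
    n_classes = len(counts)
    return [n_samples / (n_classes * counts[label]) for label in labels]


if __name__ == "__main__":
    toy_labels = ["Satisfactory"] * 3 + ["Needs Improvement"]
    print(class_balanced_weights(toy_labels))
```

Samples from the rarer class receive a larger weight, so each class contributes equally to the total loss.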


At first glance, the ResNet models perform slightly worse than 
the MLP models. This could easily lead us to believe that 
simple histogram representations are enough to achieve a 
higher classification performance than the corresponding tem- 
poral representations. However, despite the performance dif- 
ferences, the temporal features and ResNet models help bet- 
ter explain and pin-point regions in the input feature space 
that contribute the most towards the model’s decision. This 
is important if one wants to understand which student roles 
are most influential in the model’s prediction. We will go 
over this aspect in more detail in the next section. 


3.2 Grad-CAM Visualization 


Grad-CAM uses class-specific gradient information, flowing 
into the final convolutional layer to produce a coarse local- 
ization map that highlights the important regions in the 
input feature space [19]. It is primarily used as a post 
hoc analysis tool and is not used in any way to train the 
model. Figure 2 illustrates how Grad-CAM can be used for 
our classification problem. We show two different samples 
from the Satisfactory, Progressing and Needs Improvement 
classes respectively. Each sample shows a group consist- 
ing of 4 students that completed the task in 5 to 8 minutes. 
Technically one can obtain C Grad-CAM maps for a C-class 
classification problem. Here, the samples shown correspond 
to the class predicted by the ResNet model, which is also 
the ground-truth class. It’s clear how the Grad-CAM high- 
lights regions in the input feature space that contributed 
towards the correct prediction. For instance, in the Needs 
Improvement examples, the Grad-CAM map shows the high- 
est weight on the fourth minute. At that time for the first 
example, the codes for three of the students become Off- 
task/Disinterested. Similarly, for the second example we 
notice three of the students become Lone Solvers and the 
fourth student becomes a Follower. This is in stark contrast 
to the minute before when two of the students were Follow- 
ers and the other two were Contributors. We also notice less 
importance being given to the Empty codes. These changes 
in roles and the Grad-CAM weights across the task make 
sense and help promote explainability in our deep learning 
models. 


Figure 2: Grad-CAM visualization for two different temporal samples from different Level-A classes. 
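
The Grad-CAM computation along the temporal dimension can be sketched as follows. This is an illustrative re-implementation of the general Grad-CAM recipe in plain Python, not the code used for Figure 2: it assumes the activations of the final convolutional layer and the gradients of the chosen class score with respect to those activations have already been extracted from the network, and all names and toy numbers are hypothetical.

```python
def grad_cam_1d(feature_maps, gradients):
    """Compute a temporal Grad-CAM map.

    feature_maps[t][k]: activation of filter k at time step t (final conv layer).
    gradients[t][k]:    d(class score) / d(feature_maps[t][k]).
    Returns one non-negative importance value per time step, scaled to [0, 1].
    """
    n_steps = len(feature_maps)
    n_filters = len(feature_maps[0])
    # Filter weights: gradients averaged over the temporal dimension.
    alphas = [
        sum(gradients[t][k] for t in range(n_steps)) / n_steps
        for k in range(n_filters)
    ]
    # Weighted sum of feature maps followed by a ReLU.
    cam = [
        max(0.0, sum(alphas[k] * feature_maps[t][k] for k in range(n_filters)))
        for t in range(n_steps)
    ]
    peak = max(cam)
    return [value / peak for value in cam] if peak > 0 else cam


if __name__ == "__main__":
    activations = [[1.0, 0.0], [0.0, 2.0], [4.0, 1.0]]   # 3 time steps, 2 filters
    grads = [[0.5, -1.0], [0.5, -1.0], [0.5, -1.0]]
    print(grad_cam_1d(activations, grads))
```

The time step with the largest value is the one that contributed most to the class score, which is how a peak such as the fourth minute in the Needs Improvement examples would be read.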


4. CONCLUSION 


In this paper we proposed using simple temporal represen- 
tations of individual student roles together with temporal 
ResNet deep-learning architectures for student group col- 
laboration assessment. Our objective was to develop more 
explainable systems that allow one to understand which in- 
stances in the input feature space led to the deep-learning 
model’s decision. We suggested use of Grad-CAM visualiza- 
tion along the temporal dimension to assist in locating im- 
portant time instances in the task performed. We compared 
the performance of the proposed temporal representations 
against simpler histogram representations from our previ- 
ous work [22]. While histogram representations can help 
achieve high classification performance, they do not offer 
the same key insights that one can get using the temporal 
representations. 


Limitations and Future Work: The visualization tools and 
findings discussed in this paper can help guide and shape 
future work in this area. Having said that, our approach 
can be further extended and improved in several ways. For 
example, we only discuss Grad-CAM maps along the tem- 
poral dimensions. This only allows us to identify impor- 
tant temporal instances of the task but does not focus on 
the important student interactions. The current setup does 
not tell us which subset of students are interacting and how 
that could affect the overall group dynamic and collabora- 
tion quality. To address this, we intend to explore other 
custom deep-learning architectures and feature representa- 
tion spaces. We also plan on using other tools like LIME [18] 
and SHAP [16]. These packages compute the importance of 
the different input features and help towards better model 
explainability and interpretability. Also, we only focused on 
mapping deep learning models from individual student roles 
to overall group collaboration. In the future, we intend to ex- 
plore other branches in the conceptual model, described in 
[2, 3]. We also plan on developing recommendation systems 
that can assist and guide students to improve themselves 
by suggesting what they need to take on. The same sys- 
tem could also be tweaked specifically for teachers to give 
them insight on how different student interactions could be 
improved to facilitate better group collaboration. 


ABSTRACT 


In this paper, we introduce the CommonLit Ease of Readability 
(CLEAR) corpus. The corpus provides researchers within the 
educational data mining community with a resource from which 
to develop and test readability metrics and to model text 
readability. Improvements of the CLEAR corpus over previous readability 
corpora include size (N = ~5,000 reading excerpts), the breadth of 
the excerpts available, which cover over 250 years of writing in 
two different genres, and the readability criterion used (teachers’ 
ratings of text difficulty for their students). This paper discusses 
the development of the corpus and presents reliability metrics as 
well as initial analyses of readability. 


Keywords 


Text readability, corpus linguistics, pairwise comparisons 


1. INTRODUCTION 


Reading is an essential skill for academic success. One important 
way to support and scaffold literacy challenges faced by students 
is to match text difficulty to their reading abilities. Providing 
students with texts that are accessible and well matched to their 
abilities helps to ensure that students better understand the text 
and, over time, can help readers improve their reading skills. 
Readability formulas, which provide an overview of text 
difficulty, have shown promise in more accurately benchmarking 
students with their text difficulty level, allowing students to read 
texts at target readability levels. 


Most educational texts are matched to readers using traditional 
readability formulas like Flesch-Kincaid Grade Level (FKGL) 
[19] or commercially available formulas such as Lexile [30] or the 
Advantage-TASA Open Standard (ATOS) [29]. However, both 
types of readability formulas are problematic. Traditional 
readability formulas lack construct and theoretical validity 
because they are based on weak proxies of word decoding (i.e., 
characters or syllables per word) and syntactic complexity (i.e., 
number of words per sentence) and ignore many text features that 
are important components of reading models including text 
cohesion and semantics. Additionally, many traditional readability 
formulas were normed using readers from specific age groups on 
small corpora of texts taken from specific domains. Commercially 
available readability formulas are not publicly available, may not 
have rigorous reliability tests, and may be cost-prohibitive for 
many schools and districts let alone teachers. 
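
To make concrete how little information the traditional formulas use, the sketch below computes Flesch Reading Ease and Flesch-Kincaid Grade Level from the standard published coefficients. The sentence splitter and the vowel-group syllable counter are rough assumptions made for illustration; production tools use more careful tokenization and syllabification, so scores will differ somewhat.

```python
import re


def count_syllables(word):
    """Rough heuristic (an assumption): one syllable per group of vowels."""
    groups = re.findall(r"[aeiouy]+", word.lower())
    return max(1, len(groups))


def text_stats(text):
    """Return (number of sentences, number of words, number of syllables)."""
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    words = re.findall(r"[A-Za-z']+", text)
    syllables = sum(count_syllables(word) for word in words)
    return len(sentences), len(words), syllables


def flesch_reading_ease(text):
    """Higher scores indicate easier text."""
    n_sent, n_words, n_syll = text_stats(text)
    return 206.835 - 1.015 * (n_words / n_sent) - 84.6 * (n_syll / n_words)


def flesch_kincaid_grade_level(text):
    """Higher scores indicate harder text (approximate U.S. grade level)."""
    n_sent, n_words, n_syll = text_stats(text)
    return 0.39 * (n_words / n_sent) + 11.8 * (n_syll / n_words) - 15.59


if __name__ == "__main__":
    sample = "The cat sat. The dog ran."
    print(flesch_reading_ease(sample), flesch_kincaid_grade_level(sample))
```

Both scores depend only on words per sentence and syllables per word; nothing in them reflects cohesion or meaning.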


In this paper, we introduce the open-source CommonLit Ease 
of Readability (CLEAR) corpus. The corpus is a collaboration 
between CommonLit, a non-profit education technology 
organization focused on improving reading, writing, 
communication, and problem-solving skills, and Georgia State 
University (GSU) with the end goal of promoting the 
development of more advanced and open-source readability 
formulas that government, state, and local agencies can use in 
testing, materials selection, material creation, and other 
applications commonly reserved for readability formulas. The 
formulas that will be derived from the CLEAR corpus will be 
open-source and ostensibly based on more advanced natural 
language processing (NLP) features that better reflect the reading 
process. The accessibility of these formulas and their reliability 
should lead to immediate uptake by students, teachers, parents, 
researchers, and others, increasing opportunities for meaningful 
and deliberate reading experiences. We outline the importance of 
text readability along with concerns about previous readability 
formulas below. We also present the methods used to develop 
the CLEAR corpus. We then examine how well traditional and 
newer readability formulas correlate with the reading criteria 
reported in the CLEAR corpus and discuss next steps. 


2. TEXT READABILITY 


Text readability can be defined as the ease with which a text can 
be read (i.e., processed) and understood in terms of the linguistic 
features found in that text [9][27]. However, in practice, many 
readability formulas are more focused on measuring text 
understanding (e.g., [18]) than text processing. 


Text comprehension is generally associated with word 
sophistication, syntactic complexity, and discourse structures 
[17][31], three features whose textual elements relate to text 
complexity. For example, many studies have revealed that word 
sophistication features such as sound and spelling relationships 
between words [16][25], word familiarity and frequency [15], and 
word imageability and concreteness [28] can result in faster word 
processing and more accurate word decoding. The meaning of 
words, or semanticity, also plays an important role in text 
readability, in that readers must be able to recognize words and 
know their meaning [26]. Therefore, word semanticity and larger 
text segments can facilitate the linking of common themes and 
easier processing based on background knowledge and text 
familiarity [1][23]. 


Effective readers should also be able to parse syntactic structures 
within a text to help organize main ideas and assign thematic roles 
where necessary [13][26]. Two features that allow for quicker 
syntactic parsing are words or morphemes per t-unit [8] and 
sentence length [21]. Parsing information in the text helps readers 
develop larger discourse structures that result in a discourse thread 
[14]. These structures, which relate to text cohesion, can be 
partially constructed using linguistic features that link words and 
concepts within and across syntactic structures [12]. Sensitivity to 
these cohesion structures allows readers to build relationships 
between words, sentences, and paragraphs, aiding in the 
construction of knowledge representations [4][20][23]. Moreover, 
such sensitivity can help readers understand larger discourse 
segments in texts [11][26]. 


Traditional readability formulas tend to use only proxy estimates for 
measuring lexical and syntactic features. Moreover, they disregard 
the semantic features and discourse structures of texts. For 
instance, these formulas ignore text features including text 
cohesion [4][20][23][24] and style, vocabulary, and grammar, 
which play important roles in text readability [1]. Additionally, 
the reading criteria used to develop traditional formulas are often 
based on multiple-choice questions and cloze tests, two methods 
that may not measure text comprehension accurately [22]. Finally, 
traditional readability formulas are suspect because they have 
been normed using readers from specific age groups and using 
small corpora of texts from specific domains. 


Newer formulas, both commercial and academic, generally 
outperform traditional readability formulas. These formulas rely 
on more advanced NLP features, although this may not be the 
case with commercial formulas for which text features within the 
formulas are proprietary and, thus, not publicly available. Newer 
formulas come with their own issues though. For instance, 
commercially available formulas, such as the Lexile framework 
[30] and the Advantage-TASA Open Standard for Readability 
(ATOS) formula [29], often lack suitable validation studies. In 
addition, accessing commercially available formulas may come at 
a financial cost that is unaffordable for some schools and 
education technology organizations. Academic formulas such as 
the Crowdsourced Algorithm of Reading Comprehension 
(CAREC) [7] have been validated through rigorous empirical 
studies, are transparent in their underlying features, and are free to 
the public. However, the datasets on which they have been 
developed, while much larger than traditional readability 
formulas, can still be considered as relatively small and specific. 
The populations the formulas are trained on (i.e., adults) may also 
not generalize well to other target populations like young students. 


3. CURRENT STUDY 


We hope to spur innovation to address many of the concerns 
noted above in reference to both traditional and newer readability 
formulas by publicly releasing the CommonLit Ease of 
Readability (CLEAR) corpus as well as hosting an open-source 
competition to develop readability formulas based on the CLEAR 
corpus. We hope that these formulas outperform existing 
readability formulas and can be used to better match 3rd-12th grade 
students to texts, thus improving learning outcomes in primary 
and secondary classrooms. 


4. THE CLEAR CORPUS 
4.1 Corpus Collection 


We collected text excerpts from the CommonLit organization’s 
database, Project Gutenberg, Wikipedia, and dozens of other open 
digital libraries. Excerpts were selected from the beginning, 
middle, and end of texts and only one sample was selected per 
text. Text excerpts were selected to be between 140 and 200 words, 
with all excerpts beginning and ending at an idea unit (i.e., we did 
not cut excerpts in the middle of sentences or ideas). The text 
excerpts were written between 1791 and 2020, with the majority 
of excerpts selected between 1875 and 1922 (when copyrights 
expired) and between 2000 and 2020 (when non-copyright texts 
were available on the internet). Visualizations of these trends are 
available in Figure 1. 


Figure 1 


Excerpts were selected from two genres: informational and 
literature texts. We started with an initial sample of ~7,600 texts. 
Each excerpt was read by at least two raters and judged on 
acceptability. The two major criteria for acceptability were the 
likelihood of being used in a 3rd-12th grade classroom and 
whether or not the topic was appropriate. We used Motion Picture 
Association of America (MPAA) ratings (e.g., G, PG, PG-13) to 
flag texts by appropriateness. Texts that were flagged as 
potentially inappropriate were then read by an expert rater and 
either included or excluded from the corpus. We also conducted 
automated searches for traumatic terms (e.g., terms related to 
racism, genocide, or sexual assault). Any excerpt flagged for 
traumatic terms was also reviewed by an expert rater. Lastly, we 
limited author representation such that each author had no more 
than 12 excerpts within the corpus. After removing excerpts based 
on these criteria, we were left with 4,793 excerpts. These excerpts 
were copy-edited to ensure texts did not contain grammatical, 
syntactic, and spelling errors. Punctuation was also standardized 
in the texts, as were line-breaks. Lastly, selected archaic spellings 
(e.g., to-day, Servia) were replaced with modern spellings (e.g., 
today, Serbia) and identified British English spellings were 
converted to American spellings. 


4.2 Human Ratings of Readability 

We recruited ~1,800 teachers from the CommonLit teacher pool 
through an e-mail marketing campaign. Teachers were asked to 
participate in an online collection experiment. They were 
expected to read 100 pairs of excerpts and make a judgment for 
each pair as to which excerpt was easier to understand. Teachers 
were paid $50 in an Amazon gift card for their participation. 


4.3 Data Collection Site 


We developed an online data collection website. The basic format 
of the site was to show two excerpts side by side and ask 
participants to judge which of the two texts would be easier for a 
student to understand using a checkbox format. There were two 
additional buttons on the website. The first moved the participant 
to the next comparison and the second allowed participants to 
pause the experiment. The website also included a progress tally 
to show participants how many comparisons they had made (see 
Figure 2 for screenshot of pairwise comparison task). 


Figure 2: Screenshot of the pairwise comparison task. Instructions shown: 
(1) Read both texts. (2) Answer the question. (3) Click “Rate Next Set”. 
Question shown: “Which text is easier for students to understand?” 
Progress tally shown: 0/100 trials completed. 


The website first provided participants with informed consent and 
an overview of the expectations. The website then collected 
simple demographic information and survey information about 
reading/writing and television habits. Participants were then given 
a practice excerpt comparison to familiarize them with the design. 
After the practice comparison, participants moved forward with 
the data collection. Excerpts were paired randomly, and excerpts 
were shown on either the right or left-side panel randomly. The 
licensing information and the uniform resource locator (URL) for 
each text were displayed on the bottom side of each panel. 
Participants were redirected to a break screen after completing 
every 20 comparisons. The break screen showed how much time 
(in total and per comparison) the participant had spent on the task. 
A button allowing the participant to continue to the next 
comparison appeared after spending one minute on the break 
screen, meaning that the participants were required to take at least 
a one-minute break per 20 comparisons. After completing 100 
comparisons, the participants were given a completion code that 
they could redeem for the gift card. The website was written in 
Python, JavaScript, CSS, and HTML. The website was housed on 
a cloud server. 


4.4 Participant Reliability 

Of the ~1,800 participants that initially logged into the 
experiment, 1,198 completed the entire experiment. However, not 
all participant data was kept. We removed participants who did 
not complete the entire experiment. We also removed participants 
to increase the reliability of the pairwise scores based on deviant 
patterns and time spent on judgments. In terms of deviant patterns, 
we removed all participants who selected excerpts in either the 
right or left panel more than 70% of the time. We also removed 
participants who had binary patterns of selecting left/right or 
right/left panels more than 20 times in a row. In terms of time 
spent on judgments, we removed participants who spent less than 
10 seconds on average per comparison and/or spent a median time 
under 5 seconds. After removing participants based on patterns 
and time, we were left with data from 1,116 participants. Those 
participants made 111,347 overall comparison judgments (M = 
99.773 judgments per participant). On average, each excerpt was 
read 46.47 times and participants spent an average of 101.36 
seconds per judgment. However, we did not remove participants 
for taking too long on judgments, especially since pauses were 
allowed. Thus, our data for time was right skewed. 
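
The exclusion rules above can be expressed as a short filter. The sketch below is a hypothetical re-implementation, not the script used for the corpus: the data layout (one "L" or "R" per judgment plus the seconds spent) is an assumption, and the alternation rule is interpreted here as a strictly alternating streak of more than 20 consecutive choices.

```python
from statistics import mean, median


def longest_alternating_run(choices):
    """Length of the longest streak in which every choice differs from the previous one."""
    if not choices:
        return 0
    longest = run = 1
    for previous, current in zip(choices, choices[1:]):
        run = run + 1 if current != previous else 1
        longest = max(longest, run)
    return longest


def keep_participant(choices, seconds, required=100):
    """Return True if a participant passes all reliability checks.

    choices: list of "L" or "R", the panel selected in each comparison.
    seconds: list of time spent (in seconds) on each comparison.
    """
    if len(choices) < required:                       # did not finish the experiment
        return False
    left_share = choices.count("L") / len(choices)
    if left_share > 0.70 or (1 - left_share) > 0.70:  # one panel chosen too often
        return False
    if longest_alternating_run(choices) > 20:         # left/right alternation pattern
        return False
    if mean(seconds) < 10 or median(seconds) < 5:     # judgments made too quickly
        return False
    return True


if __name__ == "__main__":
    balanced = ["L", "L", "R", "R"] * 25
    print(keep_participant(balanced, [30.0] * 100))
    print(keep_participant(["L", "R"] * 50, [30.0] * 100))
```

No upper bound on time is applied, matching the decision not to remove participants for taking too long.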


4.5 Pairwise Rankings for Readability 

To calculate pairwise comparison scores for the human judgments 
of text ease, we used a Bradley-Terry model [3]. A Bradley-Terry 
model describes the probabilities of the possible outcomes when 
items are judged against one another in pairs (see Equation 1). 
The Bradley-Terry model ranks documents by difficulty based on 
each excerpt's probability to be easier than other excerpts. The 
model creates a maximum likelihood estimate which iteratively 
converges towards a unique maximum that defines the ranking of 
the excerpts (i.e., the easiest texts have the highest probability). 


Equation 1: Bradley-Terry Model 
\[ P(\text{text}_i \text{ more difficult than } \text{text}_j) = \frac{y_i}{y_i + y_j} \] 
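
As a minimal sketch of how such a model can be fit, the code below estimates Bradley-Terry strengths with the standard iterative minorization-maximization update. The software actually used for the corpus is not stated here, so the function names, the fixed iteration count, and the toy comparisons are assumptions. The "winner" of a comparison is whichever text was selected; every item is assumed to win at least once, since otherwise its estimated strength collapses to zero.

```python
def fit_bradley_terry(n_items, comparisons, n_iter=200):
    """Estimate Bradley-Terry strengths from pairwise outcomes.

    comparisons: list of (winner, loser) index pairs.
    Returns a list of positive strengths that sum to 1.
    """
    wins = [0] * n_items
    pair_counts = {}
    for winner, loser in comparisons:
        wins[winner] += 1
        key = (min(winner, loser), max(winner, loser))
        pair_counts[key] = pair_counts.get(key, 0) + 1

    strengths = [1.0 / n_items] * n_items
    for _ in range(n_iter):
        denom = [0.0] * n_items
        for (i, j), n_ij in pair_counts.items():
            shared = n_ij / (strengths[i] + strengths[j])
            denom[i] += shared
            denom[j] += shared
        updated = [
            wins[i] / denom[i] if denom[i] > 0 else strengths[i]
            for i in range(n_items)
        ]
        total = sum(updated)
        strengths = [value / total for value in updated]
    return strengths


def win_probability(strengths, i, j):
    """P(item i is selected over item j) under the fitted model."""
    return strengths[i] / (strengths[i] + strengths[j])


if __name__ == "__main__":
    toy = [(0, 1)] * 3 + [(1, 0)]          # item 0 selected 3 times out of 4
    fitted = fit_bradley_terry(2, toy)
    print(win_probability(fitted, 0, 1))
```

Sorting items by their fitted strength yields the ranking; taking the logarithm of the strengths gives coefficients on an unbounded scale.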


After computation, the Bradley-Terry model provides a 
coefficient for each text along with a standard error. We examined 
both coefficients and standard errors for outliers. We found 52 
texts that had a coefficient with a standard deviation greater than 
2.5 and an additional 17 excerpts with a standard error greater than 
0.65. These were removed from the final dataset leaving us with a 
sample size of 4,724. We conducted two additional analyses of the 
final data set in terms of differences in Bradley-Terry coefficients 
between informational and literature texts and trends in the 
coefficients as a function of time of publication for the texts. 


As expected, we found significant differences between 
informational and literature texts such that informational texts 
were rated significantly more difficult (t(4723) = -20.95, p < 
.001), with a moderate effect size (d = -0.61). See Figure 3 for a 
box plot depicting this difference in text categories. In addition, 
we used a Pearson’s correlation test to test whether Bradley-Terry 
coefficients were correlated with the texts’ year of publication, 
finding a weak correlation (r(4722) = .20, p < .001). Thus, more 
recent passages were often rated as simpler than older passages 
(see Figure 1). 


Figure 3: Boxplot of BT coefficient (y-axis) by text category (x-axis) 


4.6 Pairwise Scoring Validation Checks 

To examine convergent validity for the pairwise scores, we 
examined correlations between the scores and classic and newer 
readability formulas. The formulas we included were Flesch 
Reading Ease, Flesch Kincaid Grade Level, the New Dale-Chall, 
and the Crowdsourced Algorithm of Reading Comprehension [7]. 
All formulas were calculated using the Automatic Readability 
Tool for English (ARTE) [6]. ARTE provides free and easy 
access to a wide range of readability formulas and is available at 
linguisticanalysistools.org. ARTE automatically calculates 
different readability formulas for batches of texts (i.e., thousands 
of texts can be run at a time) and produces readability scores for 
individual texts in an accessible spreadsheet output. ARTE was 
developed to help educators and researchers easily process texts 
and derive different readability metrics allowing them to compare 
that output and choose formulas that best fit their purpose. The 
tool is written in Python and is packaged in a user-friendly GUI 
that is available for use in Windows and Mac operating systems. 
Correlations for this analysis are reported in Table 1. 


Table 1: Correlations between readability formulas and text ease 


            FRE      FKGL     NDC      CAREC 
Text ease   0.547   -0.517   -0.557   -0.582 
FRE                 -0.913   -0.829   -0.726 
FKGL                          0.676    0.579 
NDC                                    0.739 

*FRE = Flesch Reading Ease, FKGL = Flesch-Kincaid Grade 
Level, NDC = New Dale-Chall, CAREC = Crowdsourced 
Algorithm of Reading Comprehension 


The results indicate strong overlap between the four selected 
readability formulas and the text ease scores reported by the 
Bradley-Terry model. The strongest correlations were reported for 
CAREC while the weakest correlations were reported for FKGL. 
While strong, the correlations indicate that the readability 
formulas only predict around 27%-34% of the variance in the 
reading ease scores. Thus, there are opportunities for 
improvement in future readability formulas. 
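
The variance figures follow from squaring the correlations with text ease in Table 1. The short check below reproduces that arithmetic; the function and variable names are illustrative only.

```python
def variance_explained(r):
    """Share of variance linearly accounted for by a correlation r (r squared)."""
    return r * r


# Correlations with text ease, copied from Table 1.
text_ease_correlations = {"FRE": 0.547, "FKGL": -0.517, "NDC": -0.557, "CAREC": -0.582}

shares = {name: variance_explained(r) for name, r in text_ease_correlations.items()}

if __name__ == "__main__":
    for name, share in sorted(shares.items(), key=lambda item: item[1]):
        print(f"{name}: {share:.1%}")
```

The shares range from roughly 27% for FKGL to roughly 34% for CAREC, leaving about two thirds of the variance in text ease unexplained by any single formula.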


5. DISCUSSION 


In this paper, we introduced the CommonLit Ease of Readability 
(CLEAR) corpus. The corpus provides researchers within the 
educational data mining community with a resource from which 
to develop and test readability metrics and to model text 
readability. The CLEAR corpus has a number of improvements 
over previous readability corpora, which are discussed below. 


First, the CLEAR corpus is much larger than any available 
corpora that provide readability criteria based on human 
judgments. While there are large corpora that provide leveled 
texts (e.g., The Newsela corpus), these corpora only provide 
indications of reading ability based on levels of simplification 
(i.e., beginning texts as compared to intermediate texts). The 
corpora do not provide readability criteria for individual texts. 
Individual reading criteria, like those reported in the CLEAR 
corpus, allow for the development of linear models of text 
readability. While there are other corpora that have reading 
criteria for individual texts, the corpora are much smaller (N = 
~20 - 600 texts), and they do not contain the breadth of texts 
found in the CLEAR corpus. The size of the CLEAR corpus 
ensures wide sampling and variance such that readability formulas 
derived from the corpus should be strongly generalizable to new 
excerpts. 


The breadth of excerpts found in the CLEAR corpus is an 
additional strength. The corpus was curated from the excerpts 
available on the CommonLit website, all of which have been 
specially leveled for a particular grade level. The CommonLit 
texts were supplemented by hand-selected excerpts taken from 
Project Gutenberg, Wikipedia, and dozens of other open digital 
libraries. The text excerpts were published over a wide range of 
years (1791-2020) and are representative of two genres commonly 
found in the K-12 classroom: informational and literary genres. 
The texts were read by experts to ensure they matched excerpts 
used in the K-12 classroom and checked for appropriateness using 
MPAA ratings. All texts were hand-edited, so that grammatical, 
syntactic, and spelling errors were limited, while punctuation was 
minimally standardized to honor the authors’ expression and style. 


A final strength is the reading criteria developed for the CLEAR 
Corpus. Previous studies have developed reading criteria based on 
cloze tests or multiple-choice tests, both of which may not 
measure text comprehension accurately [22]. Additionally, while 
many readability formulas are marketed for K-12 students, their 
readability criteria are based on a different population of readers. 
The best example of this is Flesch-Kincaid Grade Level, which 
was developed using reading tests administered to adult sailors. 
We bypass these concerns, to a degree, by collecting judgments 
from schoolteachers about how difficult the excerpts would be for 
their students to read. This provides greater face validity for our 
readability criteria, which should translate into greater predictive 
power for readability formulas developed on the CLEAR corpus. 


Lastly, while the purpose of the CLEAR corpus is for the 
development of readability formulas, the corpus includes meta- 
data that will allow for interesting and important sub-analyses. 
These analyses would include investigations into readability 
differences based on year of publication, genre, author, and 
standard errors, among many others. The sub-analyses afforded by 
the CLEAR corpus will allow a greater understanding of how 
variables beyond just the language features in the excerpts 
influence text readability. 


6. FUTURE DIRECTIONS 


The next step for the CLEAR corpus is an online data science 
competition to promote the development of new open-science 
readability formulas. The competition will be hosted within an 
online community of data scientists and machine learning 
engineers who will enter a competition to develop readability 
formulas using only the reading excerpts and the reported 
standard errors to predict the Bradley-Terry ease of reading 
coefficient scores. Prize money will be offered to increase the 
likelihood of participation. Once winners from the competition are 
announced, the winning readability formulas will be included in 
ARTE so that access to the formulas is readily available to 
teachers, students, administrators, and researchers. ARTE will 
also be expanded to include an online interface and a functional 
API. The online interface will allow end-users to easily upload 
texts to analyze for readability to better match texts to readers. 
The API will allow other educational technologies to include text 
readability formulas in their systems to help select texts for online 
students. 


ABSTRACT 


We present an algorithm using interpretable convolutional 
neural networks for mining sequential patterns from event 
log data. The key to our approach is utilizing structured reg- 
ularization to achieve sparse parameter values that closely 
resemble the results of typical pattern mining algorithms, 
and allow the learned convolution filters to be interpreted 
easily. Our method can handle both sequences of individual, 
unique elements and concurrent multiple-element sequences, 
which represents most situations where sequences may occur 
in logs of student actions. We applied our structured reg- 
ularization method to a self-supervised problem predicting 
future actions from past actions in two different educational 
datasets as example applications. Furthermore, we gener- 
ated features from the learned patterns to evaluate the util- 
ity of patterns and trained a supervised model with these 
features to predict academic outcomes via transfer learning. 
Our algorithm improves the correlation of sequences with 
outcomes by an average of r = .131 on one dataset and r = 
.101 on the other dataset versus a traditional sequential pat- 
tern mining algorithm. Finally, we visualize the extracted 
patterns and demonstrate that they can be interpreted as a 
sequence of actions. 


Keywords 
Interpretability, pattern mining, convolutional neural net- 
works, sequential data 


1. INTRODUCTION 


Convolutional neural networks (CNNs) have been success- 
fully applied to various applications in educational data min- 
ing [2, 9, 15, 16, 22, 24]. However, CNNs have a major 
shortcoming in terms of transparency, because they typi- 
cally form “end to end” models that make high-level infer- 
ences from low-level inputs through a series of opaque layers. 
Thus, resulting models are often hard to understand and 
interpret. As a consequence, neither instructors nor students 
know what kinds of student behaviors actually impact 
predictions, which is important for understanding and 
supporting students’ learning behaviors and instructors’ 
regulation of student learning. Our work addresses this 
interpretability issue with CNNs to provide useful features for 
student modeling applications that utilize CNNs. 


Lan Jiang and Nigel Bosch “Predictive Sequential Pattern Mining via 
Interpretable Convolutional Neural Networks”. 2021. In: Proceedings 
of The 14th International Conference on Educational Data Mining 
(EDM21). International Educational Data Mining Society, 761-766. 
https://educationaldatamining.org/edm2021/ 

EDM ’21 June 29 - July 02 2021, Paris, France 


Nigel Bosch 
University of Illinois Urbana-Champaign 
Champaign, IL, USA 
pnb@illinois.edu 


One particular application that requires understanding the 
patterns learned by a model (or any method) is sequential 
pattern mining. The typical approach to sequential pattern 
mining is to identify a set of elements that frequently 
co-occur; in sequence data, that corresponds to finding a set 
of events, items, locations, etc. that often happen sequen- 
tially in the data [1, 3, 5, 6, 7, 8, 13, 14, 23, 25]. However, 
those methods of pattern discovery suffer from problems like 
pattern explosion [21], which occurs when the number of fre- 
quent patterns is myriad and the importance (or usefulness) 
of patterns is uncertain. Consequently, the large number of 
patterns and prioritization of common patterns can be es- 
pecially problematic in educational data where low-support 
patterns may be of interest (e.g., when examining uncom- 
mon patterns specific to students from underrepresented de- 
mographic groups [20]). In addition, existing methods do 
not consider the context of a pattern. We aim to train convo- 
lutional neural networks that have inherently interpretable 
features (i.e., discrete absence/presence of a specific student 
event, like watching a video or posting to a discussion forum) 
and enforce learning of patterns that predict context. We 
can thus interpret and utilize these patterns in downstream 
tasks in the same way that patterns mined via sequence 
mining methods are. 


This paper aims to train a CNN with self-supervised learn- 
ing to produce a model that can predict future actions based 
on sequences of events, thereby encouraging the model to 
learn predictive sequences (rather than frequent, unique, or 
other criteria). The features learned by self-supervised neu- 
ral networks can be shared by various downstream tasks, as 
has been demonstrated in previous research. In this paper, 
we report results from a CNN trained to predict future stu- 
dent activities from past activities, and which thus captures 
event dependencies. 


To show the effectiveness of the proposed method, we applied 
it to two large datasets of student actions logged in learning 
environments, and utilized transfer learning to predict 
student outcomes with features derived from the discovered 
patterns. Specifically, after learning the patterns from stu- 
dent data, we generated feature representations from each 
pattern and trained a supervised model to predict students’ 
grade outcomes. We demonstrate that in several cases our 
results outperformed and were more stable than a typical 
sequential pattern mining method. 


In summary, our contributions include two parts: 


1. We trained interpretable CNN filters to explicitly learn 
patterns consisting of either mutually-exclusive (unique) or 
concurrent (co-occurring) elements (i.e., actions). 


2. We evaluated the quality of patterns learned with our 
method in a transfer learning task involving prediction of 
students’ outcomes in two datasets. 


2. APPROACH 


Our goal is to find frequent, predictive patterns with fixed 
length given sequences of actions done by students (or events, 
or items). Each step can contain either a single unique action 
or multiple concurrent actions (depending on the dataset), 
which can be regarded as sequential actions, events, activ- 
ities, or other categorical values. In this section, we first 
explain the framework of the unique event pattern detector, 
which is the simplest case and perhaps the most widely- 
applicable. We then describe the solution for the multiple 
concurrent events pattern detector, as an illustration of how 
the unique-element approach can be generalized to other 
variations of the problem. We also describe a “warm-up” 
strategy, which is necessary to effectively train the pattern 
mining models. During the evaluation phase, we then de- 
rive features from the extracted patterns and apply them to 
a supervised student outcome prediction task as a measure 
of the quality of the patterns. 


2.1 Unique Element Patterns Detector 

We begin with the representation for each action and propose 
our pattern detection model for unique element sequence 
mining. At the first stage, we use one-hot encoding 
to represent each action that was taken by students. After 
that, we use a one-dimensional CNN (i.e., convolving only 
over time), without bias weights, to extract patterns of ac- 
tion subsequences. We constrain the parameters of CNN 
filters to directly impose an interpretable, discrete structure 
on the weights. To predict future actions, we append a fully- 
connected layer and sigmoid function. 
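
As a rough illustration of the first two stages described above, the following sketch one-hot encodes a short action sequence and slides a single bias-free filter over time with NumPy. The action vocabulary, the example sequence, and the helper names are illustrative assumptions rather than the authors' implementation, and the fully-connected prediction layer is omitted.

```python
import numpy as np

# Hypothetical action vocabulary (illustrative only).
ACTIONS = ["study", "texteditor", "deeds", "other"]


def one_hot(sequence, actions=ACTIONS):
    """Encode a list of action names as a (time, d) binary matrix."""
    index = {name: i for i, name in enumerate(actions)}
    encoded = np.zeros((len(sequence), len(actions)))
    for t, name in enumerate(sequence):
        encoded[t, index[name]] = 1.0
    return encoded


def conv1d_over_time(x, kernel):
    """Slide a (k, d) kernel over a (time, d) input with stride 1 and no bias."""
    k = kernel.shape[0]
    steps = x.shape[0] - k + 1
    return np.array([np.sum(x[t:t + k] * kernel) for t in range(steps)])


# A filter whose rows are one-hot encodes the pattern study -> texteditor -> deeds.
pattern = one_hot(["study", "texteditor", "deeds"])
sequence = one_hot(["other", "study", "texteditor", "deeds", "other"])
activations = conv1d_over_time(sequence, pattern)
```

With a filter whose rows are one-hot, each activation counts how many steps of the window match the pattern, so it reaches the filter length only on an exact match. This is what makes such filters readable as action sequences.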


The crux of our approach to discovering interpretable pat- 
terns of specific actions is to force each row (corresponding 
to one step in a sequence) in the CNN filters to have only 
one parameter that is close to 1, while all others are close 
to 0. To achieve the desired weight structure, we applied 
regularization to CNN filter parameters as part of training. 


In our method, the primary training objective is to minimize 
the binary cross-entropy loss for predicting future actions. 
To enforce discrete structure of the filters of CNN, we utilize 
regularization to force the sum of each row of the parameters 
of each CNN filter to 1, while most parameters are 0, thereby 
leaving only a single 1 corresponding to a single action. We 
split the approach into two steps. The first step is to ensure 
the sum of the parameters in each row is close to 1, by adding 
the loss: 


L_{r\_sum} = \sum_{p=1}^{M} \sum_{n=1}^{k} \left(1 - \sum_{j=1}^{d} W_{p,n,j}\right)^2    (1) 


where W refers to the weights of the convolutional neural network, 
d is the number of possible actions (i.e., the size of each one-hot 
encoded vector), k is the number of sequence steps in 
each CNN kernel (i.e., the length of pattern to learn), and 
M is the number of filters in the CNN (i.e., the number of 
patterns to learn). 


The second step is to encourage parameters to go toward 0 
via row-wise L1 loss, leaving only one parameter close to 1 
to minimize the row sum loss L_{r\_sum}: 


L_{r\_l1} = \sum_{p=1}^{M} \sum_{n=1}^{k} \sum_{j=1}^{d} |W_{p,n,j}|    (2) 


Finally, we optimize the following joint objective function 
during training: 


L = L_{prediction} + \alpha L_{r\_sum} + \beta L_{r\_l1}    (3) 


where L_r = \alpha L_{r\_sum} + \beta L_{r\_l1} is the whole structured 
regularization loss, and \alpha and \beta are coefficients for each 
regularization part, included to balance the contrasting 
minimization objectives of L_{r\_sum} and L_{r\_l1}. 
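
The regularization terms in Equations (1) to (3) can be sketched in a few lines of NumPy. This is a minimal illustration, assuming the row-sum penalty is a squared deviation from 1 and that filter weights are stored in an array of shape (M, k, d); the function names are hypothetical and not taken from the authors' code.

```python
import numpy as np


def row_sum_loss(W):
    """Equation (1): squared deviation of each filter row's sum from 1.

    W has shape (M, k, d): M filters, k sequence steps, d possible actions.
    """
    return float(np.sum((1.0 - W.sum(axis=2)) ** 2))


def l1_loss(W):
    """Equation (2): sum of absolute values of all filter weights."""
    return float(np.sum(np.abs(W)))


def joint_objective(prediction_loss, W, alpha, beta):
    """Equation (3): prediction loss plus the weighted regularization terms."""
    return prediction_loss + alpha * row_sum_loss(W) + beta * l1_loss(W)


# One filter (M=1) with k=2 steps over d=3 actions.
one_hot_filter = np.array([[[0.0, 1.0, 0.0], [1.0, 0.0, 0.0]]])
zero_filter = np.zeros((1, 2, 3))
```

The two terms pull in opposite directions: the all-zero filter minimizes the L1 term but is penalized by the row-sum term, while the one-hot filter satisfies the row-sum term at a small L1 cost. This is why the coefficients have to be balanced.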


2.2 Multiple Concurrent Elements Pattern Detector 

The limitation of the unique action detector is that it can 
only handle situations where each step in the sequence con- 
tains exactly one action. In some circumstances, each step 
contains many actions or events, such as when a student does 
several activities logged with the same timestamp. We ex- 
tended our approach to handle this condition, following the 
model proposed in the previous section with different constraints. 
Specifically, we force each CNN kernel weight to be 
either close to 0 or close to 1, ignoring the sum of all weights 
and thus allowing multiple actions per step. To achieve this 
goal, the regularization loss for each parameter is minimized 
when the parameter is either 0 or 1. 


We operationalized this regularization goal via the following 
quadratic equation: 


L_{rm} = \sum_{p=1}^{M} \sum_{n=1}^{k} \sum_{j=1}^{d} |W_{p,n,j}^2 - W_{p,n,j}|    (4) 


Overall, the objective function is: 


L = L_{prediction} + \gamma L_{rm}    (5) 


Here, \gamma serves as a weight we tuned to ensure that the 
structured regularization loss L_{rm} has the desired effect on 
the CNN weights without over-emphasizing regularization 
relative to the prediction loss. 
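
To make the behavior of the penalty in Equation (4) concrete, the sketch below evaluates it on two small weight arrays. The array shape (M, k, d) and the function name are assumptions made for illustration.

```python
import numpy as np


def binary_weight_loss(W):
    """Equation (4): sum over all weights of |W^2 - W|, zero only at 0 or 1."""
    return float(np.sum(np.abs(W ** 2 - W)))


# A step may now activate several actions at once (second row has two 1s).
multi_hot_filter = np.array([[[1.0, 0.0, 0.0], [1.0, 1.0, 0.0]]])
undecided_filter = np.full((1, 2, 3), 0.5)

loss_multi_hot = binary_weight_loss(multi_hot_filter)
loss_undecided = binary_weight_loss(undecided_filter)
```

Any filter made only of 0s and 1s has zero penalty regardless of how many 1s appear in a row, whereas weights of 0.5 are penalized most heavily within the interval from 0 to 1.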


The prediction loss for our multiple concurrent elements 
example is binary cross-entropy with logits, though other 
loss functions could be applied. 


2.3 Warm-up Period 

Table 1: Performance comparison (Pearson’s r correlation coefficient between predicted and actual student grades) of our 
method versus CM-SPAM on two datasets. The EPM dataset has grades for five learning sessions (labeled 2-6), while OULAD 
has grades for seven courses (labeled A-G). Results without warm-up and structured regularization are provided as points of 
comparison, though the CNN filters without structured regularization are not interpretable. 


                  EPM                                OULAD 
Course            2      3      4      5      6      A      B      C      D      E      F      G 
CM-SPAM [5]       -.050  .603   .139   .134   .333   .318   .341   .341   .440   .381   .381   .510 
Without warm-up   -.032  .729   .425   .055   .430   .324   .414   .514   .433   .456   .394   .544 
Our approach      -.092  .792   .432   .227   .450   .330   .461   .532   .500   .498   .433   .563 
Traditional CNN   -.199  .672   .518   .209   .375   .365   .416   .547   .511   .506   .422   .550 


Frequently, models learned a local optimum where regularization 
losses were immediately optimized. To avoid getting 
stuck at local optima of the objective, we introduced a 
“warm-up” period to stabilize training [17]. In our experi- 
ment, we trained the model without regularization loss for 
five epochs. Subsequently, we linearly increased the coeffi- 
cient of the regularization loss over the course of ten epochs. 
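
One possible reading of this schedule is sketched below: the coefficient is zero for the first five epochs and then rises linearly to its target over the next ten. The function name, the 0-indexed epochs, and the exact step at which the ramp starts are assumptions; the target value 0.075 is used only as an example.

```python
def regularization_coefficient(epoch, target, warmup_epochs=5, ramp_epochs=10):
    """Return the regularization coefficient to use at a 0-indexed epoch."""
    if epoch < warmup_epochs:
        return 0.0  # warm-up: train on the prediction loss only
    # Fraction of the linear ramp completed, capped at 1 once the ramp ends.
    progress = (epoch - warmup_epochs + 1) / ramp_epochs
    return target * min(1.0, progress)


schedule = [regularization_coefficient(e, target=0.075) for e in range(50)]
```

Under these assumptions the coefficient stays at its full value from the fifteenth epoch onward.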


2.4 Features for Transfer Learning 

After learning the set of predictive patterns, we evaluated 
the utility of learned patterns for a subsequent prediction 
task (i.e., transfer learning with the learned patterns). We 
froze the weights of the CNN, then applied the network to 
generate pattern features for each student’s sequence of ac- 
tions. Note that each sequence is typically much longer than 
the number of steps in each CNN kernel. Thus, we aggre- 
gated filter activations for each pattern by applying basic 
statistical calculations, including sum, standard deviation, 
max, min, skew, kurtosis, and different quantiles (10%, 30%, 
50%, 70%, 90%). 


We then concatenated all of these aggregated values of all 
extracted patterns to create feature vectors. As a means 
to judge the quality of the pattern features, we predicted 
students’ learning outcomes with a random forest regression 
model [4]. 
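
The aggregation step can be sketched as follows, assuming each student's filter activations are available as a (filters, steps) array. The helper names are hypothetical, and the population definitions of skewness and excess kurtosis are an assumption, since the text does not specify which estimators were used.

```python
import numpy as np

QUANTILES = (0.10, 0.30, 0.50, 0.70, 0.90)


def aggregate_activations(activations):
    """Summarize one filter's activations over a sequence as 11 statistics."""
    a = np.asarray(activations, dtype=float)
    mean, std = a.mean(), a.std()
    if std > 0:
        z = (a - mean) / std
        skew, kurt = np.mean(z ** 3), np.mean(z ** 4) - 3.0
    else:
        skew, kurt = 0.0, 0.0  # undefined for constant input; use 0 by convention
    stats = [a.sum(), std, a.max(), a.min(), skew, kurt]
    stats.extend(np.quantile(a, QUANTILES))
    return np.array(stats)


def student_features(activation_matrix):
    """Concatenate statistics for every filter; input shape is (filters, steps)."""
    return np.concatenate([aggregate_activations(row) for row in activation_matrix])


# Two filters applied to one student's sequence of six windows.
example = np.array([[0, 3, 0, 1, 3, 2], [1, 1, 1, 1, 1, 1]], dtype=float)
features = student_features(example)
```

Each filter contributes 11 values, so 25 filters would yield a 275-dimensional feature vector per student, which could then be passed to any regression model.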


3. EXPERIMENTS 


In this section, we first introduce the details of two datasets 
and a baseline pattern discovery algorithm, against which 
we compare our proposed method (Table 1). We use vi- 
sualization to examine the learned patterns, and compare 
transfer learning predictions of student outcomes via Pear- 
son correlations. Finally, we discuss the convergence of our 
method. 


3.1 Datasets 


We work on two public datasets that contain learning behav- 
iors of students represented by actions from different courses. 


Educational Process Mining (EPM). The EPM dataset [19] 
contains sequential records of 100 students’ activities dur- 
ing 6 laboratory sessions (5 of which have outcome labels) 
of the digital design course at the University of Genoa. Ac- 
tions were logged in sequential order, such that each row 
represented a unique action taken by a student. We describe 
the activities included in the EPM dataset, including their 
frequency, in the Appendix. 

Open University Learning Analytics Dataset (OULAD). The 
OULAD dataset [12] contains data about courses, students, 
actions of students, and their interactions with a virtual 
learning environment (VLE; specifically, Moodle) for seven 
courses, which started in either February or October. We 
merged multiple semesters of the same course because the 
patterns in the same courses should be relatively (if not 
exactly) consistent. The details of the interaction events 
included in the dataset are shown in the Appendix (we merged 
some infrequent interactions into an “other” category because 
these interactions occurred rarely). 


3.2 Baseline Comparison Method 

Typical sequential pattern mining algorithms include those 
like CM-SPAM, GSP [18], PrefixSpan [8], and SPADE [25]. 
We use CM-SPAM as a baseline method here because it can 
easily find patterns of a specific length, which allows fair 
comparison to our proposed method. 


CM-SPAM [5] is a sequential pattern mining algorithm based 
on Sequential PAttern Mining (SPAM; [3]). SPAM is a 
depth-first sequential frequent pattern search algorithm. 
CM-SPAM prunes the SPAM search space to improve 
computational complexity. We focused on mining patterns with the 
highest support and matched the length of patterns in our 
method, selecting 25 of the highest-support patterns to 
compare against the 25 patterns learned by our method. 
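
To illustrate what ranking patterns by support means, the sketch below counts support by brute force and keeps the top-k patterns of a fixed length. It is not CM-SPAM (it has no bitmap representation or co-occurrence pruning), and allowing gaps between pattern elements is an assumption; the toy logs are invented for the example.

```python
from collections import Counter
from itertools import combinations


def top_k_patterns(sequences, length, k):
    """Rank all length-`length` subsequences (gaps allowed) by support.

    Support is the number of sequences containing the pattern at least once.
    Brute force: only practical for short sequences.
    """
    support = Counter()
    for seq in sequences:
        # A set so each pattern is counted once per sequence.
        found = {tuple(seq[i] for i in idx)
                 for idx in combinations(range(len(seq)), length)}
        support.update(found)
    # Sort by descending support, then alphabetically for a stable order.
    return sorted(support.items(), key=lambda item: (-item[1], item[0]))[:k]


logs = [
    ["study", "texteditor", "deeds"],
    ["study", "deeds", "texteditor", "deeds"],
    ["aulaweb", "study", "deeds"],
]
best = top_k_patterns(logs, length=2, k=2)
```

Selecting by support alone favors common patterns, which is the property the predictive approach in this paper is contrasted against.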


3.3 Experimental Setup 

We optimized CNN models with Adam [10] for 50 epochs. 
We tuned hyperparameters including learning rate, loss co- 
efficients, and warm-up duration, based only on results in 
OULAD course A, to avoid over-fitting hyperparameters to 
the other six OULAD courses or the EPM dataset. We left 
hyperparameters related to the structure of the input and the 
model architecture fixed. Specifically, we convolved CNN filters 
of length 3 over subsequences of current events of length 
5, with stride length 1, and predicted the next 1 event. 
Models had 25 convolution filters (patterns to learn). We 
concatenated convolution filter outputs and used a fully- 
connected layer with sigmoid activation for predicting the 
next action. We found that the model worked best with the 
learning rate set to .001, after testing .01, .0075, .005, .0025, 
and .001. 
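
The input structure described above (windows of 5 events used to predict the next event) can be sketched as a simple sliding window. The function name and the toy event labels are illustrative assumptions.

```python
def make_training_pairs(sequence, window=5, horizon=1):
    """Split one student's event sequence into (past window, next events) pairs."""
    pairs = []
    for start in range(len(sequence) - window - horizon + 1):
        past = sequence[start:start + window]
        future = sequence[start + window:start + window + horizon]
        pairs.append((past, future))
    return pairs


events = ["a", "b", "c", "d", "e", "f", "g"]
pairs = make_training_pairs(events)
# A length-3 filter with stride 1 fits in 5 - 3 + 1 = 3 positions per window.
positions_per_window = 5 - 3 + 1
```

Each window therefore yields three activations per filter, which are concatenated across the 25 filters before the fully-connected layer.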


Components of the structured regularization loss have notably 
different magnitudes for the unique action case, since 
the L1 loss component (L_{r\_l1}) is several times larger than 
the filter row sum component (L_{r\_sum}). We thus applied a 
relatively large weight for L_{r\_sum} and a small weight for 
L_{r\_l1} to balance regularization terms and achieve the desired 
weight structure. We tried different ratios of \alpha/\beta, 
including 1, 10, 20, 30, 40, 50, 60, 70, 100, and 200. With 
\alpha = 0.075 and \beta = 0.0075, loss converged well. For the 
warm-up procedure, after testing 1 epoch, 5 epochs, and 10 epochs, 
we found the model converged best with n_{total} = 5 epochs. 


For evaluating pattern utility via transfer learning with ran- 
dom forest regression, a higher correlation score (Pearson’s r 
ranging from -1 to 1) represents patterns with higher utility 
for the downstream task. 


3.4 Performance Analysis 

Outcome prediction is a natural way to evaluate the utility of 
patterns [11]. We did so using the transfer learning approach 
described above, and split each course from each dataset into 
a train/test set at the student level with a ratio of 2:1 for 
evaluation. 
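
A minimal sketch of this evaluation protocol is shown below: students (not individual actions) are split 2:1, and predictions are scored with Pearson's r. The function names, the random seed, and the toy numbers are assumptions made for illustration.

```python
import numpy as np


def split_students(student_ids, train_ratio=2 / 3, seed=0):
    """Shuffle unique student ids and split them roughly 2:1 into train/test."""
    rng = np.random.default_rng(seed)
    ids = np.array(sorted(set(student_ids)))
    rng.shuffle(ids)
    cut = int(round(len(ids) * train_ratio))
    return set(ids[:cut].tolist()), set(ids[cut:].tolist())


def pearson_r(predicted, actual):
    """Pearson correlation between predicted and actual grades."""
    p = np.asarray(predicted, dtype=float)
    a = np.asarray(actual, dtype=float)
    p_centered, a_centered = p - p.mean(), a - a.mean()
    return float(np.sum(p_centered * a_centered)
                 / np.sqrt(np.sum(p_centered ** 2) * np.sum(a_centered ** 2)))


train_ids, test_ids = split_students(range(90))
r = pearson_r([1.0, 2.0, 3.0, 4.0], [2.0, 4.0, 6.0, 8.0])
```

Splitting at the student level keeps all of a student's actions on one side of the split, so the test correlation reflects generalization to unseen students.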


3.4.1 Quantitative Results 

The results of our method are shown in Table 1. Students’ 
learning actions contain meaningful sequential dependencies, 
which frequently occurring patterns do not naturally capture. 
Consequently, CM-SPAM patterns were slightly less useful for 
inferring high-level information (predicting student outcomes), 
as shown in Table 1. Additionally, instructors may be able to 
interpret the 
patterns extracted by our methods to intervene in future 
courses, given that the extracted patterns are few enough 
in number (25) to manually review and are related to out- 
comes. Generally, our approach outperformed traditional 
sequential pattern mining for almost all courses across the 
two datasets. Our approach improved the correlation of se- 
quences with outcomes by an average of r = .131 on the 
EPM dataset, and was as good or better than CM-SPAM 
(r improved by .101) on the OULAD dataset. The result 
confirmed the usefulness of our predictive patterns derived 
via self-supervised learning. 


3.4.2 Pattern Visualization and Analysis 

We visualize the patterns extracted by our approach, demon- 
strate that they have the desired structure, and compare 
them with the traditional CNN filters. 


In contrast to the patterns learned by our method, typical 
densely-distributed CNN weights lend little insight into the 
specific sequences of actions that activate filters (shown in 
Figure 1). These CNN patterns conflate selection of relevant 
input actions with weighting those patterns, which prevents 
their use as a sequence mining method. In addition, typical 
CNNs only extract patterns that correlate with student 
outcomes (in fully supervised applications). As a result, they 
do not necessarily learn dependencies among students’ 
actions; it remains to be seen whether our method applied 
in a fully-supervised model would produce notably different 
sequences of behaviors. Regardless, the patterns learned by 
our method are interpretable relative to typical CNN filter 
weights, which makes them straightforward to utilize for 
other downstream tasks even though they come from a 
self-supervised model. However, patterns with multiple concurrent 
actions are still less easily interpreted than patterns for 
unique actions given the possibility of many different 
concurrent actions. They are, however, still straightforward to 
transfer to downstream tasks. 


[Figure 1 image not recoverable. Axis labels include sequence steps T1, T2, T3 and actions such as aulaweb, blank, deeds, properties, texteditor.] 

Figure 1: The bottom two rows are randomly-chosen example 
patterns extracted by our approach based on EPM course 2. 
The top two rows are traditional CNN filters. 


4. CONCLUSION 


In this paper, we presented a general self-supervised se- 
quence mining algorithm that works for both sequences of in- 
dividual actions and multiple concurrent actions. We mined 
sequential patterns with convolutional neural networks, and 
applied transfer learning to judge the quality of the ex- 
tracted patterns for predicting student outcomes as an ex- 
ample downstream prediction task. Our results showed that 
the extracted patterns were indeed useful, as measured by 
the correlation between predictions and student outcomes. 


We empirically demonstrated that the patterns extracted by 
our method have similar or higher utility for two prediction 
tasks than those extracted via a traditional frequent pattern 
mining algorithm, while the extracted patterns can still be 
easily interpreted. Furthermore, our approach deals with 
common pattern mining problems like pattern explosion by 
training a fixed number of convolutional filters, where filters 
are selected from the space of all possible filters via stochas- 
tic gradient descent. 


In summary, our approach is a novel and interpretable way 
to extract predictive patterns of actions from sequential data. 

Table 2: Description of actions in the OULAD dataset. Infrequent actions were grouped together into an other category, with 
the exception of transfer given that it is one of the most semantically important, along with register and unregister. 


Action          Description                                                       Frequency 
homepage        Visit the main course page                                        1,735,226 
gap             One or more consecutive days with no action                       860,356 
oucontent       View course content page                                          829,476 
forumng         Discussion forum usage                                            822,895 
subpage         Manage/view course activities on a page other than the homepage   804,577 
resource        Download a document from the course                               399,961 
url             Click a link to an external site                                  314,240 
quiz            Take a quiz                                                       211,497 
exam            Take an assessment                                                160,498 
ouwiki          Access the course wiki                                            89,406 
register        Register for the course                                           32,548 
unregister      Drop the course                                                   10,072 
transfer        Transfer grade from previous session (semester)                   526 
Infrequent activities grouped together as “other” 
page            Non-interactive information page                                  47,549 
oucollaborate   Audio/video conferencing                                          47,334 
externalquiz    Externally-hosted quiz                                            41,642 
glossary        View course glossary                                              17,258 
questionnaire   Access survey form                                                15,109 
ouelluminate    Audio-only conferencing                                           11,384 
dualpane        Side-by-side view of instructions and related content             9,256 
dataplus        Interact with a toy SQLite database                               6,818 
htmlactivity    Interactive HTML page                                             6,016 
folder          View folder containing related activities                         4,678 
sharedsubpage   View page shared from another course                              148 
repeatactivity  Activities repeated from earlier in the course                    3 


Table 3: Description of activities in the EPM dataset. Internal activity names from the EPM dataset are provided to enable 
unambiguous matching to the original data. 


Action      Description                                                               Frequency 
texteditor  Use a text editor                                                         42,431 
deeds       Other DEEDS (Digital Electronics Education and Design Suite) activities   38,372 
other       Not viewing any pages described above (mostly off-task activities)        33,602 
blank       Title of visited page is not recorded                                     24,303 
study       View exercises or materials related to courses                            22,261 
diagram     Use a “simulation timing diagram” to test a solution                      20,815 
fsm         Use a finite state machine (FSM) simulator                                20,596 
properties  Set parameters of a simulation or design                                  19,677 
aulaweb     Visit learning management system                                          8,261 


ABSTRACT 


Recent studies have shown a relationship between the complexity of 
university curricula and graduation rates. As a result, extensive 
efforts have been made to restructure curricula in order to improve 
graduation rates. In this paper, we propose a new model for 
evaluating and quantifying the impact of restructuring curricula on 
graduation rates using a Bayesian network framework. We validate 
our model by analyzing a common curricular pattern found in most 
engineering programs. We demonstrate its usefulness using actual 
data for students at the University of New Mexico. We also extend 
this model to include a tool that can be used to predict student 
performance. The advantage of our work lies in its data-driven 
nature, which makes it more reliable than other proposed models. 


Keywords 
Curricular analytics, Bayesian networks, education, curriculum com- 
plexity, student success, graduation rate 


1. INTRODUCTION 


Recently, a significant amount of work has been done on curricular 
analytics to show its impact on student success. This work mainly 
highlights the importance of curricular structure for student 
performance, characterized primarily by graduation and retention 
rates. These studies argue that the complexity of prerequisite 
dependencies between the requirements of a curriculum can increase 
the risk of students stopping out and eventually dropping out of 
school. Much recent work has also identified factors that help 
students stay in school and hence graduate faster (6). These include 
new pedagogical styles, dorms, flipped classrooms, learning centers, 
etc. However, such factors alone might fail to contribute 
significantly to student success if other institutional factors are 
overlooked, most notably curricular structure. As mentioned earlier, 
prior work has shown that the structure of a curriculum has a direct 
impact on student success (3). 

In this regard, Klingbeil and Bourne considered a case study 
and analyzed the structure of a common curricular pattern found in 
most engineering programs (Figure 1) (5). They noticed that, in the 
sophomore year, most engineering programs require Differential 
Equations as a prerequisite to a domain-specific course (in 
electrical engineering programs, Circuits I is domain specific; in 
mechanical engineering, Mechanics (statics and dynamics) is domain 
specific; etc.). They also noticed that the learning outcomes in 
Differential Equations, except for solving linear differential 
equations, are not necessarily required to pass the domain-specific 
course. Thus they suggested creating a new course in the freshman 
year to teach how to solve linear differential equations along with 
the Precalculus material. They called this new course Engineering 
101. As a result of this observation, they pointed out that 
Differential Equations is no longer required as a prerequisite for 
the domain central course; only Engineering 101 is. This resulted in 
the revised curricular pattern shown in Figure 2. Klingbeil and 
Bourne claim that this new curricular pattern will help students 
graduate in a timely fashion, because students are no longer 
required to follow the long chain of prerequisites before they are 
allowed to take a domain-specific course, as is the case in the 
original curricular pattern. This new pattern is now pursued by a 
number of universities (5). 

To validate such changes to curricular patterns, a number of 
researchers developed mathematical models that demonstrate the 
significance of these changes for student success outcomes. Slim et 
al. proposed a metric that quantifies the complexity of any 
curricular pattern. Their metric showed that the revised engineering 
pattern shown in Figure 2 is less complex than the original one 
shown in Figure 1 (2). Thus, according to their metric, students are 
expected to finish their degrees at a faster pace. Furthermore, 
Hickeman showed that the revised engineering pattern can 
significantly improve graduation rates (3). He demonstrated this with 
a Monte Carlo simulation in which virtual students flow through a 
curricular pattern. For more details on different models see (2). 
Although these models lay a mathematical foundation for the 
advantage of restructuring curricula, they still have a major 
limitation: none of them include actual student data in their 
implementations. In other words, these models only show the 
advantage of restructuring curricular patterns in the abstract, 
without using any data-driven approach. This makes the models less 
reliable in establishing the need to revise any curricular pattern. 

In this paper we propose a data-driven model that uses actual 
student data to achieve this. In particular, we use a Bayesian 
Network (BN) model to statistically assess the validity of any 
effort to restructure 
a curricular pattern. Restructuring mainly consists of adding and 
removing courses and prerequisites within a curricular pattern, 
which can be neatly captured using a BN. The main motivation behind 
our proposed model is the ability to use hidden (latent) variables 
in building a BN. These hidden nodes represent newly added courses in 
the revised curricular pattern. Thus they can be used to check the 
validity of the changes made to the original curricular pattern and 
accordingly decide whether to apply these changes or not. It is 
important to note that our model can be generalized to find the 
optimal structure of a revised curricular pattern using different 
methods of structural learning for a BN (1). However, we do not 
cover this part in this paper and leave it for future work. 

The remainder of this paper is structured as follows. In Section II 
we present the details of our proposed framework and provide a case 
study. In Section III we present a number of applications for our 
proposed model and show some simulation results. Finally, Section 
IV presents some concluding remarks. 


[Figure 1 image: course nodes labeled Precalc, Calculus I, Calculus II, Diff. Eqs, and Central Course] 

Figure 1: The original curricular pattern found in most 
engineering programs. The nodes represent the courses and the 
directed edges represent the prerequisite dependencies. 


Figure 2: The revised curricular pattern. 


2. BAYESIAN NETWORKS AND HIDDEN 
VARIABLES 


A BN is a directed acyclic graph (DAG) representing correlations 
between a number of random variables (4). The graph structure 
provides a compact representation of the joint probability 
distribution underlying these variables. The presence of an edge 
between two nodes indicates the existence of a relationship between 
two variables, and the direction of the edge indicates the direction 
of the causal relationship. That is, a directed edge from node A to 
node B indicates that A causes B. In BNs, these relationships are 
quantified by conditional probability tables (CPTs). The main 
feature of a BN is its compact representation of the conditional 
dependencies of the random variables. However, in some applications 
the BN becomes complicated, and in this case adding hidden nodes is 
useful for two main reasons (8): 


1. Knowledge discovery: reveals interesting relationships among 
the variables of the data 

2. Lower complexity: attains lower structure complexity of the 
network 


For example, consider the case where we observe a set of variables 
representing a patient’s symptoms. The joint probability 
distribution for these symptoms might be highly connected. In this 
case, the BN representation of these symptoms would be highly 
complicated. However, if we introduce a “cause” node representing 
the underlying disease for these symptoms, then we can get a 
noticeably simpler network. We call this cause node a hidden node. 
Figure 3 (inspired by (7)) illustrates this example in more detail. 


Figure 3: Two DAGs (one with a hidden node and the 
other without) representing the relationship between symptoms, 
causes and mediating factors. Symptoms, such as chest 
pain, are represented by the leaves. Causes, such as smoking 
and diet, are represented by the roots. Mediating factors, such 
as heart disease, are represented by the hidden nodes. This figure 
shows how hidden nodes can reveal a better understanding 
of the relationship between variables by attaining a lower 
structure complexity of the network. 


In a similar scenario in an educational context, a curricular 
pattern can be modeled as a BN. A node may be a course, and the 
states of the node would be the possible letter grades (i.e., A, B+, 
B, C+, etc.). A directed edge from course A to course B indicates 
that the performance in A influences that in B. In this context, 
adding a hidden node to the BN is equivalent to adding a new course 
to a curricular pattern. This hidden node might represent the 
underlying prerequisite course that needs to be taken prior to 
taking other courses. This notion constitutes the bulk of our 
proposed work, and the subsequent sections elaborate on this idea. 
Following this notion, restructuring the BN of any curricular 
pattern would include these steps: 


• Removing an existing course(s) (i.e., removing a node) 

• Adding a new course(s) (i.e., adding a hidden node) 

• Removing an existing prerequisite(s) (i.e., removing a 
directed edge) 

• Adding a new prerequisite(s) (i.e., adding a directed edge) 
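
The four restructuring steps above can be sketched as operations on a simple graph structure. In this illustration the BN structure is a dictionary mapping each course to its set of prerequisites; the helper names, the assumed chain shape of the original pattern, and the placement of the new course are hypothetical, and CPTs are not modeled.

```python
def is_acyclic(parents):
    """Check that a {node: set_of_parents} mapping describes a DAG."""
    visiting, done = set(), set()

    def visit(node):
        if node in done:
            return True
        if node in visiting:
            return False  # reached a node already on the current path: cycle
        visiting.add(node)
        ok = all(visit(p) for p in parents.get(node, ()))
        visiting.discard(node)
        done.add(node)
        return ok

    return all(visit(n) for n in parents)


def add_course(parents, course):
    parents.setdefault(course, set())


def remove_course(parents, course):
    parents.pop(course, None)
    for prereqs in parents.values():
        prereqs.discard(course)


def add_prerequisite(parents, prereq, course):
    parents[course].add(prereq)
    if not is_acyclic(parents):
        parents[course].discard(prereq)  # roll back to keep the graph a DAG
        raise ValueError("prerequisite would create a cycle")


def remove_prerequisite(parents, prereq, course):
    parents[course].discard(prereq)


# Assumed original chain: each course lists its prerequisites (parents).
pattern = {
    "Precalc": set(),
    "Calculus I": {"Precalc"},
    "Calculus II": {"Calculus I"},
    "Diff. Eqs": {"Calculus II"},
    "Central Course": {"Diff. Eqs"},
}
# Hypothetical revision: the new course replaces Diff. Eqs as the prerequisite.
add_course(pattern, "ENG 101")
remove_prerequisite(pattern, "Diff. Eqs", "Central Course")
add_prerequisite(pattern, "ENG 101", "Central Course")
```

Checking acyclicity after each added edge keeps the restructured graph a valid BN structure.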


We denote the restructured BN by R, characterizing the revised
curricular pattern. Once R is constructed, we fit the CPTs using
actual student data, denoted by D, and then compute the likelihood
of R, p(D|R). Similarly, we denote by O the BN of the original
curricular pattern. Once it is constructed, we compute its likelihood,
p(D|O). If p(D|R) is greater than p(D|O), then the revised version
of the curricular pattern fits the student data better than the
original one. In this case, the proposed revised curricular pattern
can be a good candidate to replace the original one. To elaborate
on this, we present a case study in the following section.
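
The likelihood comparison described above can be sketched in a few lines.
This is only an illustration under simplifying assumptions: the data are
fully observed and the CPTs are fit by relative frequencies, whereas the
case study below relies on EM to handle hidden and missing values, which
this sketch does not implement. The toy records, the two candidate
structures and the function name log_likelihood are hypothetical.

```python
import math
from collections import Counter

def log_likelihood(records, parents):
    """Log-likelihood of complete discrete data under a BN whose CPTs are
    fit by maximum likelihood (relative frequencies) on the same data.

    records: list of dicts mapping variable name -> state
    parents: dict mapping variable name -> tuple of parent names (a DAG)
    """
    total = 0.0
    for var, pa in parents.items():
        joint = Counter()   # counts of (parent states, child state)
        margin = Counter()  # counts of parent states alone
        for row in records:
            key = tuple(row[p] for p in pa)
            joint[(key, row[var])] += 1
            margin[key] += 1
        # Each CPT entry is n / margin[key]; it is used n times in the data.
        for (key, _state), n in joint.items():
            total += n * math.log(n / margin[key])
    return total

# Hypothetical toy grades for three courses (not the UNM data).
records = [
    {"C1": "A", "C2": "A", "C3": "A"},
    {"C1": "A", "C2": "A", "C3": "B"},
    {"C1": "B", "C2": "B", "C3": "B"},
    {"C1": "B", "C2": "B", "C3": "A"},
    {"C1": "C", "C2": "C", "C3": "C"},
    {"C1": "C", "C2": "B", "C3": "C"},
]

# Two candidate structures over the same variables.
chain = {"C1": (), "C2": ("C1",), "C3": ("C2",)}  # C1 -> C2 -> C3
no_edges = {"C1": (), "C2": (), "C3": ()}          # all courses independent

ll_chain = log_likelihood(records, chain)
ll_indep = log_likelihood(records, no_edges)
ratio = math.exp(ll_chain - ll_indep)  # likelihood ratio of the two structures
print(ll_chain, ll_indep, ratio)
```

A ratio above 1 means the first structure fits the data better. Note that
with CPTs fit on the same data, adding edges between observed variables
can never lower the likelihood, so such a comparison is most informative
when the candidate structures differ in kind, as O and R do here.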


2.1 Engineering Curricular Pattern: A Case 
Study 


In this section, we consider, as a case study, the engineering
curricular patterns shown in Figure 1 and Figure 2. Recall that the graph
in Figure 1 represents the original pattern whereas that in Figure 2
represents the revised one. For these two graphs, we construct two
BNs denoted by O and R respectively. These two BNs are shown in




Course Number Course Name 
MATH 150 Precalculus 
MATH 162 Calculus I 
MATH 163 Calculus II 
MATH 264 Calculus III 
MATH 316 Differential Equations 
PHYC 160 General Physics I 
PHYC 161 General Physics II 

ECE 203 Circuits I 
ENG 101 New Course Proposed 


Table 1: Engineering courses taught at freshman and sopho- 
more level at UNM. 


Figure 4 and Figure 5. The variables used to construct O and R are
shown in Table 1. These variables represent actual courses taught at
the University of New Mexico (UNM). The states of each of these
variables are the letter grades: A, B, C and D/F, where D and F are
assumed to represent one state. The CPTs for both O and R are
computed using a dataset, denoted by D, for 1,000 UNM students. Each
row in this dataset contains the letter grades achieved in the courses
shown in Table 1. Some grades for some students were not available.
Thus, the dataset included missing values. In addition, ENG 101
in Figure 5 is considered a hidden variable because it is supposed
to represent the new proposed course, and thus we do not have its
letter grade values. Therefore, to compensate for hidden and missing
values in our dataset, we used the expectation-maximization
(EM) method to compute the CPTs for O and R (9). Using EM, we
computed the likelihood ratio, p(D|R)/p(D|O), to be 2.89. This means
that R fits the student data better than O. This result suggests that the
proposed revised curricular pattern is a good candidate to replace
the original one. Not only does the revised curricular pattern have a
less complex structure, but it also has the potential to improve
student performance when compared to the original pattern. We
concluded this using actual student data, which is an advantage over
other proposed models in the literature (2). In the following sections we
present more applications of BNs in the context of course networks.


[Figure 4 graph; node labels recovered: MATH 150, MATH 162, MATH 163, MATH 316, ECE 203]
Figure 4: The original curricular pattern found in the electrical
engineering curriculum.


[Figure 5 graph; node labels recovered: ECE 203, ENG 101, MATH 162, MATH 163, MATH 316]
Figure 5: The revised version of the curricular pattern in the
electrical engineering curriculum.


3. INFLUENCE OF STUDENT CHARACTERISTICS ON ACADEMIC PERFORMANCE


As mentioned earlier, many institutions are dedicating considerable effort
to student success. Colleges and universities are applying ever
more sophisticated analytical tools to track their students' progress
in an attempt to improve their performance, characterized mainly
by graduation and retention rates. Intuitively, early indicators of
student performance in this context are crucial to provide suitable
interventions when needed. Thus predicting the performance of
students in future courses is essential. In this regard,
historical information about previous academic achievement of a 
student could be used to project future performance. For example, 
a student who receives a ‘B’ in Calculus I is expected to receive a 
better grade in Calculus II compared to those who receive a ‘D’. A 
BN in the context of course network can capture the correlation in 
performance between such courses. Further, it can be used to pre- 
dict the letter grades of a student in subsequent classes based on the 
grades of previous classes. The accuracy of prediction can further 
be improved by adding other factors related to student characteris- 
tics. Factors such as age, gender, high school GPA, socioeconomic 
status, etc. have been shown to influence student performance (1). For
this reason, it would be rational to add such factors as additional
variables to the BN of a course network. The advantage of using a BN
model to capture all these variables together is twofold: it can
be used as a knowledge discovery model that can neatly display
the correlation between the variables, and it can be used as an
inference tool to predict student performance. In this section, we
construct a BN for eight engineering courses along with five other 
variables representing student characteristics. The eight courses 
are: MATH 150, MATH 162, MATH 163, MATH 264, MATH 316, 
PHYC 160, PHYC 161 and ECE 203 (Table 1). These courses are
considered to be the most crucial classes at the freshman level. As
for student characteristics, we considered Gender, ACT score, and
high school GPA. As mentioned earlier, student characteristics
have been shown to influence student performance. Thus it would be
interesting to discover and visualize how all these variables are related
to each other. In the following section we show the constructed BN 
along with some applications. However, it is important to mention
here that similar work has been done in (10). The authors of this
work used a domain expert to construct the BN for these variables. 
This means that the process of constructing the BN is not automated 
and doesn’t guarantee a good fit to the student data. In this paper, 
however, we automated the process of learning the structure of the 
BN using a score-based learning algorithm (1). Particularly, we 
used the hill climbing (hc) greedy search that explores the space of 
the DAGs by single-arc addition, removal and reversals. 


3.1 A BN for Engineering Courses 

To construct and validate our framework, we collected a dataset of
3,000 undergraduate students in the College of Engineering at UNM.
The dataset includes demographic and academic variables of the
students. The states of each of these variables are
shown in Table 2. It is important to note that MATH 162 is the
prerequisite for MATH 163, MATH 163 is the prerequisite for MATH 316
and MATH 264, and MATH 316 is the prerequisite for ECE 203.
The constructed BN is presented in the graph shown in Figure 6. It
is tempting to interpret this graph in terms of causality. In particular,
it seems that ACT score, high school GPA and gender, in contrast
to ethnicity, causally influence the performance of students in these
engineering courses. Also, this graph shows that the performance
in PHYC 160 influences that in MATH 264 and ECE 203. This
is an interesting observation because, according to the department
policy, PHYC 160 is not a required prerequisite for either of these
courses, yet it has an impact on both. Another
interesting observation is the absence of any correlation in
performance between MATH 316 and ECE 203 even though MATH 316
is a prerequisite to ECE 203. This observation confirms the fact that
Variable                        States
Gender                          Female, Male
Ethnicity                       7 different ethnicities
ACT score                       Integer between 10 and 36
High school GPA                 Real value between 1 and 4
MATH 150, 162, 163, 264, 316    A, B, C, D/F
PHYC 160, 161                   A, B, C, D/F
ECE 203                         A, B, C, D/F

Table 2: The courses and the student characteristics used to
build the BN.


[Figure 6 graph; node labels recovered: MATH 150, Gender, MATH 163, Ethnicity, PHYC 161, MATH 264, MATH 316, ECE 203]
Figure 6: The constructed BN using UNM student data.


only a small portion of the learning outcomes in MATH 316, namely
the ability to solve linear differential equations, is actually used in
ECE 203, as the absence of a link between these two courses indicates.
This is further evidence supporting the claim that the original
curricular pattern shown in Figure 1 needs to be restructured.


4. CONCLUSION 

In this paper we presented a framework that models a curriculum 
as a Bayesian Network (BN). We showed that this model, in the 
context of a course network, can be used to quantify any effort to 
restructure curricular patterns. In particular, we used the notion of 
hidden variables to achieve this. We validated our proposed model 
using a common curricular pattern found in most of the engineering 
programs. For that, we used actual data for students at the University 
of New Mexico. The results showed that the likelihood of the revised 
version of the engineering curricular pattern is higher than that of the 
original one. This suggests that the revised version can help students 
perform better in their courses as well as graduate at a faster pace. 
The advantage of our model over other proposed models in literature 
is its data-driven nature which makes it more reliable. Furthermore, 
we extended our model to use it as a knowledge discovery and
inference tool. In particular, we added variables related to student
characteristics (e.g., gender, ACT score, and high school GPA)
and showed how they can influence student performance. We also 
showed how to exploit the constructed BN to predict the grades 
of a student in following semesters based on grades of previous 
semesters. 
ABSTRACT


Feedback is a crucial element of a student’s learning pro- 
cess. It enables students to identify weaknesses and improve 
self-regulation. However, studies show this to be an area of 
great dissatisfaction in higher education. With ever-growing 
course participation numbers, delivering effective feedback is 
becoming an increasingly challenging task. Hence, this pa- 
per explores the use of automated content analysis to exam- 
ine feedback provided by instructors for good feedback prac- 
tices measured on self, task, process, and self-regulation lev- 
els. For this purpose, four binary XGBoost classifiers were 
trained and evaluated, one for each level of feedback. The 
results indicate effective classification performance on self, 
task, and process levels with accuracy values of 0.87, 0.82, 
and 0.69, respectively. Additionally, inter-language transfer- 
ability of feedback features is measured using cross-language 
classification performance and feature importance analysis. 
Findings indicate a low generalizability of features between 
English and Portuguese feedback spaces. 


1. INTRODUCTION 


Despite widespread recognition of feedback’s importance to 
learning [23, 29, 10], much of the current literature indicates 
a pervasiveness of low quality feedback in higher education 
[13]. Feedback quality is consistently rated one of the great- 
est causes of dissatisfaction for higher education students 
[9]. Learning analytics (LA) researchers are actively exploring automated
feedback solutions that can enable instructors to efficiently
identify and employ good feedback practices, and improve the
speed of feedback delivery to students [15]. In that vein,
several studies [17, 19, 28, 30] have examined the use of
data mining methods to generate automated textual feedback.
These analyses are often limited to domain-specific
areas such as computer programming or writing, or lack
grounding in educational theory. Much less work has gone
into the exploration of automated domain-agnostic analysis
to identify good feedback practices [4, 24]. Progress in
such areas can enhance the instructor’s ability to provide ef- 
fective feedback comments and analyze features associated 
with good feedback practices for generalizable feedback gen- 
erators. Therefore, this study aims to answer the following 
Research Questions (RQs): 


1. To what extent can the automated analysis of feedback 
messages be used to identify good feedback practices? 


(a) How accurate are the predictions that are made 
about these feedback practices? 


(b) What are specific features of text that can be used 
to predict the use of good feedback practices? 


2. How transferable are the identified feedback features 
to text written in different languages? 


2. METHOD 
2.1 Data 


The dataset used in the current study consisted of feedback 
comments provided by instructors in Learning Analytics, 
Software Engineering, and Environmental Studies courses. 
A total of 2,092 observations were taken; 1,000 Portuguese 
records and 1,092 English records. 


2.2 Coding Scheme 


This study utilized Hattie and Timperley’s [14] four levels
of feedback because of their suitability for textual analysis, owing to
their focus on learning tasks, learning processes, and self-regulation
(Cavalcanti et al. [4]). Hence, feedback examples were coded
using Hattie and Timperley’s [14] proposed four levels of
feedback (see Table 1).


Feedback examples were coded by experts using instructions 
of Hattie and Timperley’s [14] study. Each feedback record 
was examined by two expert coders separately. After this 
step, the differences between each pair of experts were
compared. For the Portuguese feedback examples, the inter-rater
agreement reached 72.2% with a Cohen’s kappa (inter-rater
agreement corrected for chance [7]) of 0.44. The English
feedback comments had an inter-rater agreement of 63.8% and
a Cohen’s kappa of 0.38. These measures met expectations for
content analysis experimentation [20].
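
To make the two agreement measures concrete, the following minimal sketch
computes raw agreement and Cohen’s kappa for two coders. The label lists
are hypothetical and are not taken from the study’s annotations; the
function name cohens_kappa is likewise only illustrative.

```python
from collections import Counter

def cohens_kappa(rater_a, rater_b):
    """Cohen's kappa for two raters labelling the same items."""
    if len(rater_a) != len(rater_b) or not rater_a:
        raise ValueError("both raters must label the same non-empty item set")
    n = len(rater_a)
    # Observed agreement: share of items given the same label by both raters.
    observed = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    # Chance agreement: product of each rater's label proportions, summed.
    freq_a = Counter(rater_a)
    freq_b = Counter(rater_b)
    expected = sum(freq_a[label] * freq_b[label] for label in freq_a) / (n * n)
    if expected == 1.0:
        return float("nan")  # kappa is undefined when chance agreement is 1
    return (observed - expected) / (1.0 - expected)

# Hypothetical binary codes (1 = feedback level present) from two coders.
coder_1 = [1, 1, 0, 0, 1, 0, 1, 0, 0, 0]
coder_2 = [1, 0, 0, 0, 1, 0, 1, 1, 0, 0]

agreement = sum(a == b for a, b in zip(coder_1, coder_2)) / len(coder_1)
kappa = cohens_kappa(coder_1, coder_2)
print(agreement, kappa)
```

In this toy case the coders agree on 8 of 10 items, but because agreement
of 0.52 would be expected by chance, kappa is noticeably lower than the
raw agreement, which mirrors the gap between the two figures reported
above.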
Table 1: Four levels of feedback identified by Hattie and Timperley [14].
Each level specifies different elements that the feedback is targeting and
can be regarded as hierarchical, ranging from general comments made about
the student themselves up to directives on how to improve self-regulation.

Level | Description | Example

Feedback about the self (FS) | Personal evaluations about the learner |
“You are a bright student”

Feedback about the task (FT) | How well tasks are understood or performed |
“You need to include more about the Treaty of Versailles.”

Feedback about the process (FP) | Processes needed to understand or
perform tasks | “You need to edit this piece of writing by attending to the
descriptors you have used so the reader is able to understand
the nuances of your meaning.”

Feedback about self-regulation (FR) | How to improve self-regulation |


Table 2: Number of instances for each class in the training
and test datasets for each level of feedback.

Level  Split  Class 0         Class 1        Total
FS     Train  1149 (82.19%)   249 (17.81%)   1398 (70%)
FS     Test    567 (82.17%)   123 (17.83%)    690 (30%)
FT     Train   602 (43.06%)   796 (56.94%)   1398 (70%)
FT     Test    297 (43.04%)   393 (56.96%)    690 (30%)
FR     Train  1290 (92.27%)   108 (7.73%)    1398 (70%)
FR     Test    637 (92.32%)    53 (7.68%)     690 (30%)
FP     Train   808 (57.80%)   590 (42.20%)   1398 (70%)
FP     Test    399 (57.83%)   291 (42.17%)    690 (30%)


The annotation process led to a dataset with four sets of 
binary classes: class 0 if a feedback message did not belong 
to a particular level; class 1 if the feedback message belonged 
to the feedback level. 


2.3 Feature Engineering 

Feature extraction was informed by relevant studies [4, 24, 
16]. The studies promote the use of linguistic features such 
as those developed in LIWC (Linguistic Inquiry and Word 
Count) [27] and Coh-Metrix [11] over traditional textual
features such as lexical N-grams or Part-Of-Speech tags. According
to Kovanović et al. [16], such traditional features encourage overfitting
by inflating the feature space. Additionally, they are data
dependent and thus make it difficult to define the feature space
beforehand [16]. Hence, we
used feature sets that incorporated 86 LIWC [27] features, 
78 Coh-Metrix [11] features, and two additional features, 
which are relevant to this content area — number of named 
entities and language of delivered feedback. 


2.4 Analysis 
2.4.1 Data Analysis and Pre-processing 


For the general classifier, feedback examples from both the 
English and Portuguese datasets were combined and split 
into 70% training and 30% test sets (Table 2). The training 
data suffered from class imbalances; particularly at the FS 
and FR levels. 


[Table 1, continued] Example for feedback about self-regulation (FR):
“You already know the key features of the opening of an
argument. Check to see whether you have incorporated
them in your first paragraph.”

2.4.2 Handling Class Imbalance

Studies have shown that class imbalances can have a negative
impact on model prediction performance [26]. To alleviate
the class imbalance problem, sampling algorithms are often
employed to adjust the ratio of represented classes. SMOTE
is a popular oversampling method that operates in the feature
space of the minority class and generates synthetic data points
as linear combinations of existing minority data points and
their nearest neighbours [5].
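
The interpolation idea behind SMOTE can be illustrated with a short
sketch. This is a simplified stand-in written for illustration, not the
implementation used in the study: the function name smote_sketch, the
neighbour count k and the toy points are assumptions.

```python
import random

def smote_sketch(minority, n_new, k=2, seed=0):
    """Create n_new synthetic minority points. Each one lies on the line
    segment between a randomly chosen minority point and one of its k
    nearest minority neighbours (requires at least two minority points)."""
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        idx = rng.randrange(len(minority))
        base = minority[idx]
        # Rank the other minority points by squared Euclidean distance.
        others = [p for j, p in enumerate(minority) if j != idx]
        others.sort(key=lambda p: sum((a - b) ** 2 for a, b in zip(base, p)))
        neighbour = rng.choice(others[:k])
        gap = rng.random()  # interpolation weight in [0, 1)
        synthetic.append(
            tuple(b + gap * (nb - b) for b, nb in zip(base, neighbour))
        )
    return synthetic

# Hypothetical two-feature minority-class records.
minority = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0), (1.0, 1.0)]
new_points = smote_sketch(minority, n_new=5)
print(new_points)
```

Because every synthetic point is a convex combination of two existing
minority points, the new records stay inside the region already occupied
by the minority class rather than duplicating records exactly.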


2.5 Model Selection and Evaluation — RQ1a 


Decision tree ensembles are widely regarded classification 
algorithms that are well suited to feedback analysis [4, 24]. 
This is due to their white-box properties, easy interpretabil- 
ity, high accuracy and ability to identify important features 
in a dataset [4, 24, 6, 8]. 


This study employed a gradient-boosted decision tree implementation
called XGBoost [6]. XGBoost has been shown to outperform Random
Forest on numerous classification tasks [22, 31]. The
algorithm utilizes gradient boosting, which involves sequentially
combining models (in this case, decision trees) that
predict the residuals or errors of previous models at each
iteration to improve overall accuracy [6]. XGBoost was chosen
for its superior accuracy and its implicit analysis
of feature importance [6]. Four binary XGBoost classifiers
were trained, one for each level of feedback.
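
The residual-fitting loop at the heart of gradient boosting can be
sketched as follows. This is a toy illustration with one-split trees
(stumps), squared error and a single feature; it is not XGBoost, which
adds regularization, second-order gradient information and many
engineering optimizations. The function names, learning rate and data
are assumptions made for the example.

```python
def fit_stump(xs, residuals):
    """Best single-threshold split on one feature, minimising squared error.
    Assumes xs contains at least two distinct values."""
    best = None
    for threshold in sorted(set(xs)):
        left = [r for x, r in zip(xs, residuals) if x <= threshold]
        right = [r for x, r in zip(xs, residuals) if x > threshold]
        if not left or not right:
            continue
        left_mean = sum(left) / len(left)
        right_mean = sum(right) / len(right)
        error = (sum((r - left_mean) ** 2 for r in left)
                 + sum((r - right_mean) ** 2 for r in right))
        if best is None or error < best[0]:
            best = (error, threshold, left_mean, right_mean)
    _, threshold, left_mean, right_mean = best
    return lambda x: left_mean if x <= threshold else right_mean

def boost(xs, ys, n_rounds=20, learning_rate=0.5):
    """Sequentially add stumps, each fit to the current residuals."""
    base = sum(ys) / len(ys)          # initial prediction: the mean target
    stumps = []
    predictions = [base] * len(xs)
    for _ in range(n_rounds):
        residuals = [y - p for y, p in zip(ys, predictions)]
        stump = fit_stump(xs, residuals)
        stumps.append(stump)
        predictions = [p + learning_rate * stump(x)
                       for x, p in zip(xs, predictions)]
    return lambda x: base + learning_rate * sum(s(x) for s in stumps)

# Hypothetical one-feature data with a step in the target.
xs = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
ys = [0.0, 0.0, 0.0, 1.0, 1.0, 1.0]
model = boost(xs, ys)
mse = sum((model(x) - y) ** 2 for x, y in zip(xs, ys)) / len(xs)
print(mse)
```

Each round shrinks the remaining error by fitting a new weak learner to
what the current ensemble still gets wrong, which is the behaviour the
paragraph above describes.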


2.5.1 Feature Analysis — RQ1b

The outputs of decision tree models can be analyzed with
tools such as SHAP (SHapley Additive exPlanations) [18]. Given
a machine learning model and data records as input,
SHAP leverages the concept of Shapley values by measuring
the average marginal contribution of a feature over all
possible permutations. SHAP can diagnose the most impactful
features using their SHAP value, which is the mean absolute
contribution of each feature [18]. A higher SHAP value for
a feature implies greater importance compared to another
feature.
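
The notion of an average marginal contribution over all permutations can
be made concrete with a brute-force sketch. The code below computes exact
Shapley values for a tiny hand-written model; it does not use the SHAP
library, which relies on much more efficient algorithms, and the toy
model, the baseline vector and the function names are assumptions.

```python
from itertools import permutations

def shapley_values(model, instance, baseline):
    """Exact Shapley values for one instance. Features not yet 'added' in
    an ordering take their baseline value; the marginal contribution of a
    feature is the change in model output when it is switched in."""
    n = len(instance)
    totals = [0.0] * n
    orderings = list(permutations(range(n)))
    for order in orderings:
        current = list(baseline)
        previous = model(current)
        for i in order:
            current[i] = instance[i]
            value = model(current)
            totals[i] += value - previous
            previous = value
    return [t / len(orderings) for t in totals]

def toy_model(x):
    # Hypothetical model: one additive term and one interaction term.
    return 2.0 * x[0] + x[1] * x[2]

phi = shapley_values(toy_model, instance=[1.0, 2.0, 3.0],
                     baseline=[0.0, 0.0, 0.0])
print(phi)
```

The contributions sum to the difference between the model output for the
instance and for the baseline, and the interaction term is shared equally
between the two features involved. Averaging the absolute values of such
contributions over many records gives a per-feature importance of the
kind reported in Table 4.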


2.5.2 Feature Transferability — RQ2 

To measure the transferability of features across languages, 
the dataset was split by language, creating Portuguese and 
English feedback datasets. Each of these datasets was split 
into training and test splits (70% training and 30% test set), 
and binary classifiers were trained and tuned, resulting in 
English feedback trained classifiers, and Portuguese feed- 
back trained classifiers for each level of feedback, with the 
exception of the FR level. For the Portuguese feedback ex- 
amples, the FR level had just eight positive instances out 
Table 3: Performance of the classifiers trained to address
research question RQ1 on the combined dataset involving
both the English and Portuguese datasets. Legend: ACC:
Accuracy; K: Cohen’s kappa; F1: F1 Score.

                 | FS               | FT               | FR               | FP
Class Balancing  | ACC   K     F1   | ACC   K     F1   | ACC   K     F1   | ACC   K     F1
None             | 0.88  0.52  0.58 | 0.82  0.64  0.83 | 0.92  0.00  0.00 | 0.68  0.33  0.57
SMOTE            | 0.87  0.51  0.58 | 0.82  0.65  0.83 | 0.91  0.04  0.07 | 0.69  0.35  0.59


of 1,000 records, which was not enough to train a machine 
learning algorithm [12]; hence, this level of feedback was 
excluded from all transferability analysis. 


Once the English and Portuguese trained classifiers were
developed, feature transferability was measured by: i) the
inter-language prediction performance: the prediction
performance (measured by accuracy, F1 score and Cohen’s kappa)
of the English trained classifier on the English test set was
compared to its performance on the Portuguese
test set for the FS, FT, and FP levels of feedback, and the same
process was repeated for the Portuguese trained classifier; ii)
a comparison of significant features: the most important
features for the English and Portuguese trained classifiers
were compared at the FS, FT and FP levels of feedback.


3. RESULTS AND DISCUSSION 


The goal of this study was to examine how accurately one 
can model the feature space of good feedback practices, and 
how this feature space varies across languages. In that vein, 
four research questions were answered using novel statisti- 
cal learning methodologies, with a view of promoting good 
feedback practices at scale. 


3.1 Model Performance — RQ1a


Research question RQ1 focused on investigating the extent
to which the automated analysis of feedback messages can be
used to identify good feedback practices. Four binary
classifiers were developed using a variety of features (see
Section 2.3). The best performing models were effective in
identifying FS, FT and FP. While not a direct comparison due to
the addition of the English feedback examples, the models achieved
better results than those reported by Cavalcanti et al. [4]. The
classifiers improved accuracy by 0.07 and 0.05
for FT and FP, respectively, and increased kappa values by
0.11, 0.35, and 0.06 for FS, FT, and FP, respectively.


Similar to previous works [4], the FR classifier was not as
effective in identifying instances. The model obtained a poor
kappa of 0.06, which was likely caused by the model’s poor
ability to detect positive cases of FR. Poor performance
on this level was due to the considerably smaller number of
positive instances compared to the other levels of feedback.


3.2 Feature Analysis — RQ1b


The focus of research question RQ1b was analyzing the most 
important textual features associated with the four levels of 
feedback. Hattie and Timperley [14] state that FS involves 
evaluations of the person, which are often a form of praise. 
The current findings add weight to this claim, as those fea- 
tures found to be most important in predicting the FS level 
were affective processes (particularly, positive emotions) and 
social processes, which align with the concept of praise. FS 


is often thought to be the least effective level of feedback 
[3, 14] and relatedly, the FS classifier had a negative as- 
sociation with discrepancy words; this might indicate FS 
comments have little actionable information or insight. 


FT is sometimes referred to as corrective feedback and pro- 
vides information on details related to task accomplishment 
such as correctness or behavior. Accordingly, this study 
found the predictors most associated with FT were those 
that related to the amount of information provided. Specifically,
higher values of word counts, frequency of content
words and minimum frequency of content words, all of which can
be linked to greater information, were positively correlated
with observance of FT. Hattie and Timperley [14] suggest
that instructors not rely solely on FT, but rather view
it as a process that moves the student to FP and FR. This
theory is backed by the finding of strong negative associa- 
tion of causation words and FT; hence, FT comments were 
less likely to illustrate the causes of the student’s failings, 
which is essential for the learner’s self-regulation [14, 3, 21]. 


Compared to FT, FP is believed to promote a deeper un- 
derstanding of learning as it enables the identification of re- 
lationships between resources and output, and the develop- 
ment of stronger cognitive processes. To achieve this, Balzer 
et al. [1] state FP should concern information about actual 
relations in the learning environment, relations which have 
been recognized by the learner, and relations between the 
learning environment and the learner’s perceptions. There- 
fore, the value of FP comes from providing useful informa- 
tion on relationships. The findings of this study corroborate 
the theoretical views of FP. Amongst the most important 
features for FP were frequency of content words, adverbs, 
negative connectives and discrepancy words. These imply 
that FP comments were tied to providing new and corrective 
information. Other significant features can be tied back to 
relationships; including frequency of semicolons (semicolons 
are often used to link together ideas) and features associated 
with space and relativity. 


According to Butler [3], one of the goals of FR should be 
to improve the student’s ability to monitor current progress 
and use that information to form effective learning strate- 
gies. Accordingly, some of the most important predictors of 
FR were greater present and future focused processes. 


3.3 Feature Transferability — RQ2

To address research question RQ2, we studied inter-language 
classifier performance, and compared the most significant 
features for classifiers trained on different language feed- 
back. Barbosa et al. [2] used similar linguistic features to 
those used in this project, such as LIWC and Coh-Metrix, 
to study cross-language classification of cognitive presence 
in online discussions, and found features to be independent 
of language; hence, we expected to find a moderate level of 
generalizability of feedback features across languages. However,
our findings indicate a low transferability of feedback
features. As seen in Table 5, the average accuracy differential
on inter-language performance amounted to -0.06, -0.59,
and -0.26, while the average kappa differential was
approximately -0.50, -0.27, and -0.33 for FS, FT, and FP,
respectively. Likewise, the Portuguese and English trained
classifiers showed minimal overlap in their most important 
Table 4: Top 10 important features are measured using SHAP and displayed
from most to least important for FS, FT, FR and FP classifiers.

FS | FT
Variable | Description | SHAP | Variable | Description | SHAP
liwc.Exclam | Freq. of exclamation marks | 1.02 | cm.WRDFRQa | Freq. of all words | 0.46
liwc.posemo | Freq. of words with positive emotion | 0.73 | cm.WRDFRQc | Freq. of content words | 0.39
liwc.you | Freq. of the word “you” | 0.24 | cm.WRDFRQmc | Minimum freq. of content words | 0.34
liwc.affect | Freq. of affective words | 0.20 | cm.DRNP | Noun phrase density | 0.10
cm.SYNMEDlem | Minimal edit distance of lemmas | 0.20 | cm.DRAP | Adverbial phrase density | 0.10
cm.WRDFRQc | Freq. of content words | 0.15 | liwc.SemiC | Freq. of semicolons | 0.08
liwc.tentat | Freq. of tentative words | 0.15 | cm.DESWLsy | Mean word length | 0.07
liwc.reward | Freq. of words associated with reward | 0.14 | liwc.adverb | Freq. of adverbs | 0.07
liwc.informal | Freq. of informal words | 0.14 | liwc.social | Freq. of words related to social processes | 0.07
cm.WRDPRP2 | Freq. of second person pronouns | 0.14 | liwc.article | Freq. of articles | 0.07

FR | FP
Variable | Description | SHAP | Variable | Description | SHAP
cm.CRFNO1 | Noun overlap between adjacent sentences | 0.56 | liwc.SemiC | Freq. of semicolons | 0.39
cm.WRDPRP3s | Freq. of third person pronouns | 0.50 | cm.LSASS1 | LSA measure of semantic coherence | 0.19
cm.CRFSO1 | Word stem overlap between adjacent sentences | 0.43 | cm.CNCNeg | Freq. of negative connectives | 0.12
cm.DRAP | Adverbial phrase density | 0.35 | liwc.adverb | Freq. of adverbs | 0.11
cm.CRFCWOa | Content word overlap of all sentences | 0.25 | cm.DESWLltd | Standard deviation of average no. of letters/word | 0.09
liwc.risk | Freq. of risk related words | 0.23 | liwc.space | Freq. of words related to space | 0.09
liwc.differ | Freq. of words related to differentiation | 0.21 | liwc.verb | Freq. of verbs | 0.08
liwc.focusfuture | Freq. of future focus words | 0.21 | liwc.shehe | Freq. of third person singular pronouns | 0.07
liwc.focuspresent | Freq. of present focus words | 0.20 | cm.SYNLE | Mean no. of words before the main verb | 0.06
liwc.affiliation | Freq. of affiliation words | 0.16 | liwc.discrep | Freq. of words associated with discrepancy | 0.06


features across all levels of feedback.

One possible explanation for this finding might be the difference
in courses represented in the English and Portuguese
datasets. English feedback examples were primarily from
STEM related courses, including Environmental Studies and
Software Engineering, while Portuguese feedback examples
had more of a mix, hailing from Biology and Literature
courses. Hence, the different nature of represented courses
might have influenced the transferability analysis.

Another explanation for the low transferability of features
might be the cultural differences in communication. For
instance, at the FS level of feedback, we observed greater
association of friendship and social processes for the English
feedback; i.e. English instructors might have displayed a
greater level of familiarity with students. As an instructor
can be viewed as an authority figure, this difference might
be related to whether a culture is “horizontal”, and therefore
emphasizes equality, or “vertical”, and emphasizes hierarchy
[25]. The implications of this finding would indicate instructors
will need to consider the cultural backgrounds of the
learner while delivering feedback for improved efficacy.

Table 5: For RQ2 classifiers are exclusively trained on English
(EN) and Portuguese (PT) feedback examples. Performance
of each classifier is measured against EN and PT
feedback examples. Legend: ACC: Accuracy; K: Cohen’s
kappa; F1: F1 Score.

                     | FS                | FT                | FP
Classifier     Test  | ACC   K      F1   | ACC   K      F1   | ACC   K      F1
EN Classifier  EN    | 0.88  0.42   0.52 | 0.69  0.18   0.81 | 0.66  0.23   0.49
EN Classifier  PT    | 0.85  0.03   0.04 | 0.11  0.00   0.00 | 0.49  -0.02  0.30
PT Classifier  EN    | 0.79  0.06   0.12 | 0.28  0.00   0.43 | 0.35  0.00   0.52
PT Classifier  PT    | 0.94  0.74   0.77 | 0.91  0.49   0.95 | 0.78  0.56   0.79


4. CONCLUSION AND FUTURE RESEARCH

This study proposed four main contributions. First, this
study explored how accurately a trained model can identify
the presence of different feedback practices. The constructed
classifiers, using primarily linguistic and psychological
features, were effective in identifying the presence of FT, FP
and FS levels of feedback and showed better performance
than similar works in this content area; however, the FR
classifier was marred by a lack of adequate data. The
implications of these results provide a proof of concept for a
tool that can automatically analyze and potentially diagnose
the contents of an instructor’s feedback. This promotes the
understanding and utilization of good feedback practices to
improve their efficacy on learner adoption.

Another goal of this paper was to identify the prominent
textual features of good feedback practices. Identified
features were able to corroborate the findings of educational
research on feedback theory. The presented findings can be
further used to inspire the design of future automated
feedback generators, e.g., intentionally including the prominent
terms specific to different feedback practices when generating
feedback.


Finally, this study conducted an analysis of the transferabil- 
ity of feedback features across languages. Feedback tools 
should be generalizable enough to cater to a variety of lan- 
guages. By analyzing the transferability of feedback fea- 
tures across languages, this study aimed to enhance the 
global adaptability of current and future feedback tools. The 
findings indicate feedback features have low transferability 
between feedback examples delivered in English and Por- 
tuguese. However, a more expansive study is suggested, 
with a greater size and variety of feedback from different 
languages. 
ABSTRACT


Academic integrity has been a frequently reported challenge in on- 
line education. Given the widespread transition to online program 
delivery during the COVID-19 pandemic, we ask the following 
question: How do college students feel about online cheating? Our
analysis is based on academic discussions on the Reddit social cu- 
ration platform in Fall 2020 and, for comparison, Fall 2019. We 
found more discussions related to cheating in 2020 than in 2019, 
and the topics have expanded from plagiarism in programming as- 
signments to online assessments in general. Topic modelling of the 
Fall 2020 discussions revealed three concerns raised by students: 
that cheating inflates grades and forces instructors to increase the 
difficulty of assessments; that witnessing cheating go unpunished 
is demotivating; and that academic integrity policies are not always 
communicated clearly. 


Keywords 


academic integrity, online education, social media, text mining 


1. INTRODUCTION 

Recent studies have reported that online academic misconduct has 
increased during the COVID-19 pandemic [12, 6, 2, 4, 3, 18]. We 
therefore ask the following question in this paper: How do college 
students feel about online cheating? To answer this question, we
turn to the Reddit social curation platform (reddit.com). Reddit hosts
over 100,000 user-created discussion communities referred to as
subreddits. Within a subreddit, users create posts that other users
comment on. Subreddit names begin with “r/” and correspond to 
the subreddit topic, e.g., r/politics or r/relationship_advice. 


Descriptive subreddit names make it easy to locate discussions 
about specific topics or discussions initiated by various kinds of 
users. Of interest to our study are over 80 subreddits corresponding 
to Canadian and U.S. universities, which we call academic subred- 
dits. We collected all posts and comments on academic subreddits 
created during the Fall 2019 and Fall 2020 semesters (September 
through December inclusive) that match at least one keyword re- 
lated to cheating, such as ‘cheat’ or ‘misconduct’. 
Our analysis consists of two steps. First, collecting data from the
same time period in 2019 and 2020 allows us to compare cheating- 
oriented discussions from before the pandemic, when classes were 
held in person, and during the pandemic, with most courses deliv- 
ered online. To do so, we train a logistic regression classifier to 
distinguish between Fall 2019 and Fall 2020 content based on the 
words used. Next, we analyze Fall 2020 discussions in detail. We 
apply the Non-negative Matrix Factorization algorithm [20], which 
clusters posts and comments based on the words used and allows 
us to identify common discussion topics. 


Related Work: Social media have become a go-to source of public 
opinion on a variety of topics. In particular, academic subreddits 
have been analyzed in recent work on students’ mental health [1, 
16], but academic integrity was not discussed. The closest works 
to ours are those in [4] and [5], which interviewed a small set of 
undergraduate students and educators. The participants identified 
some positive aspects of online education, but expressed concerns 
about cheating and the level of difficulty of online assessments. Our 
social media analysis explores these and other concerns in detail. 


2. DATA AND METHODS 


Previous work on students’ mental health [1, 16] identified 83 aca- 
demic subreddits corresponding to major U.S. and Canadian uni- 
versities. We analyze the same subreddits in this paper, listed in 
the first column of Table 1 (U.S.) and Table 2 (Canadian). We col- 
lected all posts and comments on these subreddits from the Fall 
2019 semester, when classes and examinations were held in per- 
son, and the Fall 2020 semester, when most campuses moved to 
online delivery (September-December inclusive). We downloaded 
the data using a publicly-accessible Reddit interface at pushshift.io. 


Next, we retain only those posts and comments that contain at least
one of the following keywords: ‘cheat’, ‘plagiari’, and ‘misconduct’.
We perform substring matching, meaning that ‘plagiari’ also
matches ‘plagiarize’ and ‘plagiarism’. Tables 1 and 2 report the
number of posts (“P”) and comments (“C”) on each U.S. and Canadian
academic subreddit, respectively, in Fall 2019 and Fall 2020.
The “Before” numbers correspond to all posts and comments. The
“After” numbers correspond to posts and comments that matched at
least one cheating-related keyword; note that there are three times
as many such posts and comments in 2020 as in 2019 (7,809 vs.
2,524) even though the total number of posts and comments on academic
subreddits has not changed much from 2019 to 2020 (see the
total “Before” numbers in the last row of Tables 1 and 2).
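As a minimal sketch of the keyword filter described above, assuming case-insensitive matching (the text does not say whether case was ignored); the function name and the sample posts are illustrative only:

```python
# Keyword stems from the text; substring matching lets 'plagiari'
# cover both 'plagiarize' and 'plagiarism'.
KEYWORDS = ("cheat", "plagiari", "misconduct")


def matches_keyword(text, keywords=KEYWORDS):
    """Return True if any keyword occurs as a substring of the text.

    Lower-casing is an assumption; the paper does not state whether
    matching was case-sensitive.
    """
    lowered = text.lower()
    return any(keyword in lowered for keyword in keywords)


# Hypothetical posts, for illustration only.
posts = [
    "My prof thinks I plagiarized my lab report",
    "Where is the best coffee on campus?",
    "Is using a cheat sheet allowed on the midterm?",
]
kept = [post for post in posts if matches_keyword(post)]
```

Note that substring matching also retains in-person phrases such as "cheat sheet", which is consistent with that term appearing among the Fall 2019 indicators reported later.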


Table 1: Number of posts and comments on U.S. academic subreddits in 2019 and 2020 before and after filtering to find cheating-related discussions (C: Comments, P: Posts).

Subreddit                     2020                        2019
                        Before       After          Before       After
                          C      P     C    P         C      P     C    P
UIUC                  39974   6991   160   21     40556   6431   104   10
berkeley              37355   6343   365   69     28537   4637   114    7
Cornell               36235   8139   165   27     22562   3900    45    8
Purdue                34376   6317   148   15     33322   5273    42   11
UCSD                  30589   5798   175   34     28214   5364   106   15
rutgers               29861   6622   269   69     44114   8902   122   16
UMD                   21937   4225   206   28     25794   4631    97    6
SBU                   20521   4301   163   20     28328   5373    63   13
uofm                  19954   3174    79   14     13553   2213    44    5
udub                  17867   3487    82   17     18187   3187    59    6
UWMadison             14870   2447   103   18     14236   2039    33    3
UTAustin              13620   3112    53    7     13866   2811    90    6
utdallas              12763   2235    74    7     20731   3109    25    5
PennStateUniversity   12345   1944    64    5      9620   1610    42    2
msu                   12052   2104    86   10     15066   2329    23    6
NCSU                  11653   1794    72    5     18943   2524    32    1
UVA                   11627   2424    79    9      5071   1084    19    5
rit                   11603   1577    48    2     10768   1643     6    1
nyu                   11034   2952    37    7      5731   1438    10    2
UNCCharlotte          10132   1709    93   12     10700   1508    18    1
USC                    9551   1958    82   15      6800   1419    17    4
Baruch                 9370   2226    94   16      4851   1144    36   12
UPenn                  8886   2083    55   10      4212    997    11    1
UNC                    8347   1644    30    8      3800    790     6    2
byu                    6951    707    39    2      3165    407    25    3
UGA                    6637   1520    20    3      6852   1349     2    0
columbia               6496   1573    55    5      4699    708    22    3
RPI                    5652   1220    70    0      7622   1343     5    0
uichicago              4880    894    46    4      6606   1009    84    1
SJSU                   4661   1068    27    5      5108   1136    18    3
stanford               3944   1223    13    2      3782    882    10    0
bostoncollege          3493   1006     0    0       753    188     0    0
cmu                    3388    657    27    2      2764    517     3    0
washu                  3159    572     4    0      1134    259     0    0
Vanderbilt             2581    555     9    1      1447    311     0    0
Harvard                2219    634     1    1      2294    517     1    0
UMBC                   2036    457    21    3      2479    464     4    0
duke                   2020    469     2    1      1397    317     7    2
mit                    1758    532     3    0      1651    373     4    0
BrownU                 1363    438     2    1      1315    276     0    0
IndianaUniversity      1225    588     1    1      1797    543     9    1
Caltech                 494    130     0    0       220     59     0    0
Total                509479  99849  3122  476    482647  85014  1358  171


We then perform standard text pre-processing. Following previous
work on Reddit topic modelling [10, 16], we remove posts and
comments with fewer than 40 or more than 4000 characters: short
ones are unlikely to be meaningful (and may correspond to URLs),
while long ones may mention more than one topic. We also remove
stopwords and lemmatize the remaining words using the Python
NLTK library.
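The length filter and stopword removal can be sketched as follows. This is a simplified illustration, not the authors' pipeline: the stopword list is a tiny hypothetical stand-in for a full list, the tokenizer is an assumption, and lemmatization is omitted so that the example needs only the standard library.

```python
import re

# Length bounds from the text: drop posts and comments with fewer than
# 40 or more than 4000 characters.
MIN_CHARS, MAX_CHARS = 40, 4000

# Tiny illustrative stopword list (an assumption, not the list used in
# the paper).
STOPWORDS = {"a", "an", "and", "the", "is", "to", "of", "in", "it",
             "that", "my", "i", "on", "was"}


def keep_by_length(text):
    """Return True if the text length is within the accepted bounds."""
    return MIN_CHARS <= len(text) <= MAX_CHARS


def tokenize(text):
    """Lower-case, split into words, and drop stopwords."""
    tokens = re.findall(r"[a-z']+", text.lower())
    return [token for token in tokens if token not in STOPWORDS]


def preprocess(posts):
    """Apply the length filter, then tokenize what remains."""
    return [tokenize(post) for post in posts if keep_by_length(post)]
```

In a fuller version, a lemmatization step would follow `tokenize` so that, for example, inflected forms of the same word share one feature.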


To distinguish between cheating-related discussions before and
during the pandemic, we train a logistic regression classifier to
predict whether a post or comment was written in Fall 2020 or Fall
2019. We use term frequency-inverse document frequency (TF-IDF)
word scores as features in the model. We chose logistic regression
due to its interpretable nature: words with positive coefficients
represent Fall 2020 content and words with negative coefficients
represent Fall 2019. Our model obtained a 10-fold cross-validation
accuracy score of 73%, a precision of 76%, a recall of
96% and an F1-score of 86%.
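To illustrate how coefficient signs separate the two semesters, the sketch below computes TF-IDF features and fits an unregularized logistic regression by plain gradient descent. It is a toy under stated assumptions: the smoothed IDF formula, the optimizer settings, and the four example documents are all hypothetical, and no cross-validation is performed.

```python
import math
from collections import Counter


def tfidf(docs):
    """Return (vocab, rows); rows[i][j] is the TF-IDF of vocab[j] in docs[i]."""
    vocab = sorted({word for doc in docs for word in doc})
    doc_freq = Counter(word for doc in docs for word in set(doc))
    n_docs = len(docs)
    rows = []
    for doc in docs:
        counts = Counter(doc)
        # Smoothed IDF (an assumption; the paper gives no exact formula).
        rows.append([
            counts[word] * (math.log((1 + n_docs) / (1 + doc_freq[word])) + 1)
            for word in vocab
        ])
    return vocab, rows


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def train_logreg(rows, labels, steps=500, rate=0.5):
    """Full-batch gradient descent on the logistic loss."""
    weights = [0.0] * len(rows[0])
    bias = 0.0
    for _ in range(steps):
        grad_w = [0.0] * len(weights)
        grad_b = 0.0
        for x, y in zip(rows, labels):
            error = sigmoid(bias + sum(w * v for w, v in zip(weights, x))) - y
            grad_b += error
            for j, value in enumerate(x):
                grad_w[j] += error * value
        weights = [w - rate * g / len(rows) for w, g in zip(weights, grad_w)]
        bias -= rate * grad_b / len(rows)
    return weights, bias


# Hypothetical pre-processed documents: label 1 = Fall 2020, 0 = Fall 2019.
docs = [
    ["proctor", "exam", "webcam"],
    ["chegg", "exam", "online"],
    ["cheat", "sheet", "exam"],
    ["code", "project", "plagiarism"],
]
labels = [1, 1, 0, 0]

vocab, rows = tfidf(docs)
weights, bias = train_logreg(rows, labels)
coefficients = dict(zip(vocab, weights))
```

Words that occur only in the Fall 2020 examples receive positive coefficients and words that occur only in the Fall 2019 examples receive negative ones, which is the reading applied to the coefficient table in the Appendix.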


(We also tested logistic regression models with additional features,
including word bigrams, the sentiment of the post or comment
(computed using the Valence Aware Dictionary and Sentiment Reasoner
(VADER) [8]) and linguistic features computed using Linguistic
Inquiry and Word Count (LIWC) [17]. After adding these
features, accuracy improved by two percentage points to 75%. However,
none of these additional features were assigned large coefficients and
therefore are not considered further in the remainder of the paper.)


Finally, we apply the Non-negative Matrix Factorization (NMF) 
topic modelling algorithm [20], which was used in prior work on 
Reddit mining [14, 7, 11], on the Fall 2020 posts and comments 
that match at least one cheating-related keyword. We again repre- 
sent each post and comment using the TF-IDF scores of the words 
occurring in it. NMF clusters documents into topics and assigns a 
list of representative terms called topic descriptors to each topic. 
NMF also calculates the “representativeness” score of each topic 
descriptor, and we report the top-10 highest-scoring descriptors for 
each topic. Moreover, we report top-10 frequent word n-grams (for 
n up to three, i.e., sequences of up to three consecutive words) for 
each topic. 
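A compact way to see what NMF does is the classic multiplicative-update scheme for the squared reconstruction error, sketched below in pure Python. This is an illustration only: the solver actually used in the study may differ, and the term list and the small document-term matrix are hypothetical.

```python
import random


def matmul(a, b):
    """Multiply two matrices stored as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def transpose(m):
    return [list(col) for col in zip(*m)]


def nmf(v, k, steps=200, seed=0, eps=1e-9):
    """Approximate non-negative v (docs x terms) by w (docs x k) times h (k x terms)."""
    rng = random.Random(seed)
    n, m = len(v), len(v[0])
    w = [[rng.random() + eps for _ in range(k)] for _ in range(n)]
    h = [[rng.random() + eps for _ in range(m)] for _ in range(k)]
    for _ in range(steps):
        wt = transpose(w)
        num, den = matmul(wt, v), matmul(matmul(wt, w), h)
        h = [[h[i][j] * num[i][j] / (den[i][j] + eps) for j in range(m)]
             for i in range(k)]
        ht = transpose(h)
        num, den = matmul(v, ht), matmul(w, matmul(h, ht))
        w = [[w[i][j] * num[i][j] / (den[i][j] + eps) for j in range(k)]
             for i in range(n)]
    return w, h


def reconstruction_error(v, w, h):
    """Squared Frobenius distance between v and w times h."""
    approx = matmul(w, h)
    return sum((a - b) ** 2
               for row_v, row_a in zip(v, approx)
               for a, b in zip(row_v, row_a))


def top_descriptors(h, terms, topic, count=3):
    """Terms with the largest weights in one row of h."""
    order = sorted(range(len(terms)), key=lambda j: h[topic][j], reverse=True)
    return [terms[j] for j in order[:count]]


# Hypothetical document-term weights (rows: documents, columns: terms).
terms = ["exam", "proctor", "webcam", "grade", "curve", "average"]
v = [
    [1, 2, 2, 0, 0, 0],
    [1, 1, 2, 0, 0, 0],
    [0, 0, 0, 2, 1, 2],
    [0, 0, 0, 1, 2, 2],
]
w, h = nmf(v, k=2)
```

Each row of `h` plays the role of a topic, with its largest entries giving the topic descriptors, while each row of `w` gives a document's weight on every topic.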
Table 2: Number of posts and comments on Canadian academic subreddits in 2019 and 2020 before and after filtering to find
cheating-related discussions (C: Comments, P: Posts).

Subreddit                       2020                        2019
                          Before       After          Before       After
                            C      P     C    P         C      P     C    P
uwaterloo               72244   8372   381   58     88996   9888   130   17
UofT                    54343   8460   701   86     67649   9375   171   23
UBC                     40058   5281   766   42     39416   5039   109   11
uAlberta                33265   7164   341   58     49494   8270   137   23
McMaster                24556   5188   219   45     14932   2638    27    3
mcgill                  21380   3376   167   15     20852   3067    58    6
yorku                   15671   4065   228   46     22078   3862    47    6
CarletonU               15455   2531   207   11     16874   2706    43    2
Concordia               10065   2394   192   27     10292   2185    27    7
uwo                      9717   1856   122   10     11758   1764    35    2
wlu                      8097   1788    97   16      5499   1203    13    4
uvic                     7291   1178    85    3      4756    828    11    3
ryerson                  6503   2282    87    6     14922   2927    37    8
queensuniversity         5234   1107    18    1      4758    824     6    2
umanitoba                4408    861    66    7      3183    717     3    1
uoguelph                 3381    794     ?    8      3691    693     5    2
Dalhousie                1807    401     ?    4      2019    407     6    2
usask                    1177    290     0    0       666    178     0    0
brocku                   1007    366     2    0      1442    329     4    2
memorialuniversity        785    183     6    1       637    147     2    0
UdeM                      422     90     1    0       174     48     0    0
lakeheadu                 119     59     2    1        51     21     0    0
uleth                     112     35     0    0        82     33     0    0
University_Of_Regina       96     30     1    0         8     11     0    0
AcadiaU                    69     29     1    0        60     15     0    0
UQAM                       67     22     0    0        48     17     0    0
uwinnipeg                  65     24     2    1        15     10     0    0
unb                        62     35     0    1         8     12     0    0
laurentian                 33     16     0    0         9      4     0    0
stfx                       32     12     0    0         0      1     0    0
SMUHalifax                 24     17     0    0        21      9     0    0
nipissingu                 13      8     0    0         3      4     0    0
UPEI                       12     10     0    0         1      3     0    0
stthomas                    6      4     0    0         0      3     0    0
BishopUniversity            5      2     0    0         0      4     0    0
UNBC                        3      5     0    0        15     10     0    0
mta                         1      0     0    0         6      6     0    0
cbu                         0      2     0    0         3      1     0    0
MSVU                        0      0     0    0         0      1     0    0
uottawa                     0      0     0    0        83     43     0    0
usherbrooke                 0      0     0    0         0      2     0    0
Total                  337585  58337  3764  447    384501  57305   871  124


Additionally, NMF assigns a closeness score for each document- 
topic pair, indicating how close the document is to a topic. To ob- 
tain more information about the topics produced by NMF, for each 
topic, we manually inspect 5% of the posts and comments with the 
highest closeness scores. 
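Selecting the documents to inspect for a topic amounts to ranking by closeness and keeping the top 5%. A minimal sketch, assuming the scores for one topic are held in a list indexed by document and that the count is rounded up so at least one document is always returned (both are assumptions):

```python
import math


def top_fraction(closeness, fraction=0.05):
    """Return indices of the documents closest to one topic.

    closeness holds one score per document for a single topic. Rounding
    up and returning at least one document are assumptions.
    """
    count = max(1, math.ceil(fraction * len(closeness)))
    ranked = sorted(range(len(closeness)),
                    key=lambda index: closeness[index],
                    reverse=True)
    return ranked[:count]


# Hypothetical closeness scores for 100 documents and one topic.
scores = [index / 100 for index in range(100)]
to_inspect = top_fraction(scores)
```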


NMF requires the number of topics as input. Following previous 
work [15], we run NMF to produce between 5 and 50 topics and 
compute the coherence score for each. Coherence measures the 
extent to which the top representative terms representing each topic 
are semantically related (higher is better). We obtained the highest 
scores for 5 and 20 topics. A preliminary analysis of the NMF 
output at five topics revealed that most topics consisted of several 
discussion themes. This observation suggested that a larger number 
of topics may be more appropriate, and thus we selected 20 topics. 


3. RESULTS 


We begin with the results of our logistic regression analysis, shown
in Table 4 in the Appendix. The most positive coefficients, predicting
Fall 2020 posts and comments, include ‘chegg’ (an online
platform for answering college and high school questions), as well
as words related to online proctoring such as ‘proctor’, ‘proctorio’,
‘zoom’, ‘camera’, ‘webcam’ and ‘privacy’. The most negative
coefficients, predicting Fall 2019 posts and comments, suggest in-person
examinations (‘cheat sheet’, ‘bring’, ‘sit’) and programming
assignments and projects (‘code’, ‘program’, ‘project’).


Next, we move to topic modelling. Table 3 shows the NMF topic 
descriptors, the frequent n-grams, and the percentage of posts and 
comments assigned to each topic. We group the topics into the 
following three categories based on the information in Table 3 and 
manual inspection of a sample of posts and comments. 


Table 3: Fall 2020 topic modelling results

Topic 1 (10.4%)
  Topic descriptors: work, really, time, way, learn, try, hard, help, school, good
  Frequent n-grams: ‘feel like’, ‘work hard’, ‘first year’, ‘high school’, ‘office hour’, ‘mental health’, ‘learn material’, ‘get catch’, ‘make sure’, ‘in person’

Topic 2 (10.3%)
  Topic descriptors: say, academic, email, integrity, case, code, worry, report, flag, mean
  Frequent n-grams: ‘academic integrity’, ‘academic dishonesty’, ‘integrity violation’, ‘academic integrity violation’, ‘get flag’, ‘student conduct’, ‘academic offense’, ‘would say’, ‘get catch’, ‘even though’

Topic 3 (6.4%)
  Topic descriptors: think, probably, pretty, fine, worry, fair, sure, reason, away, good
  Frequent n-grams: ‘think would’, ‘think people’, ‘think get’, ‘like think’, ‘get away’, ‘make sure’, ‘really think’, ‘feel like’, ‘think go’, ‘think make’

Topic 4 (5.7%)
  Topic descriptors: student, university, honest, case, punish, international, chinese, issue, school, conduct
  Frequent n-grams: ‘international student’, ‘student get’, ‘many student’, ‘chinese student’, ‘honest student’, ‘academic integrity’, ‘student would’, ‘mental health’, ‘academic dishonesty’, ‘first year’

Topic 5 (5.5%)
  Topic descriptors: know, want, let, wrong, happen, person, tell, need, mean, consequence
  Frequent n-grams: ‘let know’, ‘want know’, ‘know people’, ‘get catch’, ‘know would’, ‘lot people’, ‘feel like’, ‘know know’, ‘know go’, ‘student know’

Topic 6 (5.4%)
  Topic descriptors: prof, email, mark, ta, ask, tell, send, chance, midterm, try
  Frequent n-grams: ‘prof make’, ‘first year’, ‘email prof’, ‘open book’, ‘feel like’, ‘prof say’, ‘prof ta’, ‘prof would’, ‘make sure’, ‘ask prof’

Topic 7 (5.1%)
  Topic descriptors: question, answer, time, ask, quiz, look, minute, similar, wrong, google
  Frequent n-grams: ‘answer question’, ‘go back’, ‘multiple choice’, ‘short answer’, ‘exam question’, ‘one question’, ‘look answer’, ‘question answer’, ‘question exam’, ‘choice question’

Topic 8 (4.9%)
  Topic descriptors: test, open, book, note, close, online, tab, internet, easy, search
  Frequent n-grams: ‘open book’, ‘open note’, ‘make test’, ‘take test’, ‘test open’, ‘close book’, ‘book exam’, ‘open book exam’, ‘exam open’, ‘book test’

Topic 9 (4.8%)
  Topic descriptors: people, lot, stop, say, agree, mean, proctor, probably, maybe, care
  Frequent n-grams: ‘people get’, ‘lot people’, ‘many people’, ‘people would’, ‘get catch’, ‘people like’, ‘mental health’, ‘people go’, ‘know people’, ‘feel like’

Topic 10 (4.8%)
  Topic descriptors: like, feel, sound, look, yeah, lol, bad, thing, lot, shit
  Frequent n-grams: ‘feel like’, ‘seem like’, ‘look like’, ‘sound like’, ‘something like’, ‘even though’, ‘would like’, ‘make feel’, ‘online school’, ‘like people’

Topic 11 (4.7%)
  Topic descriptors: exam, proctor, final, online, open, book, sheet, time, hour, note
  Frequent n-grams: ‘take exam’, ‘final exam’, ‘open book’, ‘online exam’, ‘make exam’, ‘proctor exam’, ‘write exam’, ‘take home’, ‘home exam’, ‘person exam’

Topic 12 (4.5%)
  Topic descriptors: use, software, proctor, proctorio, computer, browser, note, flag, lockdown, webcam
  Frequent n-grams: ‘lockdown browser’, ‘secondary device’, ‘make sure’, ‘proctor software’, ‘take exam’, ‘get flag’, ‘student use’, ‘use respondus’, ‘virtual machine’, ‘use note’

Topic 13 (4.5%)
  Topic descriptors: course, year, average, math, midterm, final, assignment, fail, term, quiz
  Frequent n-grams: ‘first year’, ‘take course’, ‘last year’, ‘math course’, ‘feel like’, ‘midterm final’, ‘year course’, ‘course average’, ‘final exam’, ‘class average’

Topic 14 (4.4%)
  Topic descriptors: class, curve, online, semester, average, fail, homework, lot, easy, problem
  Frequent n-grams: ‘take class’, ‘class average’, ‘online class’, ‘class get’, ‘one class’, ‘feel like’, ‘math class’, ‘class take’, ‘in person’, ‘make sure’

Topic 15 (4.2%)
  Topic descriptors: grade, curve, average, semester, high, final, letter, higher, better, good
  Frequent n-grams: ‘good grade’, ‘letter grade’, ‘final grade’, ‘get good’, ‘get good grade’, ‘grade get’, ‘get grade’, ‘grade inflation’, ‘grade curve’, ‘better grade’

Topic 16 (4%)
  Topic descriptors: professor, happen, try, evidence, accuse, report, tell, prove, probably, email
  Frequent n-grams: ‘professor make’, ‘take exam’, ‘make exam’, ‘professor would’, ‘professor might’, ‘make sure’, ‘student professor’, ‘professor try’, ‘in person’, ‘tell professor’

Topic 17 (3.7%)
  Topic descriptors: catch, happen, wonder, lol, hear, dumb, expel, time, lmao, guy
  Frequent n-grams: ‘get catch’, ‘people get’, ‘people get catch’, ‘first time’, ‘catch people’, ‘catch get’, ‘use chegg’, ‘get away’, ‘without get’, ‘without get catch’

Topic 18 (2.8%)
  Topic descriptors: chegg, post, account, use, ip, information, address, answer, view, solution
  Frequent n-grams: ‘use chegg’, ‘ip address’, ‘chegg account’, ‘get catch’, ‘post chegg’, ‘question chegg’, ‘post question’, ‘chegg exam’, ‘chegg answer’, ‘answer chegg’

Topic 19 (2.5%)
  Topic descriptors: group, chat, leave, join, share, report, quiz, snitch, post, want
  Frequent n-grams: ‘group chat’, ‘share answer’, ‘get trouble’, ‘group member’, ‘join group’, ‘class group’, ‘leave group’, ‘academic integrity’, ‘group project’, ‘study group’

Topic 20 (1.4%)
  Topic descriptors: make, harder, sure, sense, hard, easier, difficult, mistake, thing, pretty
  Frequent n-grams: ‘make sure’, ‘make harder’, ‘make exam’, ‘make sense’, ‘harder make’, ‘want make’, ‘make mistake’, ‘make difficult’, ‘make feel’, ‘want make sure’


First, about 40% of the posts and comments include concerns about
cheating leading to grade inflation, which in turn leads to assessments
becoming more difficult. Students have observed grade inflation
(Topic 13) and expressed concerns that Fall 2020 examinations
will be more difficult to reduce the class average (Topics 1 and
20). Moreover, students commented on various methods used by
instructors to combat cheating and reduce grades, such as grading
on a curve (Topics 14 and 15) and using anti-cheating and online
proctoring software (Topics 9 and 11).


Next, students reported feeling demotivated when they know that 
cheating happens in examinations (Topics 4 and 5) and often goes 
unpunished (Topics 3, 10 and 17). Students discussed examples of 
cheating that instructors failed to identify, such as seeking answers 
on Google and question-answering websites such as Chegg (Topics 
7, 8 and 18), and discussing solutions in online chat groups (Topic 
19). 


Finally, students reported concerns about new methods used to pre- 
vent cheating in online examinations. They worried that some legit- 
imate actions may be misconstrued as cheating: looking away from 
the computer screen, accidentally pressing a button, or disconnect- 
ing from a video meeting due to internet connectivity issues (Topics 
6 and 12). Furthermore, some students reported being accused of 
cheating during online examinations, but did not realize they did 
anything wrong (Topics 2 and 16). 


4. CONCLUSIONS 
Logistic regression analysis suggests that cheating-related discussions
on academic subreddits have expanded from plagiarism in
computer programming (representative of Fall 2019) to online assessments
in general. The word ‘chegg’ was associated with Fall
2020 content, suggesting an increase in the use of Chegg and related
websites, which is consistent with prior work [6, 3]. Furthermore,
words indicating online proctoring were predictive of Fall
2020 content, e.g., ‘camera’, ‘webcam’ and ‘record’. Inspection of
the posts and comments containing these terms revealed students’
concerns about their privacy during online examinations. Similar
concerns were raised in recent work [4, 9].


Topic modelling analysis identified three discussion themes in Fall 
2020. First, students believe that cheating causes grade inflation, 
which motivates instructors to make assessments harder and intro- 
duce strict anti-cheating protocols such as not being able to scroll 
back to a previous question on an online examination. Some of 
these concerns have been highlighted in previous work [18, 19, 2, 
4, 13, 3], and our analysis reflects students’ opinions on this topic. 
Second, unpunished cheating lowers students’ morale and motiva- 
tion. Students report feeling demotivated when classmates cheat 
and obtain high grades. Third, students report not knowing exactly
what constitutes cheating and what is allowed, underscoring the importance
of clear academic integrity policies. These concerns were
often reported in the context of online examinations, with students
unsure of how their actions are being monitored.
APPENDIX


Table 4: Words with the most positive and most negative logistic regression coefficients


Term        Coefficient | Term           Coefficient
chegg       2.19        | sheet          -3
online      1.79        | cheat sheet    -2.95
proctor     1.79        | code           -1.87
open        1.62        | project        -1.68
covid       1.55        | plagiarism     -1.51
zoom        1.45        | phone          -1.47
prof        1.37        | plagiarize     -1.32
pandemic    1.25        | relationship   -1.31
proctorio   1.11        | sit            -1.1
flag        1.09        | talk           -1.02
cheat       1.08        | sexual         -0.98
chat        1.06        | notice         -0.94
camera      1.03        | bring          -0.93
internet    1           | textbook       -0.93
privacy     1           | international  -0.92
book        1           | misconduct     -0.78
cheater     0.98        | appeal         -0.78
webcam      0.95        | program        -0.79
100         0.93        | go             -0.79
format      0.92        | front          -0.81
screen      0.9         | report         -0.81
open book   0.89        | try cheat      -0.81
sem         0.88        | ask            -0.81
record      0.88        | homework       -0.82
math        0.88        | dean           -0.82
term        0.87        | practice       -0.83
average     0.86        | allow          -0.88
respondus   0.85        | partner        -0.88
email       0.83        | final          -0.89
semester    0.83        | english        -0.9
ABSTRACT


Co-operative education is a form of work-integrated learning that 
includes academic study and paid work experience. This provides 
new learning opportunities for students and a talent pipeline for 
employers, but also requires participation in a competitive job mar- 
ket. We study competition through a unique dataset from a large 
North American co-operative program, in which students and em- 
ployers rank each other after a round of interviews, then a match- 
ing algorithm assigns students to jobs based on the ranks, and fi- 
nally students and employers evaluate each other at the end of the 
workterm. Our results reveal insights about competition and its 
impact on decision-making and satisfaction. An analysis of com- 
mon ranking patterns suggests that small employers appear to be 
more strongly affected by competition and consider more options 
in their rankings, whereas large employers often do not provide 
any backup options and only identify their top choice. Addition- 
ally, competition appears to affect satisfaction since employers give 
higher workterm evaluations when matched with their top choice. 


Keywords 


co-operative education, work-integrated learning, ranking 


1. INTRODUCTION 


Co-operative (co-op) education is a form of work-integrated learning
that includes both academic study terms and paid work experience,
referred to as co-op work placements, workterms or internships.
Prior work has examined the benefits of co-op, such as new
learning opportunities for students and a talent pipeline for employers
[13]. However, recent work has also reported that the competition
related to interviewing for and securing co-op placements is
a source of stress for students [10]. Motivated by these findings,
in this paper we take a closer look at competition in co-operative
education.


Our study is based on a unique dataset from a large North Ameri- 
can undergraduate co-operative program. In this program, the co- 
op employment process proceeds as follows. Employers post job 
advertisements, students submit applications, and employers select 
students they wish to interview. After a round of interviews, students
and employers rank each other. A matching algorithm then
assigns students to jobs based on the ranks, with the goal of minimizing
the sum of the student and employer ranks. For example,
if the employer offering job A ranks student B one and vice versa,
then the algorithm is guaranteed to assign job A to student B. In
some cases, however, students and employers may be matched with
their second or third choices, or not be matched at all. Finally, students
and employers evaluate each other at the end of the workterm.


One way to characterize competition in such a process is to iden- 
tify job postings that receive the most applications. However, even 
entry-level or less desirable job postings may receive many applica- 
tions, mainly from junior students. Instead, we turn to the ranking 
step of the process as a novel way to characterize competition. We 
investigate the following questions: 


1. Do employers use different ranking strategies that reflect the 
level of competition they face? For example, an employer 
who is confident in their ability to attract top students may 
rank their preferred student one and not rank any other stu- 
dents as backup options. On the other hand, a less confident 
employer may rank multiple students. 


2. Does competition appear to affect satisfaction? Are em- 
ployers happier if they are matched with their top-ranked 
choices? 


To answer these questions, we analyze ranking and workterm eval- 
uation data from over 4,500 employers participating in the job 
matching process in three semesters, from September 2015 to Au- 
gust 2016. We answer the first question by mining frequent ranking 
patterns and identifying representative attributes of employers that 
use these patterns. To answer the second question, we compare the 
average employer evaluation scores when matched with their first 
choice versus a backup choice. 


Related Work: Labour market competition has been studied from
several angles, including improving talent recruitment by recommending
resumes to job postings [16] and reducing turnover
by assessing personnel fit when making hiring decisions [2, 6].
Further, it was found that job seekers’ perceptions of hiring success,
informed by their past job search success and prior knowledge of
the company, motivate their decision to apply for a job and affect
their decision to accept a job offer [12]. In co-operative education,
there has been work on student and employer satisfaction
[8, 5, 4], as well as on clustering job opportunities, suggesting that
junior students compete with each other for entry-level jobs and
senior students compete with each other for more advanced positions
[14]. To the best of our knowledge, this is the first work
to characterize competition based on employer rankings in a co-op
process. Access to this unique data allows us to draw new insights
into competition in co-operative education that can help manage
students’ and employers’ expectations and improve their satisfaction.


2. DATA AND METHODS 


2.1 Co-operative Process Overview 

We begin with an overview of the co-op process at the institution 
studied in this paper. Initially, participating employers submit job 
descriptions, and any student (enrolled in a co-op program) may 
apply to any job. Next, employers interview selected candidates 
and rank them. A rank of zero, referred to as a “No Rank”, means
that the employer is not willing to hire the student. A rank of one,
referred to as an “Offer”, indicates that the employer wishes to hire
the student. Ranks two to nine, referred to as “Ranks”, represent
the employer’s backup or shortlist options, in order of preference. 
In other words, the employer would consider hiring these students 
if the top-ranked student declines the offer. In the remainder of 
this paper, we use the terms “shortlisted” and “received a Rank” 
interchangeably. Ranks do not need to be distinct, e.g., an employer 
may put five students on the backup list and give all of them a rank 
of two. After employers have submitted their rankings, students 
rank employers that made them offers or shortlisted them, between 
one and nine, indicating their order of preference. 


The co-op matching system then removes student-employer rank
pairs that add to zero (i.e., No Ranks) and applies a matching algorithm
to assign students to jobs. The objective of the algorithm
is to minimize the sum of the ranks of the resulting student-job assignment.
Note that the lowest sum of ranks is two, and occurs
when an employer offers a job to a student and the student gives a
rank of one to this job. In this case, the student is guaranteed to be
matched with this job.¹ In other cases, students or employers may
be matched with their second, third, or lower choice, or may not
be matched at all. Finally, at the end of a workterm, students and
employers who were matched with each other evaluate each other.
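The institution's matching algorithm is not specified beyond its objective, so the following is only a simplified greedy heuristic that is consistent with the description: pairs are considered in ascending order of rank sum, so a mutual first choice (sum of two) is matched before anything else. It assumes one opening per job and breaks ties alphabetically rather than randomly; both choices, like the sample rankings, are hypothetical.

```python
def greedy_match(pairs):
    """Greedily match jobs and students in ascending order of rank sum.

    pairs maps (job, student) to (employer_rank, student_rank). A pair
    whose ranks add to zero is a No Rank and is dropped first.
    Assumes one opening per job (not stated in the paper).
    """
    ranked = {key: ranks for key, ranks in pairs.items() if sum(ranks) > 0}
    # Lowest rank sum first; ties are broken alphabetically (instead of
    # randomly) so that the example is reproducible.
    ordered = sorted(ranked, key=lambda key: (sum(ranked[key]), key))
    matching, taken_students = {}, set()
    for job, student in ordered:
        if job in matching or student in taken_students:
            continue
        matching[job] = student
        taken_students.add(student)
    return matching


# Hypothetical rankings: (employer's rank of student, student's rank of job).
pairs = {
    ("A", "B"): (1, 1),  # mutual first choice, rank sum 2
    ("A", "D"): (2, 3),
    ("C", "B"): (1, 2),
    ("C", "D"): (2, 1),
    ("C", "E"): (0, 0),  # No Rank: employer C is not willing to hire E
}
matching = greedy_match(pairs)
```

Here job A is matched with student B, so employer C, who also made B an Offer, falls back to its shortlisted student D. A globally optimal assignment could differ from this greedy result on larger inputs.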


2.2 Data 


We analyzed one year of data, from September 2015 to August 
2016, corresponding to 4,851 co-op job postings for students en- 
rolled in co-op engineering programs: 


• Job Postings, containing a job ID, job title, and employer
name.


• Employer Rankings, containing a job ID and the distribution
of ranks. Figure 1 shows an example with five employers,
one per row. The first row indicates that employer
(whose job ID is) E1 gave two ranks of zero (#R0) and no
other ranks, i.e., E1 interviewed two students and was not
willing to hire either of them. The second row indicates that
E2 interviewed two students, rejected one (#R0), and put one
on the shortlist with a rank of two (#R2), and so on.


• Employer Evaluations, containing a job ID, the rank the
employer gave to the student who was hired, and the employer’s
evaluation of the student (on a 7-point scale: unsatisfactory,
marginal, satisfactory, good, very good, excellent,
outstanding).


¹If a student were to give a rank of one to multiple Offers, the
algorithm would randomly select one of these Offers.


      #R0  #R1  #R2  #R3  #R4  #R5  #R6  #R7  #R8  #R9
E1     2    0    0    0    0    0    0    0    0    0
E2     1    0    1    0    0    0    0    0    0    0
E3     4    2    0    0    0    0    0    0    0    0
E4     3    1    1    0    0    0    0    0    0    0
E5     1    1    1    1    1    1    0    0    0    0


Figure 1: Sample of employer ranking data


[Flow diagram: Ranking Data -> (1. Identify frequent ranking patterns) -> Frequent Ranking Patterns -> (2. Group similar ranking patterns) -> Ranking Strategies -> (3. Inspect groups)]


Figure 2: Summary of methods
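The Employer Rankings records can be turned into the set of distinct ranks each employer used, which is the unit of analysis in the methods that follow. A minimal sketch, with the counts transcribed from the sample in Figure 1 and hypothetical variable and function names:

```python
# Counts of ranks 0..9 given by each employer, as in the Figure 1 sample.
rank_counts = {
    "E1": [2, 0, 0, 0, 0, 0, 0, 0, 0, 0],
    "E2": [1, 0, 1, 0, 0, 0, 0, 0, 0, 0],
    "E3": [4, 2, 0, 0, 0, 0, 0, 0, 0, 0],
    "E4": [3, 1, 1, 0, 0, 0, 0, 0, 0, 0],
    "E5": [1, 1, 1, 1, 1, 1, 0, 0, 0, 0],
}


def rank_set(counts):
    """Return the set of distinct ranks (0-9) that an employer used."""
    return frozenset(rank for rank, count in enumerate(counts) if count > 0)


rank_sets = {job: rank_set(counts) for job, counts in rank_counts.items()}
interviewed = {job: sum(counts) for job, counts in rank_counts.items()}
```

For example, E1 maps to the set {0} (only No Ranks), while E4 maps to {0, 1, 2}: some rejections, an Offer, and a shortlist.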


2.3 Methods 


Given that the matching algorithm is designed to minimize the sum 
of the ranks of the student-job assignments, employers may use dif- 
ferent ranking strategies depending on the perceived level of com- 
petition. For example, employers may extend one or more offers 
but not shortlist any students if they are confident that their offer(s) 
will be accepted (i.e., that those students will reciprocate with a stu- 
dent rank of one). On the other hand, less confident employers may 
shortlist multiple students, and, to maximize their chances of hir- 
ing someone, they may give a rank of two to all shortlisted students 
instead of ranking them in order of preference. 


The goal of this paper is to identify these kinds of ranking strategies 
and use them to describe the level of competition faced by different 
groups of employers. Our methodology, consisting of three steps, 
is summarized in Figure 2 and explained below.


1. Identify frequent ranking patterns: For employers, we 
identify commonly used sets of ranks. For example, an em- 
ployer set of ranks of {0, 1} corresponds to employers who 
give only No Ranks (0) and Offers (1), and do not shortlist 
any students (ranks 2-9). 


2. Group similar ranking patterns: Informed by the previous 
step and by the nature of the matching process, we group 
together similar sets of ranks. We refer to these as ranking 
strategies. For example, we may group employer rank sets 
of {0,1,2}, {0,1,2,3} and so on and label these as employ- 
ers who make a shortlist (in addition to making some offers 
and rejecting some students). This step partitions employers 
according to their ranking strategies. 


3. Inspect groups: We compare groups of employers with dif- 
ferent ranking strategies based on their a) characteristics and, 
b) consequences on matching and evaluation. To identify dif- 
ferences among employers who use different ranking strate- 
gies, we inspect employer names and job titles. To under- 
stand the consequences of ranking strategies on matching 
[Bar chart: x-axis Employer Rank (0-9), y-axis Percentage of Ranks]


Figure 3: Distribution of employer ranks 


Table 1: Most frequent sets of ranks given by employers


Set of Ranks        %

{0, 1, 2}          24
{0, 1}             19
{0, 1, 2, 3}       14
{1}                 8
{0}                 5
{1, 2, 3}           4
{1, 2}              4
{0, 1, 2, 3, 4}     4
{0, 2}              2
{1, 2, 3, 4}        2


and evaluation, for each group of employers who use a given 
ranking strategy, we calculate, a) the percentage who were 
not matched (represented as %NoMatch), b) the percentage 
who were matched with their first choice (represented as 
%MatchR1), and c) the percentage who were matched with 
their >1 choice (represented as %MatchR>1). Finally, we re- 
port the average evaluation scores that the employers gave to 
the students they matched with at the end of the workterm. 
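
The three steps above can be sketched in a few lines of Python. This is only
an illustrative sketch, not the code used in the study: the toy data, the
function names, and the input layout (a list of ranks per employer, and the
rank of the matched student or None) are assumptions. The strategy labels
follow the groups reported in Table 2.

```python
from collections import Counter

# Hypothetical toy data: employer id -> ranks given to interviewed students
# (0 = No Rank, 1 = Offer, 2-9 = shortlist). Not the paper's dataset.
rankings = {
    "E1": [0, 0],
    "E2": [0, 2],
    "E3": [0, 0, 0, 0, 1, 1],
    "E4": [0, 0, 0, 1, 2],
    "E5": [0, 1, 2, 3, 4, 5],
    "E6": [0, 1, 1],
}


def frequent_rank_sets(rankings):
    """Step 1: percentage of employers using each distinct set of ranks."""
    counts = Counter(frozenset(ranks) for ranks in rankings.values())
    total = len(rankings)
    return [(sorted(s), 100.0 * n / total) for s, n in counts.most_common()]


def strategy(ranks):
    """Step 2: map an employer's ranks to one of the five strategy labels."""
    used = set(ranks)
    has_offer = 1 in used
    shortlist = {r for r in used if r >= 2}
    if not has_offer and not shortlist:
        return "No Offer/s or Rank/s"
    if not has_offer:
        return "Only Rank/s"
    if not shortlist:
        return "Only Offer/s"
    if max(shortlist) <= 3:  # Top Ranks are ranks of two or three
        return "Offer/s & Top Rank/s"
    return "Offer/s & Other Rank/s"


def match_summary(matched_ranks):
    """Step 3: (%NoMatch, %MatchR1, %MatchR>1) for one group of employers.

    matched_ranks holds, per employer, the rank the employer gave to the
    student they were matched with, or None if there was no match.
    """
    n = len(matched_ranks)
    no_match = sum(r is None for r in matched_ranks)
    first = sum(r == 1 for r in matched_ranks)
    backup = n - no_match - first
    return tuple(100.0 * x / n for x in (no_match, first, backup))


for employer, ranks in rankings.items():
    print(employer, sorted(set(ranks)), strategy(ranks))
```

Treating each employer's ranks as a set deliberately ignores how many
students received each rank, which matches the "sets of ranks" described in
Step 1.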


3. RESULTS 


Section 3.1 analyzes the rankings given by 4,851 employers to
identify frequent ranking patterns (Step 1 of Figure 2), group them
into ranking strategies (Step 2 of Figure 2), and distinguish between
employers with different ranking strategies (Step 3 of Figure 2).
Section 3.2 analyzes the effects of ranking strategies on matching
and satisfaction (Step 3 of Figure 2).


3.1 Employer Ranking Strategies 

Figure 3 shows the distribution of ranks given by employers to
students they have interviewed. Recall that rank 0 or “No Rank”
indicates that the student was interviewed but not considered for the
job, rank 1 represents an offer, and ranks 2-9 represent employers’
shortlists in order of preference. As seen in Figure 3, nearly half
the ranks are zero, a quarter are offers, and ranks worse than three
(i.e., four through nine) are rare.


Next, Table 1 shows the most frequent sets of ranks given by
employers. Many employers reject at least one student (rank 0), make
at least one offer (rank 1), and shortlist at least one student, usually 
with ranks of 2 and/or 3. 19% of employers make offers without 
shortlisting anyone (second row: {0,1}). 


Using Figure 3 and Table 1, we group employers with similar ranking
patterns (Step 2 of Figure 2). Table 2 summarizes the groups.
The first column, Label, describes each group. For example, the 
first group corresponds to employers that do not make any offers 
and do not shortlist (Rank) any students — that is, they only give 
zero ranks, meaning that they are not willing to hire any students 
they interviewed. The second and third columns indicate whether 
the employers in the given group gave any Offers and Ranks, re- 
spectively (we define Top Ranks to be ranks of two or three). The 
next column shows the percentage of employers assigned to each 
group (e.g., the first row indicates that 5% of employers did not 
give any Ranks or Offers). The next column reports the percent- 
age of employers that were not matched with any students by the 
algorithm, labelled “%NoMatch”; by construction, employers who did not
give any ranks or offers have a no-match rate of 100%. The next
column, “%MatchR1”, shows the percentage of employers that were
matched with their first choice and the average evaluation score the 
employers gave to these students (higher is better). Finally, the last 
column, “%MatchR>1”, shows the percentage of employers that 
were matched with a student who was not their first choice and the 
average evaluation score the employers gave to those students. We 
will discuss the percentages further in Section 3.2.


Note that the sum of the percentages reported in the last three
columns (“%NoMatch” plus “%MatchR1” plus “%MatchR>1”) is 100 for
each row, up to rounding. In other words, each employer has one of
three outcomes: no match, a match with their first choice, or a match
with a student who was not their first choice.


To characterize employers with different ranking strategies, we in- 
spected their names and job titles (Step 3 of Figure 2). We found 
that employers who gave: 


• “No Offer/s or Rank/s” (first row of Table 2) consisted of
companies of all sizes and industries, mainly offering “analyst”
and “assistant” positions.


• “Only Rank/s” (second row) were mainly business units of
the institution, and mostly offered “analyst”, “support”, and
“intern” positions.


• “Only Offer/s” (third row) consisted of large well-known
technology and manufacturing companies, offering “software
developer” and “design” positions.


• “Offer/s and Top Rank/s” (fourth row) consisted of (a)
medium-sized companies offering positions in “software
development” and “data science”, (b) large companies with
positions such as “application development”, “UI designer”,
“quality assurance”, and “process improvement”, and (c)
companies with specialized jobs in the fields of electrical
engineering, hardware, medical engineering, banking, etc.


• “Offer/s and Other Rank/s” (fifth row) consisted of small
to medium-sized companies with job titles including “quality
assurance”, “software testing”, “support technician”, and
“systems administrator”.


3.2 Consequences of Employer Ranking Strategies


This section analyzes how ranking strategies used by employers
affect their chances of finding a match and whether employers with
different ranking strategies evaluate their matches differently at the
end of the workterm. To provide context, we start by reporting the

Table 2: Employer ranking strategies


Label                   Offer/s  Rank/s             %   %NoMatch  %MatchR1     %MatchR>1
                                                                  (Avg(Eval))  (Avg(Eval))

No Offer/s or Rank/s    No       No                 5   100       -            -
Only Rank/s             No       Yes                9   81        -            19 (5.9)
Only Offer/s            Yes      No                 27  48        52 (6.1)     -
Offer/s & Top Rank/s    Yes      Yes (all r ≤ 3)    48  31        46 (6.1)     24 (5.9)
Offer/s & Other Rank/s  Yes      Yes (some r > 3)   11  20        48 (6.2)     32 (5.9)


matching percentage and evaluation scores averaged across all the 
employers who participate in the ranking process (i.e., those who 
give at least one non-zero rank). 


Overall, 39% of employers who participate in the ranking process 
do not find a matching student. Out of the 61% who find a match, 
75% match with their first choice (i.e., with a student to whom they 
gave an Offer), and 25% match with their >1 choice. On average, 
employers who match with their first choice evaluate their students 
slightly higher (6.1) than those who match with their >1 choice 
(5.9). This difference is statistically significant at the 0.05 level.
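
As a consistency check, the overall no-match rate can be recovered from
Table 2 as a weighted average over the four strategies that involve at least
one non-zero rank. The sketch below assumes that the "%" column of Table 2
is each group's share of all employers; the function name is hypothetical.

```python
# (share of all employers in %, %NoMatch) per strategy, copied from Table 2.
# The "No Offer/s or Rank/s" group is excluded because those employers do
# not participate in the ranking process.
groups = {
    "Only Rank/s": (9, 81),
    "Only Offer/s": (27, 48),
    "Offer/s & Top Rank/s": (48, 31),
    "Offer/s & Other Rank/s": (11, 20),
}


def pooled_rate(groups):
    """Average the group rates, weighting each group by its share."""
    total_share = sum(share for share, _ in groups.values())
    weighted = sum(share * rate for share, rate in groups.values())
    return weighted / total_share


print(round(pooled_rate(groups), 1))  # 39.3
```

The result, about 39%, agrees with the overall no-match rate reported above.
Because the table entries are rounded to whole percentages, agreement to
within a point is the most that this check can show.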


Next, we analyze the consequences of employer ranking strategies
on matching and evaluation. Recall that Table 2 shows the percentage
of employers with different ranking strategies who do not find
a match, match with their first choice (i.e., with a student to whom 
they gave an Offer), and match with their >1 choice. For each rank- 
ing strategy, we also show the average evaluation scores given by 
employers to their students. Among employers who make offers, 
those who provide more backup options (Offers and Ranks) have a 
higher matching rate, with a greater proportion matching with their 
backup choice. Additionally, one-fifth of the employers who “Only 
Rank" (i.e., do not make any offers and only use ranks of two and 
above) find a match. 


On average, regardless of the ranking strategy used, employers who 
match with their first choice evaluate their students similarly and 
so do employers who match with their >1 choice. In addition, ir- 
respective of the ranking strategy used, employers who match with 
their first choice evaluate their students slightly higher than those 
who matched with their >1 choice. Therefore, while employer 
ranking strategy affects their chances of finding a match, employ- 
ers with different ranking strategies do not evaluate their students 
differently. 


4. DISCUSSION AND CONCLUSIONS 


In this paper, we proposed a new way of characterizing competition 
in a co-operative job market by studying how employers rank stu- 
dents after a round of interviews. Based on a dataset from a large 
co-operative education program, we identified ranking strategies, 
studied the characteristics of employers who use different strate- 
gies, and analyzed the effects of ranking strategies on matching 
and workterm evaluation. Our main findings are as follows. 


Ranking strategies characterize competition: Ranking strategies 
may be used to characterize the extent of competition in the co- 
op job market; employers appear to be aware of the competition 
they face and rank accordingly. 


Large employers are less likely to provide backup options. Thus, 
it appears that these employers are confident in their ability to hire 
their top choices, and if their top choices decline the offers, these 


employers are willing to risk not hiring any students from this uni- 
versity. Small to medium companies, especially those offering 
entry-level positions, are more likely to provide backup options, 
perhaps as a consequence of perceived competition for their top 
choices. Therefore, an employer’s popularity and the quality of the
jobs they offer appear to be correlated with the ranking strategy they
use. Similar observations have been made in many competitive
environments, including supply chains and legal contracting, where parties
with more bargaining power leverage their reputation when negoti- 


ating with others P| (75). 


Rank of match affects satisfaction: Regardless of the ranking strat- 
egy, employers who match with their first choice evaluate their co- 
op students slightly higher on average than those who match with 
their backup choices. In other words, satisfaction only seems to de- 
pend on the rank of the match and not on the strategy used to obtain 
the match. 


These results should be interpreted carefully since they are based 
on data from a single institution. However, the methodology we 
presented in this paper may be used by others to reflect the extent 
of competition in their institutions through ranking patterns. In ad- 
dition, this study is limited to identifying frequent patterns in the 
data, but not cause-and-effect relationships. This provides a start- 
ing point for further study: interviewing students and employers 
about the competition they face in the co-op market is an interest- 
ing direction for future work. Nevertheless, we believe that our 
findings will be of interest to students, employers and the institu- 
tion. We provide several examples of actionable insights below. 


1. Our results can help students understand how co-op employ- 
ers rank their options in various situations. This may inform 
students’ strategies and decision-making during applications 
and ranking, in turn, increasing their chances of finding a 
suitable co-op job. 


2. Our results can inform new employers about the extent of 
competition in the co-op market, which in turn can help them 
decide how to rank their options given the competition they 
are likely to face. 


3. Our findings indicate that some employers are confident in 
their ability to hire their top choices, indicating that such jobs 
are highly sought after by students. The institution may con- 
sider recruiting more such employers. On the other hand, 
the institution may recommend smaller employers to less- 
experienced students to increase their chances of finding a 
match. 


4. Our findings suggest that employers who match with a 
backup choice are less satisfied with their co-op students. 
This suggests a need for methods to help manage the expec- 
tations of employers and students in this situation. 
ABSTRACT


High-stakes digital-first assessments are assessments that 
can be taken anytime and anywhere in the world and their 
scores impact test takers’ lives. Computational psychomet- 
rics, a blend of theory-driven psychometrics and data-driven 
algorithms, provides the theoretical underpinnings for these 
data-rich assessments. The unprecedented flexibility,
complexity, and high-stakes nature of these digital-first
assessments pose enormous quality assurance challenges. In
order to ensure these assessments meet both “the contest and
the measurement” requirements of high-stakes tests [5], it 
is necessary to conduct continuous pattern monitoring and 
be able to promptly react when needed. In this paper, we 
illustrate the development of a quality assurance system, 
Analytics for Quality Assurance in Assessment (AQuAA), 
for a high-stakes and digital-first assessment. To build the 
system, educational data from continuous administrations 
of the assessments are mined, modeled and monitored via 
an interactive dashboard. 


Keywords 
high-stakes assessment, digital-first assessment, quality as- 
surance 


1. INTRODUCTION 


Digital-first assessments are based on artificial intelligence 
(AI) tools that direct and optimize test-takers’ experience. 
These digital tools include automatic systems for test de- 
velopment, scoring, and test delivery. In contrast to tradi- 
tional large-scale assessments that are based on in-person 
administration to large groups of test takers in fixed loca- 
tions, digital-first assessments are administered continuously 
to individual test takers, thus allowing for unprecedented 
flexibility. The advantages of the digital-first assessments 
have manifested themselves during the pandemic when tra- 
ditional group assessments in brick-and-mortar test centers 
became impractical. 
When digital-first assessments are used for high-stakes
purposes (for example, for admissions or employment purposes),
they, like any traditional high-stakes assessment, have a
significant potential impact on test takers’ lives. Thus,
digital-first high-stakes assessments also need to meet both
“the contest and the measurement” requirements of high-stakes
tests [5], where the “contest” refers to the expectation that
the test gives everyone a fair chance, and the “measurement”
refers to the requirement that the test is accurate and valid.


Quality assurance refers to a systematic process to maintain
the high quality of the test and assessment scores and to
prevent errors at all stages of the test, including test design,
item design and development, test scoring, test analysis, and
score reporting [7]. Its complement, quality control, refers
to a set of methods and statistics to evaluate the quality 
of the test. Many of the statistics and methods employed 
for quality assurance and quality control are similar, with 
quality control being part of the quality assurance overarch- 
ing system. The International Test Commission Guidelines 
have articulated step-by-step procedures for quality control 
of general educational assessments but many of the steps are 
more applicable to traditional assessments, that is, “large- 
scale testing operations where multiple forms of tests are 
created for use on set dates.”[7] 


Since digital-first assessments differ from traditional assess- 
ments in many respects (e.g., administration frequency, item 
bank size), it is necessary to develop quality assurance pro- 
cedures that are tailored for digital-first assessments. Devel- 
oping such systems also requires research into the appropri- 
ate methodology to identify the most relevant statistics to 
be monitored for this new type of assessment, which is the
focus of this paper. 


In order to conduct quality assurance for digital-first high- 
stakes assessments, we developed a monitoring system named 
Analytics for Quality Assurance in Assessment (AQuAA), 
which is a blend of psychometrics and educational data min- 
ing packed into a dynamic and interactive dashboard-based 
system. AQuAA was designed to accommodate at least 
two unique characteristics of the digital-first assessments. 
On one hand, many key aspects of digital-first assessments, 
such as item generation and scoring, are automatically ac- 
complished by machine. Therefore, compared to traditional 
assessments, the quality assurance of the digital-first as- 
sessments requires more extensive data mining techniques. 
Computational psychometrics [18, 17] is leveraged to mine
and model educational data in order to develop the statis- 
tics included in AQuAA. On the other hand, as a conse- 
quence of the continuous nature of administration, the qual- 
ity assurance activities for digital-first assessments need to 
be conducted more frequently with a flexible timeline. In 
addition, tools that facilitate swift and efficient communica- 
tion are indispensable so that prompt actions can be taken 
when issues are detected. In AQuAA, a variety of statis- 
tics are updated regularly and are integrated into an in- 
teractive dashboard for continuous pattern monitoring and 
timely communication purposes. AQuAA is also symbiotic 
with other activities (such as item development) given the 
fact that conclusions drawn from AQuAA could be used to 
direct the maintenance and improvement of the assessment. 


This paper elaborates the development of AQuAA and aims 
to address three research questions: 1) What statistics should 
be used as indicators of test quality and score validity of 
digital-first assessments? 2) How to identify patterns and 
irregularities relevant to test quality of digital-first assess- 
ments? and 3) How to communicate the findings from the 
quality assurance process to stakeholders? This paper focuses
on the quality assurance of the test administration
activities.


2. RELATED WORK 


Quality assurance plays an important role in maintaining 
test score validity. [1] indicated that mistakes that jeopar- 
dize the assessment score validity could occur at all stages 
of assessment development and administration and that the 
mistakes could accumulate since many stages are contingent 
on previous stages. Therefore, quality control guidelines 
and step-by-step procedures [1, 2, 7] have been developed 
to help test developers identify possible mistakes as well as 
the causes of these mistakes, thereby helping them to iden- 
tify solutions to fix the mistakes and prevent the mistakes 
from happening again. 


Quality control procedures were mostly designed for tradi- 
tional large-scale assessments that are administered in only 
a few test dates and have large test volumes in each adminis- 
tration [7, 1], with [2] being an exception. [2] recommended 
a quality control procedure for continuous mode tests (i.e., 
tests that are administered to small groups of test takers on 
many test dates) which share some similarities with digital- 
first assessments. Moreover, [2] have demonstrated an auto- 
mated quality control system for continuous mode tests and 
the system consists of both an automatic part and a human 
review part. These two parts also apply to the quality as- 
surance of digital-first assessments. In the automatic part, a 
number of steps that need to be conducted recurrently and 
can be implemented programmatically are packed into an 
automatic procedure with the use of digital tools. Steps in 
such an automatic procedure may include fetching the data 
from the database, conducting a variety of quality control 
analyses (see [9] for a review of quality control methods) and 
generating statistical reports. In the human review part, 
human experts are trained to review the statistical reports 
generated from the automatic procedure in order to identify 
potential irregularities or outliers, and determine whether or 
what actions need to be taken to handle these irregularities. 


The foundation of an automated quality assurance proce- 
dure consists of a wide range of data mining and data visu- 
alization techniques. In the realm of quality assurance, the 
data mining and data visualization techniques serve two ma- 
jor purposes: First, to describe the trends and seasonal pat- 
terns of the assessment statistics; Second, to detect abrupt 
changes in the relevant assessment statistics. [9] have sum- 
marized a number of statistical methods and data visualiza- 
tion techniques for score quality assurance purposes. Vari- 
ous time series techniques can be chosen to describe trends 
or seasonal patterns, which include linear ANOVA models 
[4], regression with autoregressive moving-average [10], har- 
monic regressions [8] and dynamic linear models [19]. The 
Shewhart chart is a useful data visualization tool for continuous
monitoring of the test score characteristics [9, 12, 14]. In terms of
detecting abrupt changes in the assessment statistics, some 
model-based approaches have been applied to mine the data 
and identify abrupt changes in score time series, such as 
change-point models and hidden Markov models [9]. A data
visualization technique for detecting abrupt changes is the
cumulative sum (CUSUM) chart [13].
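
To make the two chart types concrete, the following sketch computes
Shewhart control limits from a baseline period and runs a standard
two-sided tabular CUSUM on new observations. It is a generic textbook
formulation, not the procedure of any cited work; the function names, the
3-sigma limits, and the CUSUM parameters k = 0.5 and h = 5 (both in
standard-deviation units) are illustrative assumptions.

```python
from statistics import mean, stdev


def shewhart_limits(baseline, n_sigma=3.0):
    """Return (lower, center, upper) control limits from baseline values."""
    center = mean(baseline)
    spread = stdev(baseline)
    return center - n_sigma * spread, center, center + n_sigma * spread


def shewhart_flags(values, lower, upper):
    """Indices of observations that fall outside the control limits."""
    return [i for i, v in enumerate(values) if v < lower or v > upper]


def cusum_flags(values, target, sigma, k=0.5, h=5.0):
    """Indices where the upper or lower cumulative sum exceeds h.

    k is the allowance (slack) and h the decision threshold, both
    expressed in units of sigma.
    """
    upper_sum = lower_sum = 0.0
    flagged = []
    for i, v in enumerate(values):
        z = (v - target) / sigma
        upper_sum = max(0.0, upper_sum + z - k)
        lower_sum = max(0.0, lower_sum - z - k)
        if upper_sum > h or lower_sum > h:
            flagged.append(i)
    return flagged


# Hypothetical daily mean scores: a stable baseline, then a small
# sustained upward shift followed by one large jump.
baseline = [100, 101, 99, 100, 102, 98, 100, 101, 99, 100]
monitored = [101.5] * 10 + [105.0]

lower, center, upper = shewhart_limits(baseline)
print(shewhart_flags(monitored, lower, upper))
print(cusum_flags(monitored, center, stdev(baseline)))
```

In this toy series the Shewhart limits only catch the final jump, while the
CUSUM starts signalling during the small sustained shift, which is the
complementary behaviour that motivates using both.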


The products of the automated quality assurance proce- 
dure may include summary tables of the statistics, graphs 
and statistical testing results [2]. These statistical products 
could be organized into different formats, such as reports [2] 
and dashboards [11]. Since the products of the automated 
quality assurance procedure will serve as the starting point 
of the human review process [2], the choice of organizing 
format should be determined by the ease of communication 
to the targeted stakeholders. 


3. MAJOR COMPONENTS OF AQUAA 


This section illustrates how several key components of AQuAA 
address the research questions mentioned above. AQuAA 
has been launched as a minimum viable product (MVP) 
and additional features and statistics are being added to 
the system. This paper demonstrates the application of 
AQuAA to the Duolingo English Test, a digital-first
assessment. In order to help readers understand the context in
which AQuAA was developed, this section will start with
a brief overview of the Duolingo English Test. However, the 
methodologies for designing AQuAA and the statistics con- 
sidered for evaluation are intended to be adaptable to other 
digital-first assessments. 


3.1 Overview of the Assessment 

The Duolingo English Test is a high-stakes computerized 
adaptive test that is designed to be accessible anywhere and 
anytime [15]. Thus, it also falls under the category of con- 
tinuous mode assessments [2]. The Duolingo English Test 
is an adaptive test, with a very large item bank that has 
been designed by subject matter experts (SMEs) and pro- 
duced automatically by the machine. The items are reviewed 
by panels of SMEs to ensure quality and cultural fit. The 
items are scored automatically and the scoring methods are 
reviewed periodically by SMEs. Each individual test is
proctored remotely using a complex and innovative asynchronous
system that involves both AI-based tools and human
proctors. Discrepancies or unusual situations are adjudicated
by SMEs. Test results are reviewed through the quality as- 
surance process in AQuAA. As part of this process, a wide 
range of process information related to test takers’ behavior 
[Flowchart: Import data from database → Check and clean data →
Track metrics and statistics → Identify patterns and irregularities →
Communicate results]
Figure 1: AQuAA updating procedure


(e.g., time per item response, length of responses, etc.) is 
analyzed and monitored for quality assurance. The amount 
of data and the multiple sources and types of data are sig- 
nificantly more demanding of sophisticated analytics than is 
the case in more traditional assessments. 


3.2 Overview of AQuAA

An overview of the procedure of developing and updating 
AQuAA is shown in Figure 1. Except for the first step (i.e., 
importing the data) that is relatively straightforward, the 
design of each step requires deliberation and, thus, is elab- 
orated in the following sections. The steps in Figure 1 are 
scheduled to be automatically implemented on a daily basis 
(and in some cases more frequently). R [16] is the major 
programming tool used to develop AQuAA and automate 
the AQuAA updating process. 


3.3 Checking and Cleaning Data

In general, the assessment data used for AQuAA can be sep- 
arated into two types: Person-level data and item-response- 
level data. Person-level data contain variables that describe 
the overall person/session information, such as test takers’ 
overall test score, sub-scores, test dates, and background 
characteristics. Item-response-level data contain variables
that delineate information about each item the test taker
responded to, such as item IDs, item difficulty levels, item
responses and item scores, and other process information
such as the time test takers spent on each item.


After the data are imported, the integrity of the data is 
inspected to ensure that the data used for subsequent anal- 
yses are accurate and of high quality. For example, data are 
inspected for irregular values (e.g., negative values in time 
duration variables), and the causes of any such values are 
further investigated to identify any potential threats to the 
integrity of the data collection process. 
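
A check of this kind can be sketched as follows. The field names
(duration_sec, item_score) and the assumed 0-1 range for item scores are
hypothetical placeholders, not the actual AQuAA schema; the point is only
that each irregular row is reported with a reason so that its cause can be
investigated.

```python
def find_irregular_rows(rows):
    """Return (row index, list of issues) for rows that fail basic checks.

    rows is a list of dicts holding item-response-level records. The keys
    and valid ranges used here are assumptions for illustration.
    """
    problems = []
    for idx, row in enumerate(rows):
        issues = []
        duration = row.get("duration_sec")
        if duration is None:
            issues.append("missing duration")
        elif duration < 0:
            issues.append("negative duration")
        score = row.get("item_score")
        if score is None:
            issues.append("missing item score")
        elif not 0.0 <= score <= 1.0:
            issues.append("item score out of range")
        if issues:
            problems.append((idx, issues))
    return problems


sample = [
    {"duration_sec": 12.5, "item_score": 0.8},
    {"duration_sec": -3.0, "item_score": 0.5},
    {"duration_sec": 7.0, "item_score": 1.4},
    {"item_score": 0.2},
]
for idx, issues in find_irregular_rows(sample):
    print(idx, issues)
```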


3.4 Tracking Metrics and Statistics 

The first research question is to determine what metrics and 
statistics are most relevant to monitor over time in order to 
evaluate the health of a continuous assessment. In order 
to support a statistical quality assurance system, AQuAA 
monitors results in the following five categories across time, 
adjusting for seasonality effects. 


1. Scores. Test scores are directly used by test users (e.g., 
test takers, institutions), thus important indices at the 
level of test scores, including overall scores, sub-scores, 
and item type scores, are tracked in AQuAA. Score- 
related statistics include the location and spread of 
scores, inter-correlations between scores, bivariate or 
multivariate outliers, person fit, internal consistency 
reliability measures and standard error of measure- 
ment (SEM), and validity coefficients (e.g., correlation 
with self-reported external measures). 


2. Test taker profile. The composition of the test taker 
population is tracked over time, as it could be used to 
explain the variability in test scores to some extent. 
Specifically, the (percentage) volume of test takers in 
the important population categories, such as country, 
native language, gender, age, intent in taking the test, 
and other background variables, are tracked. In ad- 
dition, many of the score statistics are tracked across 
major test taker groups. 


3. Repeaters. Repeaters are defined as those who take the 
test more than once within a 30-day window¹. The
prevalence, composition, and performance of the re- 
peaters are tracked. The composition of the repeater 
population is defined with respect to the same test 
taker profile categories discussed above. The perfor- 
mance of the repeater population is tracked with many 
of the same test score statistics identified above, with 
additional statistics that are specific to repeaters: lo- 
cation and spread of both the first and second tests, as 
well as their difference, and test-retest reliability (and 
SEM). 


4. Item analysis. As tests consist of items, ensuring that 
items are of high quality and that the item quality is 
stable over time are the prerequisites of maintaining 
the validity of the test scores. In AQuAA, item qual- 
ity is quantified with four categories of item perfor- 
mance statistics: Item difficulty, item discrimination, 
item slowness (response time), and differential item 
functioning (DIF). Tracking these statistics would help 
test developers to develop expectations about the item 
bank with respect to item performance, flag items with 
extreme and/or inadequate performance, and detect 
drift in measures of performance across time. 


5. Item exposure. The item exposure statistics concern 
how frequently each item (or each group of items) is
used. Using an item either too frequently (over-exposure)
or too infrequently (under-exposure) is undesirable
for maintaining item quality. An important
statistic in this category is the item exposure rate,
which is calculated as the number of test administrations
containing a certain item divided by the total
number of test administrations. Tracking the item ex- 
posure rates can help flag under- or over-exposure of 
items. 
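
The item exposure rate defined in the last category can be computed
directly from session-level item lists. The sketch below is illustrative
only: the input layout, the function names, and the flagging thresholds
are assumptions, and suitable thresholds would depend on the item bank and
the test design.

```python
from collections import Counter


def exposure_rates(sessions, item_bank):
    """Exposure rate per item: sessions containing it / total sessions.

    sessions maps a session id to the item ids administered in it;
    item_bank lists every item so unused items get a rate of 0.
    """
    n_sessions = len(sessions)
    if n_sessions == 0:
        raise ValueError("at least one test administration is required")
    counts = Counter()
    for items in sessions.values():
        counts.update(set(items))  # count an item once per administration
    return {item: counts[item] / n_sessions for item in item_bank}


def flag_exposure(rates, low=0.1, high=0.6):
    """Return (over-exposed, under-exposed) item ids; thresholds are hypothetical."""
    over = sorted(item for item, rate in rates.items() if rate > high)
    under = sorted(item for item, rate in rates.items() if rate < low)
    return over, under


sessions = {
    "s1": ["a", "b"],
    "s2": ["a", "c"],
    "s3": ["a"],
    "s4": ["a", "b"],
}
rates = exposure_rates(sessions, item_bank=["a", "b", "c", "d"])
print(rates)
print(flag_exposure(rates))
```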


3.5 Identifying Patterns and Irregularities 
The second research question concerns the identification of 
patterns and irregularities in the data, which involves the 
development of the alarming mechanism of AQuAA. De- 
veloping the alarming mechanism in AQuAA is challenging 
partly due to the fact that the population of test takers is 
evolving and changing constantly, and, thus, many of the 
tracked metrics cannot be assumed to be stationary over 
time. Instead, the tracked metrics are often prone to sys- 
tematic variation over and beyond predictable changes due 
to seasonality effects, thereby making it complicated to set 
an appropriate alarming criteria for the alarming mecha- 
nism. 


¹ The day threshold for determining repeaters could be
adjusted based on the test-taking policy and the research
purpose.
The alarming mechanism in AQuAA is intended to detect
small but persistent trends as well as to flag large and abrupt
changes that may be due to a problem in the assessment. To
achieve these goals, we combined model-based psychometric
analysis methods with time series and control chart
techniques, both of which are useful for distinguishing
systematic changes from chance variation in outcome processes.


The psychometric model-based methods allow us to track 
metrics after adjusting for certain factors (e.g., test tak- 
ers’ background characteristics), thus increasing the metrics’ 
comparability over time. Specifically, in AQuAA, the item 
statistics and metrics are adjusted for test taker ability and 
background variables, and test taker statistics and metrics 
are adjusted for item characteristics. 
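
The effect of such an adjustment can be illustrated with a much simpler
stand-in, direct standardization: the daily mean score is recomputed with
fixed reference weights for each test taker group, so that a change in
population mix alone does not move the tracked metric. This is not the
model-based psychometric adjustment used in AQuAA; the group labels,
weights, and function name are assumptions made for illustration.

```python
def standardized_mean(scores_by_group, reference_weights):
    """Weighted mean of group means using fixed reference weights.

    Groups with no test takers on a given day are skipped and the
    remaining weights are renormalized.
    """
    numerator = 0.0
    denominator = 0.0
    for group, weight in reference_weights.items():
        scores = scores_by_group.get(group, [])
        if scores:
            numerator += weight * (sum(scores) / len(scores))
            denominator += weight
    if denominator == 0:
        raise ValueError("no scores in any reference group")
    return numerator / denominator


def raw_mean(scores_by_group):
    """Unadjusted mean over all test takers."""
    all_scores = [s for scores in scores_by_group.values() for s in scores]
    return sum(all_scores) / len(all_scores)


# Hypothetical data: group means are unchanged, only the mix shifts.
weights = {"A": 0.5, "B": 0.5}
day1 = {"A": [100, 110], "B": [80, 90]}
day2 = {"A": [105], "B": [85, 85, 85]}

print(raw_mean(day1), raw_mean(day2))
print(standardized_mean(day1, weights), standardized_mean(day2, weights))
```

In this example the raw mean drops from 95 to 90 purely because group B
makes up a larger share on the second day, while the standardized mean
stays at 95 on both days.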


3.6 Communicating Results 

Our third research question involves how to communicate 
the information to the operational analysts as well as to 
the business unit. To visualize the trends and patterns of 
the statistics and facilitate the communication in the hu- 
man review process, statistics are plotted using the ggplot2 
R package [20]. Line plots are one of the most basic tools to 
visualize the time-series data. For example, Figure 2 demon- 
strates the stable trend of the mean of the overall test score 
during the Fall of 2020. Each dot in these figures represents a
statistic calculated from one day's worth of data; the lines are
smoothed lines created by the locally weighted scatterplot 
smoothing (LOWESS) [3] method in order to represent the 
trends of the statistics. 


Plots are also used to visualize the alerts raised by the 
AQuAA alarming mechanism introduced in Section 3.5. In 
AQuAA, the alerts are classified into three severity cate- 
gories which are represented by different color codes. Specif- 
ically, yellow, orange and red represent low, medium and 
high levels of severity, respectively. For example, Figure 3
displays a monitoring plot for the daily median response
time with a few low-severity alerts. Once an alert is raised
by AQuAA, messages are automatically sent to inform all the
relevant stakeholders via email and the organization’s
communication tool.
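As a rough illustration of how a control-chart rule could map a daily statistic to the three color codes, consider the sketch below. The thresholds of 2, 3, and 4 standard deviations and the function name `classify_alerts` are assumptions made for this example; the actual AQuAA rules are not specified here.

```python
from statistics import mean, stdev

def classify_alerts(baseline, new_values, limits=(2.0, 3.0, 4.0)):
    """Label each new value by its distance from a baseline period.

    Assumption: severity limits are multiples of the baseline standard
    deviation (hypothetical values, not taken from AQuAA).
    """
    center = mean(baseline)
    spread = stdev(baseline)
    labels = []
    for value in new_values:
        z = abs(value - center) / spread
        if z >= limits[2]:
            labels.append("red")      # high severity
        elif z >= limits[1]:
            labels.append("orange")   # medium severity
        elif z >= limits[0]:
            labels.append("yellow")   # low severity
        else:
            labels.append(None)       # within chance variation, no alert
    return labels
```

For example, `classify_alerts(last_month_medians, this_week_medians)` would return one label per day, which could then drive both the plot colors and the notification messages.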


Various statistics and figures are integrated into an interac- 
tive dashboard using the flexdashboard [6] package. Figure 
A.1 demonstrates the layout of the dashboard. At the top 
of the dashboard (i.e., Section 1), there are five tabs cor- 
responding to the five categories of statistics articulated in 
Section 3.4. Within each tab, the relevant statistics are
arranged into storyboards: the statistics can be further
classified into subcategories and allocated to different pages
(i.e., Section 2); figures are displayed in the main section
of the dashboard (i.e., Section 3); and text descriptions and
some numerical results are displayed in the commentary
section (i.e., Section 4).


4. THE APPLICATION OF AQUAA 


Because the quality assurance of digital-first assessments
combines automatic processes with human review, the AQuAA
system serves as the starting point for the human review
process, and the human review process, in turn, helps AQuAA
evolve into a more powerful tool for detecting assessment
validity issues. Figure B.1 shows an example human review
process that follows each weekly update of AQuAA: SMEs meet
to review the alerts raised by the AQuAA alarming mechanism
and to look for any anomalies that are suggested by the
AQuAA figures but were not caught by the alarming mechanism.
The SMEs review each individual alert and determine whether
it is an actual sign of a validity issue or a false alarm.
If the alarm is believed to be caused by a validity issue,
follow-up actions are taken to determine its severity and
urgency, and to fix and document the issue. If an issue was
not caught by the AQuAA alarming mechanism, the AQuAA
functionality is improved so that AQuAA becomes more
sensitive in detecting that kind of issue.

Figure 2: Trend of daily mean overall scores

Figure 3: Trend of daily median response time with alerts.


5. DISCUSSION 


This paper demonstrates the development of a quality assur- 
ance system that is tailored for digital-first assessments that 
are continuously administered. Several research questions 
motivated many of these approaches, as very few of the tra- 
ditional methods apply to the digital-first assessments. The 
steps and considerations for building the quality assurance 
system have been elaborated, so that test developers could 
adapt the methodologies in this paper to their own assess- 
ments. It should be noted that the list of quality assurance 
statistics presented here is not exhaustive. Instead, due to 
the data-rich nature of the digital-first assessment, the list 
of monitoring statistics is expected to be lengthened and 
improved as the research in statistical techniques advances. 
The list of monitoring statistics should also be customized to 
the purposes and characteristics of the assessment. Hence,
the infrastructure of AQuAA is designed to be flexible enough
to incorporate and monitor additional statistics.

APPENDIX
A. DEMO OF AQUAA

Figure A.1: Demo of AQuAA with annotations. Section 1 is
the navigation bar containing five tabs corresponding to the
five categories of statistics monitored in AQuAA. Within each
tab, the relevant statistics are grouped into subcategories and
arranged into storyboards. Section 2 displays the pages
that correspond to the subcategories of statistics. Section
3 is the main section of the dashboard, where figures are
displayed. Section 4 is the commentary section, which displays
the text description and numerical results.


B. SUBJECT MATTER EXPERT REVIEW
PROCESS

Figure B.1: Subject Matter Expert (SME) review process.


ABSTRACT


This paper presents an exploratory study of LMS log data
from undergraduate fully-online flipped classrooms. The
instructional video watching behaviors of a total of 237
students were extracted from the LMS and analyzed together
with background variables to predict students’ final
performance. Regularization was proposed as a suitable
machine learning technique, as it produces interpretable
prediction models. Specifically, Enet (elastic net) and Mnet
were employed to handle possible multicollinearity in LMS
log data, and the prediction models of Enet and Mnet
identified 19 and 21 important predictors of final
performance out of 157, respectively. In particular, both
regularization models were able to screen lower-performing
students as early as the first week of the course. Even mere
attempts to watch difficult videos after class were
associated with higher final scores.


Keywords 
LMS log data, machine learning, regularization, flipped class- 
room, performance modeling 


1. INTRODUCTION 

The COVID-19 pandemic has changed the education system 
worldwide. Online learning is no longer merely an option, and an
increasing number of online classes incorporate components 
of flipped classrooms (FC) in an effort to improve the quality 
of learning and instruction. Despite varying results regard- 
ing the effectiveness of flipped learning in higher education 
[1, 2, 3, 4], FC has grown rapidly as an innovative peda- 
gogical approach in recent decades. In FCs, students’ active 
involvement in pre-class activities is greatly emphasized as a 
necessary condition to enhance in-class learning and instruc- 
tion [5]. However, there has been little empirical research on 
whether students completed the assigned pre-class activities 
and whether pre-class activities lead to desired outcomes. 


This may relate to analytical limitations of the previous
research in terms of data and methods. First of all, learning
management system (LMS) log data are a crucial source of
information for capturing students’ learning activities.
However, not all studies on FC collected data from an LMS,
particularly when pre-class assignments did not involve
activities in the LMS. For instance, reading assignments
completed outside of the LMS cannot be properly recorded.
Researchers can ask students ex post facto in a self-report
survey. However, self-report questionnaires rely on memories
and reflections, and thus are prone to social desirability
bias. On the other hand, LMS log data unobtrusively collect
near-real-time information; students’ activities in the LMS
are automatically stored in the log files without the
students’ cognizance [6, 7, 8, 9]. Particularly in the
COVID-19 situation, fully-online FCs have emerged. In
fully-online FCs, both pre-class and in-class activities take
place online using platforms such as an LMS, and therefore
collecting trace data has become much easier than in the
original FCs.


Next, there is room for improvement in terms of analysis 
methods. Despite the aforementioned advantages that log 
data bring to data analyses, the intractability of log data 
has been a practical hindrance. Log data are unstructured, 
which can lead to high-dimensional data (i.e., more variables 
than observations), depending on data pre-processing and 
cleaning. Previous research using LMS log data to model
students’ achievement has analyzed students’ behavioral data
(e.g., instructional video watching behaviors) [6, 10, 11, 7, 
12] as well as background (e.g., gender) [10, 13] and exam 
data [11, 13]. In particular, behavioral data were used as a 
tool to measure students’ self-regulated learning [6, 10, 11, 
7, 13, 12, 14, 15, 16], but aggregate variables such as total 
login frequencies or average login hours were analyzed with 
traditional methods [13, 15] or early machine learning (ML)
techniques [14, 16]. Because traditional methods are likely
to run into nonconvergence problems with high-dimensional
data, previous research may have resorted to aggregate
variables.


However, study time relevant to a specific instructional unit 
can be traced from log data, which will serve as a better 
indicator than the sum of study time, a crude measure of 
time investment in studying [15]. Such detailed information 
in turn will be conducive to understanding learning and in- 
struction and giving specific, targeted, and timely feedback 
to students. This relates back to the issue of the previous 
LMS log data research: a lack of empirical research on the
relationship between pre-class assignments and students’
performance at the instructional unit level. Particularly
when behavioral variables at the instructional unit level
are to be analyzed, ML is a necessary technique for analyzing
LMS log data from fully-online FCs.


Since completing pre-class assignments and preparing for 
interactive in-class activities is critical in FC, a high level 
of self-regulated learning (SRL) is necessary for students 
to succeed. SRL strategies such as effective time management,
metacognition, effort regulation, and critical thinking have
been shown to have a significantly positive effect on
students’ academic success [17]. The question is which
behaviors indicate SRL. Students who practice SRL would
naturally spend more time attending lectures and studying on
their own, both of which have a positive effect on academic
achievement [18]. Previous studies have
used variables such as login frequencies, LMS menu usage, 
material download, content pages viewed, and posted mes- 
sages [6, 10, 11, 13, 14, 15, 16]. However, aggregate mea- 
sures of these data display inconsistent effects on student 
achievement. For instance, login frequencies [13, 16] and
LMS menu usage [13, 14, 15] were statistically significant or
important indicators of students’ academic achievement in
online learning. In contrast, in massive open online course
(MOOC) environments, forum variables such as the number of
messages posted or comments received were found not to be
directly related to students’ learning [11].


Constant effort put into preparing for FCs may be difficult 
to capture with aggregate data. That is, instructional unit 
based log data would be a better predictor for academic 
success. A study predicting online student performance [19]
demonstrates that the study habits of students with high
levels of academic success can be observed even in the first
few weeks of a course. The implication is that
instructional-unit-based analysis could yield richer
information about students’ study patterns, which could
eventually enable timely intervention by the instructor.


Among ML techniques, this study proposes regularization.
Although ‘prediction’ is the operative word in ML, learning
analytics is one of the fields in which prediction needs to
be augmented with explanation. Regularization, or penalized
regression, is known to produce explainable prediction
models. Because regularization is based on linear regression,
its regression coefficients can be interpreted in a similar
way to those of traditional, non-penalized regression. This
is a great advantage in LMS data analysis, as prediction
models need to be interpreted under certain educational
settings, for instance to plan more effective intervention
strategies for at-risk students. Few studies have employed
regularization methods in LMS log data analysis.
Specifically, this study chose Enet [20, 21] and Mnet [22]
among regularization methods, as they handle
multicollinearity, a likely challenge in LMS data analysis.
The two main research questions were as follows:


1. What are the students’ instructional video watching be- 
haviors like at an instructional unit level? Do students 
complete pre-class assignments in fully-online undergradu- 
ate flipped classrooms? 


2. Among students’ behavioral and background variables, 
which variables are important to predict students’ academic 
achievement? 


2. MACHINE LEARNING 


For a Gaussian family, Enet and Mnet are expressed as
equations 1 and 2, respectively. The second term on the
right-hand side of equation 1 is the penalty function of
Enet, which has two tuning parameters: $\lambda$ and
$\alpha$. Enet is a combination of LASSO and ridge. The
parameter $\lambda$ regulates the shrinkage of the
coefficients, and the parameter $\alpha$ controls the balance
between the LASSO and ridge components. When $\alpha$ is 1,
equation 1 reverts to the LASSO equation, and when $\alpha$
is 0, it reverts to the ridge equation. As mentioned above,
by adding the ridge component to the equation, Enet can
handle multicollinearity.


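$$
\hat{\beta}^{\mathrm{Enet}} = \arg\min_{\beta} \left[ \frac{1}{2n} \left\| y - \sum_{k=1}^{K} X_k \beta_k \right\|^2 + \lambda \sum_{k=1}^{K} \left( \alpha \|\beta_k\| + (1-\alpha) \|\beta_k\|^2 \right) \right] \tag{1}
$$

$$
\hat{\beta}^{\mathrm{Mnet}} = \arg\min_{\beta} \left[ \frac{1}{2n} \left\| y - \sum_{k=1}^{K} X_k \beta_k \right\|^2 + \sum_{k=1}^{K} \rho\left(\|\beta_k\|; \lambda_1, \gamma\right) + \lambda_2 \sum_{k=1}^{K} \|\beta_k\|^2 \right],
$$

$$
\text{where } \rho(x; \lambda_1, \gamma) =
\begin{cases}
-\dfrac{x^2}{2\gamma} + \lambda_1 |x|, & |x| \le \gamma\lambda_1 \\
\dfrac{1}{2}\gamma\lambda_1^2, & |x| > \gamma\lambda_1
\end{cases} \tag{2}
$$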


Enet uses convex penalties, which keep increasing with the
absolute size of the coefficient. By contrast, Mnet uses a
concave penalty, which tapers off for coefficients with
larger absolute values, yielding nearly consistent
coefficient estimates [22]. Mnet has three tuning parameters
(equation 2). The parameter $\lambda_1$ plays the same
regularization role as $\lambda$ in Enet (equation 1). The
$\gamma$ parameter of Mnet controls the concavity of the MCP
(minimax concave penalty) term. As $\gamma$ goes to infinity,
the MCP penalty reverts to the LASSO penalty. Mnet also deals
with multicollinear data; the penalty associated with
$\lambda_2$ adds the ridge component to the equation.
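The behavior of the two penalties can be checked numerically with a small sketch. It assumes the common forms of the elastic net and MCP penalties applied to individual coefficients; scaling conventions (for example, a factor of 1/2 on the ridge term) differ between references and software, so the constants here are assumptions, and the function names are hypothetical.

```python
def enet_penalty(beta, lam, alpha):
    """Elastic net penalty: alpha = 1 gives LASSO, alpha = 0 gives ridge."""
    return lam * sum(alpha * abs(b) + (1 - alpha) * b * b for b in beta)

def mcp(x, lam1, gamma):
    """Minimax concave penalty for a single coefficient."""
    ax = abs(x)
    if ax <= gamma * lam1:
        return lam1 * ax - ax * ax / (2 * gamma)
    return 0.5 * gamma * lam1 ** 2  # flat beyond gamma * lam1: no extra shrinkage

def mnet_penalty(beta, lam1, gamma, lam2):
    """MCP plus a ridge term (ridge scaling is an assumption of this sketch)."""
    return sum(mcp(b, lam1, gamma) for b in beta) + lam2 * sum(b * b for b in beta)
```

Evaluating `mcp` for growing `x` shows the penalty leveling off at `0.5 * gamma * lam1 ** 2`, whereas `enet_penalty` keeps growing; with a very large `gamma`, `mcp(x, lam1, gamma)` approaches the LASSO value `lam1 * abs(x)`.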


To account for the bias resulting from data-splitting in
model validation, this study used subsampling techniques for
variable selection [23, 24]. The following three steps were
repeated 100 times with random data-splitting. First, the
whole data set was randomly divided at a ratio of 7:3 into
training and test data, respectively. Second, for a given
value of the penalty parameter $\lambda$, the training data
were split at a ratio of 4:1 to execute 5-fold
cross-validation (CV), and the resulting prediction error was
referred to as the CV error of that $\lambda$ [20]. Third,
the second step was repeated for every $\lambda$ in the
range, and the $\lambda$ with the lowest CV error served as
the penalty value of the regularization. The model with that
$\lambda$ value was applied to the test data from step 1,
which yielded the prediction measure.
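The three steps can be sketched as follows. To keep the example self-contained, a one-predictor ridge regression without intercept stands in for Enet and Mnet; that stand-in model, the function names, and the grid of penalty values are all assumptions for illustration, and the study itself used the grpreg library in R.

```python
import random

def split_indices(n, train_frac, rng):
    """Step 1: random split of row indices into training and test sets."""
    idx = list(range(n))
    rng.shuffle(idx)
    cut = int(round(train_frac * n))
    return idx[:cut], idx[cut:]

def k_folds(indices, k):
    """Partition the training indices into k folds of near-equal size."""
    return [indices[i::k] for i in range(k)]

def fit_ridge_1d(x, y, idx, lam):
    """Stand-in model (assumption): one-predictor ridge, no intercept."""
    sxy = sum(x[i] * y[i] for i in idx)
    sxx = sum(x[i] * x[i] for i in idx)
    return sxy / (sxx + lam)

def mse(x, y, idx, beta):
    return sum((y[i] - beta * x[i]) ** 2 for i in idx) / len(idx)

def one_iteration(x, y, lambdas, rng):
    train, test = split_indices(len(x), 0.7, rng)      # 7:3 split
    folds = k_folds(train, 5)                          # 4:1 within training
    cv_error = {}
    for lam in lambdas:                                # steps 2 and 3
        errors = []
        for fold in folds:
            held_out = set(fold)
            fit_idx = [i for i in train if i not in held_out]
            beta = fit_ridge_1d(x, y, fit_idx, lam)
            errors.append(mse(x, y, fold, beta))
        cv_error[lam] = sum(errors) / len(errors)
    best_lam = min(cv_error, key=cv_error.get)         # lowest CV error
    beta = fit_ridge_1d(x, y, train, best_lam)
    return best_lam, mse(x, y, test, beta) ** 0.5      # test RMSE
```

Calling `one_iteration` 100 times with the same `random.Random` instance reproduces the repeated random data-splitting; collecting the returned test errors gives the distribution of the prediction measure.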


The selection or non-selection of each variable from step 2
was counted over the 100 iterations, which served as the
selection counts of the study. In particular, this study
presents variables selected at least 1, 25, 50, or 75 times,
and in all 100 iterations [25, 26]. All the programs were
written in R 3.6.2. Specifically, the grpreg library [27] was
used for regularization.
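Selection counts are simple to compute once each iteration reports which variables received nonzero coefficients. The sketch below assumes that each iteration yields a collection of selected variable names; the helper names are hypothetical.

```python
from collections import Counter

def selection_counts(selected_per_iteration):
    """Count in how many iterations each variable was selected."""
    counts = Counter()
    for selected in selected_per_iteration:
        counts.update(set(selected))  # a variable counts once per iteration
    return counts

def selected_at_least(counts, threshold):
    """Variables whose selection count meets the threshold, sorted by name."""
    return sorted(name for name, c in counts.items() if c >= threshold)
```

With 100 iterations, `selected_at_least(counts, 50)` would list the variables chosen in at least half of the runs, and `selected_at_least(counts, 100)` those chosen every time.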


3. MATERIALS 


In the Fall semester of 2020, 242 undergraduate students in
a pre-service teacher program enrolled in 8 fully-online
undergraduate classes titled Measurement and Evaluation. The
classes of the Fall semester were mandatory for sophomores 
majoring in Liberal Arts and Social Sciences. Three instruc- 
tors (A, B, C) including a head-instructor (A) taught the 8 
classes. All the 8 classes scheduled a simultaneous final at 
the end of the course, and shared the same class materials 
including instructional videos, textbooks, and syllabus. The 
instructional videos were pre-recorded PowerPoint presenta- 
tions with the head-instructor talking, with content based 
on a book also written by the head-instructor. There were 
a total of 34 video clips covering 11 instructional topics in 
the corresponding 11 instructional weeks (refer to the videos 
01_1 to 11_4 in Appendix A).


On the orientation day of the first week, the importance of
the weekly pre-class video watching assignments was
emphasized, particularly because students
were asked to create and complete class projects within 
groups based on the contents of the assigned videos. During 
class, interactions in small groups of 4 to 6 students were 
greatly encouraged. The groups were engaged in discus- 
sions on team projects and SPSS exercises in Zoom break- 
out rooms. A non-mandatory quiz of 4-5 short questions was 
presented for each week in LMS. Students were told that the 
quizzes would serve as formative assessments and the quiz 
scores did not count toward grades. 


In total, 21,589 rows of video watching activities as well
as 5,107 rows of board-post readings were recorded in the
log file. As many of the students indicated that they used
the double-speed option of the LMS when watching videos,
this study used 50% of the video length as the completion
criterion. If a student watched 50% or more of a video’s
length, the student was counted as having completed the
video; otherwise, the viewing was counted as an incomplete
attempt.


As the first research question was to investigate students’
video watching behaviors at an instructional unit level, this
study counted the viewing frequencies of each video,
separating before-class from after-class viewing and
incomplete attempts from completed viewing. Specifically, 4
variables were created for each video: BI (incomplete attempt
before class), BC (complete watching before class), AI
(incomplete attempt after class), and AC (complete watching
after class). Six aggregate variables were also obtained for
comparison with previous research: BI, BC, and B
(before-combined (I+C)) for before-class counts; and AI, AC,
and A (after-combined (I+C)) for after-class counts.
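The construction of the per-video variables can be illustrated with a short sketch. The log row layout (student, video ID, seconds watched, video length in seconds, and a before-class flag) is a hypothetical simplification of the LMS log, and the function names are assumptions.

```python
from collections import defaultdict

def watch_code(watched_sec, video_len_sec, before_class):
    """Return BI, BC, AI, or AC for one viewing, using the 50% criterion."""
    timing = "B" if before_class else "A"
    status = "C" if watched_sec >= 0.5 * video_len_sec else "I"
    return timing + status

def count_watching(rows):
    """Count viewings per (student, variable), e.g. ('s1', 'BC01_1')."""
    counts = defaultdict(int)
    for student, video, watched, length, before in rows:
        counts[(student, watch_code(watched, length, before) + video)] += 1
    return dict(counts)
```

Each of the 34 videos thus yields four count variables per student, and summing or averaging them over videos gives the aggregate B and A variables.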


The response variable of this study was the final exam score
(final). The final test consisted of 35 multiple-choice items
and was given simultaneously to all 242 students in the last
week of the course. Five students missed the final, and
their data were excluded from further analysis. The
background and response variables were merged with the
variables from the LMS data, which resulted in the final
dataset of 157 predictors for 237 students.


4. RESULTS 
4.1 Students’ Video Watching Behaviors 


Table 1 summarizes the descriptive statistics of students’ in- 
structional video watching behaviors. The 6 groups of cells 
present the summary results of the aggregate variables. Stu- 
dents watched the videos more often after class than before 
class. Throughout the course students on average attempted 
to watch and completed watching each video about 1.03 and 
1.08 times after class, respectively, while the values dropped 
to 0.20 and 0.23 before class (Table 1). The mean values 
smaller than 1 indicate that the students on average did not 
watch all the videos before class. With attempts and
completions combined (I+C), students on average clicked each
video about 0.42 times before class, but more than twice
after class (2.11).


The range of students’ video watching frequencies was quite
wide. Some students clicked none of the videos after class
(AI min = 0.00), while others clicked and finished watching
each video after class as many as 4.20 (AI max = 4.20) and
2.47 times (AC max = 2.47), respectively. The maximum
frequencies of before-class watching were also lower than
those of after-class watching: 1.38 and 1.00 for incomplete
and complete watching, respectively.


4.2 Machine Learning Results 
4.2.1 RMSE and Selection Counts 


RMSE (root mean square error) was the prediction measure 
of the response variable of this study. The RMSE averages 
of Enet and Mnet were 5.58 and 5.69 with SDs of 0.50 and 
0.46, respectively. 
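For reference, RMSE is the square root of the mean squared difference between observed and predicted scores. A minimal sketch follows; the function name is an assumption.

```python
import math

def rmse(actual, predicted):
    """Root mean square error between observed and predicted values."""
    squared = [(a - p) ** 2 for a, p in zip(actual, predicted)]
    return math.sqrt(sum(squared) / len(squared))
```

Computing `rmse` on the test data of each of the 100 iterations and then taking the mean and standard deviation of those values gives summary figures of the kind reported above.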


Consistent with literature [28, 22], Mnet always selected 
fewer variables than Enet. Of note, 103 and 94 predictors 
were selected out of 157 at least once with Enet and Mnet, 
respectively. This underscores the importance of running
multiple iterations and employing selection counts,
particularly when the research purpose is variable selection
via regularization [25, 26]. In other words, employing
selection counts accounts for the bias resulting from random
data-splitting in model building.


Applying a threshold of 25 or more selection counts resulted
in 33 and 21 predictors for Enet and Mnet, respectively. A
total of 19 and 3 predictors were selected in at least half
of the 100 runs (50 or more) of Enet and Mnet, respectively.
Four predictors were selected in at least three quarters of
the runs (75 or more) of Enet, but there was no such
predictor with Mnet, and no predictor was selected in all
100 iterations.


4.2.2 Selected Variables 


This study presents the summary of predictors selected 50 or
more times for Enet and 25 or more times for Mnet in Table 2.
Due to space limits, only part of the results is discussed.
Student gender and grade were selected as important. When
the other variables were held constant, male students had
lower final scores than female students. Interestingly,
on-grade students, that is, sophomores, tended to have lower
scores. Higher values on students’ attitudes toward
measurement and evaluation (attitudes) were also associated
with higher scores.

Table 1: Students’ Before and After Watching Frequencies per Video

                 | I (incomplete)            | C (complete)              | I+C (combined)
                 | M    | SD   | min  | max  | M    | SD   | min  | max  | M    | SD   | min  | max
B (before class) | 0.20 | 0.18 | 0.00 | 1.38 | 0.23 | 0.08 | 0.06 | 1.00 | 0.42 | 0.21 | 0.08 | 1.62
A (after class)  | 1.03 | 0.78 | 0.00 | 4.20 | 1.08 | 0.38 | 0.00 | 2.47 | 2.11 | 1.00 | 0.06 | 6.24


Among variables extracted from log data, the total number of
clicks on SPSS material postings (spss.sum) and the numbers
of quiz-taking (test.M and test.P) were important predictors
of the final. More clicks on SPSS postings were associated
with higher scores on the final. Specifically, one more
click on the SPSS material was associated with an increase
in students’ final scores of 0.11 and 0.16 for Enet and
Mnet, respectively. Similarly, although students knew that
the scores on quizzes did not count toward the final grade,
simply taking the quizzes was associated with higher final
scores regardless of the device (mobile or PC). Students who
watched the instructional videos on mobile devices
(lecture.M) also tended to have higher scores on the final.


Among the 142 variables on video watching, 12 to 13
variables were selected as important depending on the
regularization method (Table 2). Findings from the 13
selected variables are as follows. First, the very first
video turned out to convey crucial information in predicting
students’ achievement, although it covered the easiest
contents on formative assessment. The more often students
completed watching the first video before class (BC01_1),
the higher their final scores were. Specifically, one more
completed watching of the first video before class was
associated with an increase in students’ final score of 0.57
in Enet and 0.68 in Mnet. By contrast, the more often
students completed watching the first video after class
(AC01_1), the lower the final scores were. One more
completed watching of the first video after class was
associated with a decrease in final scores of 0.45 and 0.53
in Enet and Mnet, respectively. This variable (AC01_1) was
the only AC variable with a negative relation to the final.


Second, with the exception of AC01_1, the other AC variables
(e.g., AC03_1, AC03_2, AC04_3, AC09_3, AC10_2, AC11_4) had
positive relations with the final. The selected AC variables
covered either the earlier technical contents or the most
difficult concepts at the end. In particular, the earlier
technical contents included the first SPSS practice (AC03_1,
AC03_2) and Ebel and Angoff standard setting (AC04_3).
Cronbach’s alpha (AC09_3), reliability with SPSS (AC10_2),
and the relationship between reliability and validity
(AC11_4) covered the most difficult concepts in the last
weeks of the course. Students who completed watching these
videos multiple times after class were more likely to obtain
higher final scores.

Third, the relationship of AI variables to the final seems to
depend on the class progress. Students who attempted but
failed to complete watching the video on Ebel and Angoff
standard setting, covered in the fourth instructional week,
had lower scores on the final (AI04_3). By contrast,
incomplete watching of some videos on the last topic (covered
in the last instructional week) was positively related to the
final (AI11_2 and AI11_4). Of note, both the AC and AI
variables on video 11_4, the last video, had positive
coefficients.

Table 2: Coefficients of Selected Predictors by Regularization

   | Variable    | Enet mean | Enet SD | Mnet mean | Mnet SD | #
 1 | gender      | -0.54     | 0.35    | -0.80     | 0.48    | 36
 2 | on-grade    | -0.61     | 0.22    | -0.75     | 0.31    | 56
 3 | test.M      | 0.34      | 0.12    | 0.48      | 0.15    | 38
 4 | test.P      | 0.23      | 0.10    | 0.31      | 0.15    | 49
 5 | lecture.M   | 0.01      | 0.01    | 0.02      | 0.02    | 30
 6 | attitudes   | 1.31      | 0.49    | 1.65      | 0.58    | 61
 7 | spss.sum    | 0.11      | 0.05    | 0.16      | 0.08    | 37
 8 | BC01_1      | 0.57      | 0.26    | 0.68      | 0.35    | 50
 9 | AC01_1      | -0.45     | 0.17    | -0.53     | 0.30    | 38
10 | AC03_1      | 0.26      | 0.16    | 0.39      | 0.28    | 25
11 | AC03_2      | 0.34      | 0.22    | 0.46      | 0.29    | 31
12 | AI04_2      | -0.19     | 0.08    | -0.23     | 0.13    | 40
13 | AC04_3      | 0.40      | 0.26    | 0.55      | 0.34    | 28
14 | BI06_2      | -0.46     | 0.18    | -0.59     | 0.23    | 34
15 | AC09_3      | 0.29      | 0.19    | 0.39      | 0.29    | 44
16 | AC10_2      | 0.44      | 0.24    | 0.61      | 0.32    | 44
17 | AI11_2      | 0.38      | 0.18    | 0.61      | 0.29    | 27
18 | AC11_4      | 0.29      | 0.21    | 0.42      | 0.29    | 35
19 | AI11_4      | 0.33      | 0.16    | 0.40      | 0.21    | 37
20 | on-semester |           |         | 1.39      | 1.09    | 25
21 | AI04_3      |           |         | -0.21     | 0.13    | 27
Note. # indicates the number of selections in 100 iterations.


5. DISCUSSION


This study predicted students’ final scores with as few as 19 
to 21 predictors out of 157 with regularization techniques. 
Of note, the prediction models of this study are explain- 
able, as we employed regularization, which is based on lin- 
ear regression. Specifically, Enet and Mnet were employed 
to handle multicollinear data. Surprisingly, the prediction 
models differentiated lower-performing students as early as 
the first instructional week, right after the orientation week. 
Instructors can thus invest their efforts in intervention
without waiting for a quiz or an exam. Completing difficult
videos multiple times after class was also associated with
higher scores on the final. Moreover, mere attempts to watch
them after class were also associated with higher scores.


Despite the importance of pre-class activities in FC, it has
been unclear whether students complete them and whether they
lead to the desired outcomes. This study also helped to
partly uncover what goes on behind the curtain of FC. The
students on average completed at most 1/5 of the videos
before class. Stronger links need to be established between
pre-class assignments and in-class team projects.

APPENDIX
A. VIDEO IDS AND LABELS

   | video ID | label
 1 | 1_1      | formative assessment
 2 | 2_1      | variables and scales
 3 |          | sampling
 4 |          | descriptive statistics
 5 |          | descriptive statistics (SPSS)
 6 | 4_1      | norm-referenced evaluation
 7 | 4_2      | criterion-referenced evaluation
 8 | 4_3      | Ebel and Angoff standard setting
 9 | 5_1      | measuring affective domains
10 | 5_2      | observation
11 | 5_3      | interviews
12 | 5_4      | survey
13 | 6_1      | performance assessment: definition
14 | 6_2      | performance assessment: scoring
15 | 7_1      | test construction steps
16 | 7_2      | multiple-choice items
17 | 7_3      | constructed-response items
18 | 7_4      | scoring caveats
19 | 8_1      | item difficulty and discrimination I
20 | 8_2      | covariance and correlation
21 | 8_3      | item difficulty and discrimination II
22 | 8_4      | item difficulty and discrimination (SPSS)
23 | 9_1      | introduction to reliability
24 | 9_2      | types of reliability
25 | 9_3      | Cronbach’s alpha
26 | 9_4      | standard error of measurement
27 | 9_5      | factors influencing reliability
28 | 10_1     | objectivity and reliability
29 | 10_2     | reliability (SPSS)
30 | 10_3     | objectivity (SPSS)
31 | 11_1     | content validity
32 | 11_2     | criterion-related validity
33 | 11_3     | construct validity
34 | 11_4     | reliability and validity


ABSTRACT


Undergraduate college students have substantial flexibility in 
choosing the order in which they take courses, since most courses 
either have no prerequisites or only a single prerequisite. However, 
the specific order that courses are taken can have an impact on 
student performance. This paper describes a general methodology 
for assessing the impact of course sequencing on student
performance, as measured by course grades, and applies this 
methodology to eight years of undergraduate academic data from 
Fordham University. The results demonstrate that certain course 
orderings are associated with improved student grade performance. 
This study introduces a methodology, new metrics, and a publicly 
available data-processing tool that can be applied to any student 
course-grade data set to measure course sequencing effects. The 
results can be used to inform student decisions, modify course 
recommendations, and even modify course prerequisites. 


Keywords 


Data mining, education, course sequencing, student performance 


1. INTRODUCTION 


Undergraduate university students have substantial flexibility in 
choosing what courses they take and when they take them. Course 
sequencing is usually enforced only by a modest set of course 
prerequisites. This study examines the impact of different course 
sequences on student learning outcomes, as measured by course 
grades. The data used in this study includes eight years of 
undergraduate student grade data from Fordham University. Prior 
studies on course sequencing have generally been quite limited. 
Similar research has focused more on course selection, that is, the
optimal set of courses for a student to take to maximize performance
or minimize time to graduation [4, 5], than on course sequencing.
Studies that
focused on course sequencing were limited to a single discipline, 
such as communications [7] and psychology [2]. Our study 
considers all undergraduate courses within the university, including 
sequences that span disciplines. Prior studies also only considered 
how early courses predict performance in later courses, whereas our 
study does not have this restriction and focuses instead on 
maximizing overall student performance. 


Our study considers the impact of sequencing on pairs of courses. 
This simplifies the analysis and reduces the risk of finding spurious 
correlations. The grade performance of students taking each pair of
courses in the two possible sequential orderings is measured, with
the goal of identifying the ordering that yields the best overall 
performance (concurrent registrations are excluded from the 
analysis). Comparing the grade performance for the two sequences 
required the development of new metrics, which we consider to be 
one of the contributions of this research. The methodology 
described in this study, along with the metrics that are introduced, 
are embodied in a publicly available software analysis tool [6]. 
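The pairwise comparison can be pictured with a small sketch. It assumes one record per student who took both courses, holding the term and grade for each course, and it compares the two orderings by the mean of the two grades. That mean is only a placeholder for illustration: it is not one of the metrics introduced in this study, and the record layout and function name are hypothetical.

```python
from statistics import mean

def compare_orderings(records):
    """records: (term_a, grade_a, term_b, grade_b) for each student.

    Terms are assumed to be comparable values (e.g. integers) where a
    smaller value means an earlier semester.
    """
    a_first, b_first = [], []
    for term_a, grade_a, term_b, grade_b in records:
        if term_a == term_b:
            continue  # concurrent registrations are excluded
        combined = (grade_a + grade_b) / 2  # placeholder measure (assumption)
        if term_a < term_b:
            a_first.append(combined)
        else:
            b_first.append(combined)
    return {
        "A_then_B": (len(a_first), mean(a_first) if a_first else None),
        "B_then_A": (len(b_first), mean(b_first) if b_first else None),
    }
```

Returning the group sizes alongside the averages makes it easy to discard course pairs for which too few students took one of the orderings.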


Every possible course-pair sequence is considered as long as there 
are a sufficient number of students to provide reliable results. 
However, our analysis focuses primarily on course sequences 
within certain departments and groups of departments. This focus 
is due to our affiliation with a Computer Science department and 
the current focus on STEM (Science, Technology, Engineering, 
and Mathematics) education that is driven by national interests and 
the needs of industry. We also examine course pairs that include 
both humanities and STEM courses, because we are interested in 
the role that a liberal arts education has on STEM education. 


There are many factors that can impact instructor performance [8], 
such as class size, course workload, and time of day of a class [1]. 
These factors also will impact student performance and hence can 
interfere with our ability to draw clear conclusions about course 
sequencing effects. In the present study, we normalize the grade 
data at the course section level to account for different instructor 
grading schemes, but do not address the other confounding factors. 
Our expectation is that the large number of course sections 
associated with most courses will limit the impact of these factors. 


There are several uses for the course sequencing analysis described 
in this paper. The most obvious is that this information can be used 
to improve recommendations provided to students concerning 
beneficial course orderings. When these benefits are substantial 
enough, official course prerequisites can be modified. Beyond these 
direct applications of the work, the sequencing results can provide 
insight into the relationships between courses, and this can be used 
to inform academic policies. For example, if Course A is not 
generally considered relevant to Course B, but nonetheless leads to 
improved student performance in Course B, then one might want to 
recommend Course A to students who must take Course B. 


2. METHODOLOGY 


This section describes the data set used, the data preprocessing and 
transformation that is necessary to convert the data into a form 
suitable for analysis, and the evaluation metrics that measure the 
impact of course sequencing. 


2.1 Initial Student Course-Grade Data Set 


The initial data set describes the grade performance of each 
undergraduate student in all course sections with at least five 
students. Each of the 473,527 data set records, which collectively
cover 24,969 distinct students, identifies a student, a course
(including the course section and semester), the instructor, and the
student’s grade in the course. Although we aggregate the
information to course level, section information is used to 
normalize student grades. Unfortunately, the initial data set cannot 
be made publicly available due to strict student privacy laws. 


2.2 Data Preprocessing and Transformation 
The analysis conducted in this study is based on pairs of courses. 
From the initial student course-grade data set, we compute and 
maintain information for each course sequence A → B and B → A,
where A and B represent arbitrary courses. For each of these 
sequences, we maintain a list of all students taking the two courses 
in the corresponding order, and the grades they receive in each 
course. The particular section each student enrolls in is also tracked, 
so that grades can subsequently be normalized at the section level. 
The transformation of the data from the student course-grade level 
to the course-pair sequence level, and the generation of our 
evaluation metrics, are accomplished using our publicly available 
Python-based tool [6]. 
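
The transformation from student course-grade records to course-pair sequences can be sketched as follows. This is a minimal illustration and not the code of the tool [6]; the record layout `(student, course, term, grade)`, the integer encoding of terms, the choice to keep only a student's first attempt at a course, and the name `build_pair_sequences` are all assumptions made for the example.

```python
from collections import defaultdict
from itertools import combinations


def build_pair_sequences(records):
    """Group students by the order in which they took each pair of courses.

    records: iterable of (student, course, term, grade), where term is an
    integer that increases with time (an assumed encoding).
    Returns {(first, second): [(student, grade_first, grade_second), ...]}.
    """
    by_student = defaultdict(dict)
    for student, course, term, grade in records:
        taken = by_student[student]
        # Keep only the earliest attempt at each course (simplifying assumption).
        if course not in taken or term < taken[course][0]:
            taken[course] = (term, grade)

    sequences = defaultdict(list)
    for student, courses in by_student.items():
        for a, b in combinations(sorted(courses), 2):
            term_a, grade_a = courses[a]
            term_b, grade_b = courses[b]
            if term_a == term_b:
                continue  # concurrent registrations are excluded
            if term_a < term_b:
                sequences[(a, b)].append((student, grade_a, grade_b))
            else:
                sequences[(b, a)].append((student, grade_b, grade_a))
    return dict(sequences)
```

Each key of the returned dictionary is one ordered sequence, so the two orderings of a course pair appear as two separate entries whose lengths can be compared directly.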


In this study, a course pair is analyzed if it meets two conditions. 
The first condition ensures that the percentage of students taking 
the sequence in each direction exceeds MinCSP, the Minimum 
Course Sequence Percentage. For this study, MinCSP is set to 30%, 
which ensures that both orderings are taken at least 30% of the time. 
This excludes abnormal situations where a particular course 
sequence is rarely taken, such as when a student takes an 
introductory class in their senior year or retakes a failed course 
outside of the normal order. The second condition ensures that at 
least a minimum number of students, MinCount, aggregated over 
all course sections, takes the courses in each order. MinCount is 
utilized to ensure that the sample size is sufficient to generate 
reliable results. For this study MinCount is set to 50 students. 
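
The two conditions can be expressed as a simple filter. The sketch below is illustrative only: the function name `passes_thresholds` is hypothetical, and treating both thresholds as inclusive (at least 30% and at least 50 students in each ordering) is an assumption based on the wording above.

```python
def passes_thresholds(n_ab, n_ba, min_csp=0.30, min_count=50):
    """Return True if a course pair qualifies for analysis.

    n_ab: number of students who took course A before course B.
    n_ba: number of students who took course B before course A.
    """
    total = n_ab + n_ba
    if total == 0:
        return False
    smaller = min(n_ab, n_ba)
    # MinCSP: the less common ordering must still cover min_csp of the students.
    share_ok = smaller / total >= min_csp
    # MinCount: the less common ordering must have at least min_count students.
    count_ok = smaller >= min_count
    return share_ok and count_ok
```

For instance, a pair taken by 60 students in one order and 100 in the other passes both checks, whereas a pair with 40 and 60 students satisfies MinCSP but fails MinCount.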


Table 1 specifies how many course pairs remain after these 
conditions are applied. The conditions are applied sequentially, 
with MinCSP applied before MinCount. The values in the rightmost 
column reflect the number of course pairs actually analyzed. 
Table 1 displays the number of course pairs for the entire data set, 
as well as for the five course subsets that are of particular interest 
to us. Our university has no engineering school, so the STEM 
courses are offered by the Biology, Chemistry, Computer Science, 
Mathematics, Natural Sciences, Physics, and Psychology 
departments. The Humanities courses include all courses from the 
African and African American Studies, Anthropology, Art History, 
English, Philosophy, Theology, and Visual Arts departments. 


Table 1. Number of course pairs for different course subsets 


Data Set                        Threshold
                       None     MinCSP=30%   MinCount=50

Full Data Set          81,327   21,461       1,939
Computer Science       850      253          14
Mathematics            392      92           23
Mathematics and CS     1,724    490          51
STEM                   12,055   3,000        291
STEM & Humanities      27,303   6,646        684


2.3 Evaluation Metrics 

Several metrics are used to analyze the impact of course sequencing 
on student performance. These metrics are based on lower-level 
metrics, which are introduced first. Ultimately, we want to see how 
the mean grades for each course in a course pair are impacted by 
course order in order to determine the optimal ordering and net 
benefit in grade performance. 


The first step computes the mean grades for each course in a course 
pair for each of the two orderings. Because instructors vary widely 
in their leniency when assigning grades, all grades are normalized 
at the course section level using z-score normalization, as described 
by Equation 1. In this equation $x_i$ represents the grade of student i
in the course section, $\mu$ represents the mean of the section grades,
and $\sigma$ represents the standard deviation of the section grades.


$$z_i = \frac{x_i - \mu}{\sigma} \qquad (1)$$
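
A minimal sketch of the section-level normalization in Equation 1 is given below. The use of the population standard deviation and the handling of sections in which every student receives the same grade are assumptions; the text does not specify either choice, and `normalize_section` is a hypothetical name.

```python
from statistics import mean, pstdev


def normalize_section(grades):
    """Return the z-scores of the grades in one course section."""
    mu = mean(grades)
    sigma = pstdev(grades)  # population standard deviation (assumed)
    if sigma == 0:
        # All grades are identical; map every grade to 0 (assumed convention).
        return [0.0 for _ in grades]
    return [(x - mu) / sigma for x in grades]
```

After normalization the grades in a section have mean zero, so a lenient and a strict instructor teaching the same course contribute comparable values.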


For every course pair <A, B> we determine the average normalized
grade for each course based on each ordering. Specifically, we
compute $\mu_A(B \to A)$, $\mu_A(A \to B)$, $\mu_B(A \to B)$, and $\mu_B(B \to A)$,
where the subscript of $\mu$ denotes the course for which the
normalized mean is computed and $A \to B$ indicates that course A
is taken before course B (and vice versa for $B \to A$). As an example,
for the course pair <Math I, English I>, $\mu_{\text{Math I}}(\text{English I} \to \text{Math I})$
represents the mean normalized grade in Math I for students who
took Math I after English I.


These normalized means are used to compute the difference in
mean normalized grades (DNG). Two DNG values are computed
for each course pair <A, B> since the difference in normalized mean
grades is computed for each course. Equations 2 and 3 define these
values, where $DNG_{A.B}$ is the difference in mean normalized grade
for Course A when Course A is taken after course B rather than
before course B, and $DNG_{B.A}$ is the difference in mean normalized
grade for Course B when Course B is taken after course A rather
than before course A. We compute the difference using the order
noted in the equations, because we generally expect students to
perform better in a course when it is taken second and anticipate
that most DNG values will be positive.


$$DNG_{A.B} = \mu_A(B \to A) - \mu_A(A \to B) \qquad (2)$$
$$DNG_{B.A} = \mu_B(A \to B) - \mu_B(B \to A) \qquad (3)$$


The DNG equations measure the benefit of taking two courses in a 
particular order, but do not reflect the net benefit of one ordering 
over the other (if both DNG values are positive then the difference 
between the orderings will be reduced). We therefore compute the 
order benefit, OB, which is the net difference in DNG values of one 
ordering over the other. The OB is defined relative to a specific 
course ordering, as indicated in Equation 4. The OB value will be 
calculated for both possible orderings, but we will only list the one 
that is positive, which indicates the optimal course ordering. 


$$OB_{A \to B} = DNG_{B.A} - DNG_{A.B} \qquad (4)$$


We work through an example using <Math I, English I>, assuming 
the following statistics: 


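$\mu_{\text{Math I}}(\text{English I} \to \text{Math I}) = 0.40$
$\mu_{\text{Math I}}(\text{Math I} \to \text{English I}) = -0.05$
$\mu_{\text{English I}}(\text{Math I} \to \text{English I}) = 0.40$
$\mu_{\text{English I}}(\text{English I} \to \text{Math I}) = -0.10$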


Assuming Math I takes on the role of Course A and English I
Course B, using Equation 2, $DNG_{A.B}$ = 0.40 - (-0.05) = 0.45, and
using Equation 3, $DNG_{B.A}$ = 0.40 - (-0.10) = 0.50. Applying
Equation 4, we get $OB_{A \to B}$ = 0.50 - 0.45 = 0.05. These results are
summarized in the first row of Table 2. The assignment of the two
courses to A and B is arbitrary, so we can reverse them, which
corresponds to the course ordering in the second row of Table 2.
Then, using Equation 2 and Equation 3, we get $DNG_{A.B}$ = 0.50 and
$DNG_{B.A}$ = 0.45, which yields an OB value of 0.45 - 0.50 = -0.05. The
values of $DNG_{A.B}$ and $DNG_{B.A}$ in Table 2 are flipped when we
reverse the roles of A and B (compare rows 1 and 2). This is
logically and mathematically required given the definition of the
DNG metric, so the OB value of one ordering must equal the
negative of the other. The results in Table 2 show that taking Math I
and then English I yields an overall improvement in normalized
grades of 0.05, whereas taking the courses in the reverse order
yields a net deterioration of 0.05.
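
The worked example can be reproduced with a few lines of code. The sketch below applies Equations 2 through 4 to the four mean normalized grades given above; the function and variable names are hypothetical and are not taken from the tool [6].

```python
def dng(mean_taken_second, mean_taken_first):
    """Difference in mean normalized grade for one course (Equations 2 and 3)."""
    return mean_taken_second - mean_taken_first


def order_benefit(dng_first_course, dng_second_course):
    """Order benefit of taking first_course before second_course (Equation 4)."""
    return dng_second_course - dng_first_course


# Mean normalized grades from the worked example.
math_after_english = 0.40
math_before_english = -0.05
english_after_math = 0.40
english_before_math = -0.10

dng_math = dng(math_after_english, math_before_english)
dng_english = dng(english_after_math, english_before_math)

# Order benefit for each of the two possible orderings.
ob_math_then_english = order_benefit(dng_math, dng_english)
ob_english_then_math = order_benefit(dng_english, dng_math)

print(round(dng_math, 2), round(dng_english, 2))
print(round(ob_math_then_english, 2), round(ob_english_then_math, 2))
```

Up to floating-point rounding, the two order benefit values are negatives of each other, which matches the symmetry discussed above.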


Table 2. Example of a course pairing 


Course A     Course B     $DNG_{A.B}$   $DNG_{B.A}$   $OB_{A \to B}$

Math I       English I    0.45          0.50           0.05
English I    Math I       0.50          0.45          -0.05


3. RESULTS 


This section provides selected results from our analysis, with a 
focus on the difference in normalized grades for different course 
sequences. Order benefit is our primary metric, as it summarizes 
the net benefit of a particular course sequence over the alternative, 
but DNG is also informative since it specifies the amount of benefit 
in taking one course before the other. For example, it is possible for 
two competing sequences to have identical positive DNGs, leading 
to a zero order benefit. Top order benefit results are presented for 
course sequences restricted to: Computer Science, Math, Math and 
Computer Science, STEM, STEM and Humanities, and “All 
Courses” across all disciplines. We posit explanations for some of 
the results based on our knowledge of the domain. 


The top three order benefit values for computer science courses are 
displayed in Table 3. Note that while the sequence Computer
Algorithms → Data Mining has the highest OB value, based on the
$DNG_{B.A}$ values, taking Data Communications and Networks after
Data Mining yields a slightly greater improvement than taking
Data Mining after Computer Algorithms. The key difference is that
taking each of those pairs of courses in the opposite order (i.e.,
$DNG_{A.B}$) yields very different results. The two negative $DNG_{A.B}$
values in Table 3 indicate that students earn lower grades in the
corresponding courses when those courses are taken second. Specifically,
students in Computer Algorithms perform worse when they take it second. We
generally would not expect this to occur. This result may stem from
weaker students who delay taking Computer Algorithms.


Table 3. Computer Science courses with largest order benefit 


Course A          Course B               $DNG_{A.B}$   $DNG_{B.A}$   OB

Computer Alg.     Data Mining            -0.110        0.233         0.343
Data Structures   Computer Organization  -0.073        0.103         0.176
Data Mining       Data Comm. & Netwks.    0.101        0.235         0.134


A plausible explanation for the first entry in Table 3 is that Data 
Mining utilizes some knowledge of Computer Algorithms and 
hence taking Data Mining second is beneficial. While the same 
reasoning could be applied to the reverse ordering, the negative 
DNG indicates no benefit for that ordering, possibly because Data 
Mining does not teach the basics of computer algorithms. With 
respect to the entry in the second row of Table 3, the benefit of 
foundational mathematics and algorithmic knowledge provided by 
Data Structures is apparent in the somewhat more application- 
oriented Computer Organization course. 


Table 3 shows that negative $DNG_{A.B}$ values are smaller in magnitude
than positive $DNG_{B.A}$ values, a finding replicated in subsequent
tables. The presence of negative $DNG_{A.B}$ values may be an artifact
of our focus on course pairs with the highest overall order benefit,
because a negative $DNG_{A.B}$ increases the order benefit (Equation 4).


Table 4 shows the results for three sequences of mathematics 
courses. The third entry is the easiest to explain. Business Finite 
Math and Finite Math cover similar material, but the former covers 
more basic material. Students are not generally expected to take 
both courses, but if they do, they most likely will take the more 
basic one first. Discrete Math provides a background in formal 
proofs, which appears to benefit from advanced mathematical 
experience (Multivariable Calculus I) and to provide benefit to 
advanced study of calculus (Multivariable Calculus II).


Table 4. Mathematics courses with largest order benefit 


Course A               Course B             $DNG_{A.B}$   $DNG_{B.A}$   OB

Discrete Math          Multivar. Calc. II   -0.056        0.252         0.308
Multivar. Calc. I      Discrete Math        -0.041        0.249         0.290
Business Finite Math   Finite Math          -0.024        0.145         0.169


Most computer science programs require several mathematics 
courses, but the specific impact of the math courses on computer 
science courses is not well understood. Table 5 explores the relation 
between the two departments, restricting the sequences to include 
one math course and one computer science course. One of the more
notable results is the entry in the first row. Both courses teach finite 
mathematics, but Structures of Computer Science is offered by the 
Computer Science department and is intended for non-majors, 
while Finite Math is offered by the Mathematics department. 
Structures of Computer Science also devotes several weeks to cover 
simple programming assignments, thereby further reducing the 
time spent on the mathematics content. For these reasons, it is 
reasonable to conclude that the sequence with the high OB value 
corresponds to taking the more basic course first. It is also 
noteworthy that Calculus I has a very positive impact on taking 
programming courses (Computer Science I and its lab) and 
Structures of Computer Science. Thus it appears that increased 
mathematical sophistication does have a positive impact on 
computer science and computer programming. This is especially 
interesting because the mathematical material in Calculus I has 
only a tangential relationship with computer science. Most 
computer science programs require calculus, and our empirical data 
justifies this requirement. 


Table 5. Math and CS courses with largest order benefit 


Course A           Course B           $DNG_{A.B}$   $DNG_{B.A}$   OB

Structures of CS   Finite Math        -0.002        0.429         0.431
Calculus I         CS I               -0.035        0.338         0.373
Calculus I         CS I Lab           -0.012        0.252         0.264
Calculus I         Structures of CS   -0.010        0.213         0.223


Table 6 displays the remaining results for the three groupings of 
sequences: STEM courses, mixed STEM and humanities courses, 
and all courses without any restrictions. The first entry under the 
STEM category shows a benefit in taking Applied Calculus I after 
General Chemistry I. This ordering is typical for students on the 
Pre-Health track who wish to go to medical school, which may 
explain the high order benefit, since these students are generally 
motivated to achieve high grades. Furthermore, under the STEM 
category we find a benefit for Learning (Psychology) followed by 
Multicultural Psychology. The first psychology course in this 
sequence is a 2000 level course while the second is a 3000 level 
course, indicating yet again that there is a benefit from taking a 
more advanced course in the same discipline second. 


Looking at the STEM & Humanities courses, students who took
Organic Chemistry I before Intro. to Cultural Anthropology did
substantially better in both classes, as demonstrated by the
magnitudes of the DNG values (the negative $DNG_{A.B}$ indicates
that students do worse in Organic Chemistry I when it is taken second
and hence do better when it is taken first). The same pattern is replicated
with an even higher OB when considering the Organic
Chemistry Lab. Pre-Health students tend to take Organic
Chemistry very early in their college career and may dominate that
particular course ordering.


The first row under the “All Courses” category displays the 
sequence Spanish Language & Literature → Christian Hymns with
a very high order benefit. Students performed best in each of the 
two courses when taking them in the specified sequence. This may 
be due to the fact that Spanish literature is heavily influenced by 
Christianity, and therefore provides important background for 
students who plan to take Christian Hymns. Explanations for the 
other entries may require consultation with faculty from the 
associated departments. 


Table 6. STEM, STEM/Humanities, All courses with large OB 


Course A               Course B                 $DNG_{A.B}$   $DNG_{B.A}$   OB

STEM Courses
General Chem. I        Applied Calculus I       -0.170        0.400         0.570
Intro. Astronomy       Abnormal Psych.          -0.187        0.309         0.496
Learning (Psych.)      Multicultural Psych.     -0.021        0.419         0.440
Intro. Bio. I          Structures of CS         -0.152        0.283         0.435
Structures of CS       Finite Math              -0.002        0.429         0.431
Gen. Chem. Lab I       Structures of CS         -0.102        0.325         0.427
Calculus I             CS I                     -0.035        0.338         0.373
Intro. Bio Lab I       Structures of CS         -0.069        0.297         0.366
Physics II Lab         Human Physiol. Lab       -0.019        0.287         0.306

STEM & Humanities Courses
Org. Chem. Lab I       Intro. Cultural Anthr.   -0.520        0.606         1.126
Organic Chem. I        Intro. Cultural Anthr.   -0.330        0.554         0.884
Forensic Science       Philosophical Ethics     -0.310        0.474         0.784
Texts & Contexts       Discrete Math            -0.234        0.372         0.606

All Courses
Spanish Lang. & Lit.   Christian Hymns          -0.436        0.714         1.150
Medieval History       Intro. Media Industry    -0.178        0.550         0.728
Composition II         Intro. Archaeology       -0.218        0.494         0.712
Sociology Focus        Faith & Crit. Reason     -0.134        0.565         0.699
Calculus II            Intro Sociology          -0.111        0.488         0.599
American History       Personality (Psych)      -0.095        0.487         0.582


4. CONCLUSION 


The research described in this study introduced a methodology and 
set of metrics for assessing the impact of course sequencing on 
student performance. The analysis of our results focuses on several 
disciplines, such as Computer Science and Mathematics, as well as 
higher level groupings, such as STEM courses. Many of the results 
demonstrate that there is a substantial benefit with a particular 
sequencing of courses, such as taking Finite Math after Structures 
of Computer Science or taking Computer Science I after Calculus I. 
Our methodology and metrics are implemented in our Python- 
based software tool [6], which can be used by other researchers. 


The course sequencing results in this paper can be used to assist 
with course recommendations and can be used to inform, and even 
modify, course prerequisites. For example, our results show a larger
than expected benefit of taking calculus before a programming
course; additional analysis and data will be needed to see if this 
extends to a broader set of mathematics courses, but if it does, then 
new prerequisites perhaps should be added. The results in this study 
also provide insight into the inter-relationships between courses 
and disciplines. 


Many of our observed results can be explained based on our 
knowledge about college education and domain knowledge of 
specific disciplines. However, in some cases explanations are not 
readily available. Our search for explanations of why one sequence 
may outperform another can also benefit from additional domain 
knowledge, as our knowledge is mainly limited to computer 
science. Course syllabi could also prove to be useful. It would also 
be very interesting to apply our methodology to data from different 
universities, and we hope to do this in the future. It would be 
informative to see if the course sequencing patterns present in our 
university hold elsewhere. Although our university is relatively 
large, in many cases the number of students taking some pairs of 
courses was relatively small, and this informed our relatively low 
MinCount threshold of 50. With more data, we could increase this 
threshold, which would diminish the impact of factors like 
instructor effectiveness. 


Our methodology normalizes for some external factors, such as 
different instructor grading schemes, but does not account for all 
factors that can impact student performance. In particular, we 
suspect that some course sequencing results are due to certain 
populations of students (e.g., Pre-Health students) taking courses in 
one particular order over another. In future work we do plan to 
consider some of these factors and modify our evaluation to isolate 
their impact. In cases where that is not feasible, we will at least 
provide summary statistics to assess the influence of these factors. 
For example, since we suspect that academically stronger students 
sometimes take courses in a different sequence than weaker or less 
motivated students, we can compare the overall GPAs of students 
taking the courses in each course ordering and note when they 
exhibit a statistically significant difference. Alternatively, we can 
normalize for overall student GPA, something that we are currently 
doing in a study on instructor effectiveness. 


One final area that we plan to pursue is better evaluation of our 
results. One way to do this is to utilize statistical significance 
testing. Given the number of potential patterns we can find with the 
large number of pairs of courses, we may need to set our p-value 
quite low. We may be able to improve this situation by limiting our 
course interactions to courses within a single department or 
between related fields (e.g., Biology and Chemistry). We can also 
validate our results by partitioning the data into a training and test 
set and subsequently verifying if the patterns found in the training 
data hold for the test data. In this regard, the differences in student 
performance can be viewed as predictions, so the standard training 
and test set evaluation methodology applies. 
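
To illustrate why the significance level may need to be quite low, the sketch below applies a Bonferroni correction, which is one common way to control for multiple comparisons. The choice of this correction, the family-wise level of 0.05, and the assumption of one test per analyzed course pair are ours for illustration and are not part of the methodology described above.

```python
def bonferroni_threshold(alpha, num_tests):
    """Per-test significance level that bounds the family-wise error rate by alpha."""
    if num_tests < 1:
        raise ValueError("num_tests must be at least 1")
    return alpha / num_tests


# 1,939 course pairs were analyzed for the full data set (Table 1).
threshold_all = bonferroni_threshold(0.05, 1939)

# Restricting the analysis to the 51 Mathematics and CS pairs relaxes the threshold.
threshold_math_cs = bonferroni_threshold(0.05, 51)

print(threshold_all, threshold_math_cs)
```

Under these assumptions the per-test threshold for the full data set is roughly 2.6e-5, while limiting the analysis to a smaller group of related courses yields a far less restrictive threshold.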


The data utilized in this study is itself a valuable resource. Our 
research group has analyzed this data in a variety of ways to provide 
additional insights. Two studies have used this data to group/cluster 
courses and analyze the interrelationships between courses. One of 
these studies uses course co-enrollments to form the clusters and to 
identify hub courses [9], while the other uses the correlation 
between student grades as a similarity metric to cluster the
courses [3]. Both of these studies used their respective notions of 
similarity to form networks of courses, and then analyzed these 
with existing network analysis techniques. 


ABSTRACT


This study computes the correlation of student grades be- 
tween pairs of courses in a large university. Course net- 
work graphs are then generated, where courses are repre- 
sented as nodes and courses are connected if they have a 
high degree of grade correlation. Graph mining and net- 
work analysis tools visualize the course networks, identify 
course clusters and course cliques, and compute informative 
network statistics. Results are analyzed for pairs of courses 
and courses grouped by academic department or program 
of study. Strong course similarity groupings are observed 
within scientific disciplines, between pre-health courses, and 
within subfields of computer science. No prior study using 
this notion of course similarity has been conducted. 


Keywords 


Educational, Clustering, Correlation, Network graphs 


1. INTRODUCTION 


This paper describes a method for grouping and analyzing 
courses based on similar student performance, where simi- 
larity is measured between pairs of courses using the Pearson 
correlation of the grades assigned to students who take both 
courses. A graph is then formed that represents courses as 
nodes, and has edges between course-pairs when the stu- 
dent grade correlation is above a specified threshold. The 
resulting graph is then analyzed using a variety of graph 
analysis techniques, to provide insights into the relationship 
between individual courses and course groupings. Data pre- 
processing steps are described to handle confounding factors, 
such as differing instructor grading schemes. The method- 
ology is encapsulated in a software tool that was developed 
for this study and is publicly available [5]. This study uti- 
lizes eight years of undergraduate student course grade data 
from Fordham University. The results show that there are 
strong connections between pre-health courses and courses 
within subdisciplines of computer science, and that courses 
that teach specific skills are much more highly connected to 
other courses than introductory survey courses. 
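
The graph construction described above can be sketched in a few lines. The example is illustrative only: the adjacency-set representation, the name `build_course_graph`, and the threshold of 0.5 are assumptions, and the tool [5] may represent the graph differently. The correlation values are taken from Table 2.

```python
def build_course_graph(correlations, threshold):
    """Build an undirected course graph as a dict of adjacency sets.

    correlations: {(course_a, course_b): r}, where r is the Pearson correlation
    of student grades, or None when too few students are shared.
    An edge is added when r is above the threshold.
    """
    graph = {}
    for (a, b), r in correlations.items():
        # Every course becomes a node, even if it ends up with no edges.
        graph.setdefault(a, set())
        graph.setdefault(b, set())
        if r is not None and r > threshold:
            graph[a].add(b)
            graph[b].add(a)
    return graph


correlations = {
    ("Disc Struct", "Disc Lab"): 0.94,
    ("Comp Neuro", "Gen Phys I"): 0.81,
    ("Bioinf", "Gen Phys I"): 0.19,
    ("Disc Struct", "Comp Alg"): 0.37,
}
graph = build_course_graph(correlations, threshold=0.5)
```

With this hypothetical threshold, Discrete Structures is connected only to its lab, and Bioinformatics remains an isolated node.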


Daniel Leeds, Tianyi Zhang and Gary Weiss “Mining Course Group- 
of The 14th International Conference on Educational Data Mining
(EDM21). International Educational Data Mining Society, 804-808. 
The knowledge gleaned from this research can be used to
influence curriculum design and academic policies. For ex- 
ample, if a student performs poorly in the first course within 
a set of highly correlated courses, then they are likely to en- 
counter future difficulty; therefore, they could be asked to 
repeat the course or be offered academic assistance. Re- 
sults from this study have many possible applications, but 
as is the case with descriptive data mining tasks, it may 
take some time to discover some of them. However, we feel 
that the course correlation networks that we generate and 
the various metrics that we introduce are themselves key 
contributions, which will lead to further research in educa- 
tional data mining. This study is unique in that no other 
analysis of university courses is based on a notion of simi- 
larity that relies exclusively on student performance. One 
study, which is superficially similar, measures course similar- 
ity based on student course co-enrollments [7]. That study, 
also conducted by our research group and based on the same 
data set, uses this much more traditional notion of similarity 
to perform similar analyses; namely, course network graphs
are generated and then analyzed using existing network analysis
methods and metrics.


2. DATASET DESCRIPTION 


Eight years of student-course records were obtained from 
three of Fordham University’s undergraduate colleges, where
each record describes the performance of a student in a 
course section. This study restricts the data and analysis 
to: pre-health courses required for medical school admission, 
popular university core curriculum courses, and Computer 
Science and Psychology courses (a detailed analysis would 
not be possible if courses from all 83 majors were included). 
Computer Science and Psychology courses were included due 
to our affiliations with those departments, while core cur- 
riculum courses were chosen because of their prominence in 
our university and their diversity (students complete more 
than twenty core courses covering philosophy, history, for- 
eign languages, performing arts, mathematics, and science). 
Pre-health courses are included because they cover many key 
introductory STEM courses. This study will be expanded 
to other disciplines in the future. 


Table 1 summarizes the data and its distribution across the
course categories. The core courses contribute more than 
half of the total course sections and are largely responsible 
for the data covering 20,797 students. Each record corre- 
sponds to one student in one course section and includes the 
following features: student ID, final grade, department name, 
course number, course title, semester, and section number. 
Table 1: Distribution across Course Categories


Course Category    Records        Sections      Courses
Computer Science   14,137 (13%)   705 (15%)     53 (39%)
Psychology         18,017 (17%)   966 (20%)     67 (50%)
Core               62,005 (58%)   2,706 (56%)   8 (6%)
Pre-Health         13,087 (12%)   434 (9%)      7 (5%)
Total              107,246        4,811         135


The final grade uses a 4 point scale and most courses will 
have many sections. Student privacy concerns prohibit us 
from sharing the raw data, even though the student identifier 
values have been anonymized; however, the course correlation
matrix central to our analysis is available [6].


3. DATA PROCESSING 

An overview of the process for measuring similarity between
courses is provided in Section 3.1, and the individual steps
are described in successive subsections. The code that implements
these steps is publicly available [5].


3.1 Overview 

The initial data set contains records that describe the performance
of each student in each course section. A variety
of preprocessing steps are executed, as summarized in Fig. 1.
A course correlation matrix that measures the similarity of
each pair of courses using the Pearson correlation of student
grades is generated in Step 7. The course network graphs,
modularity clusters, and course cliques are then generated
from this correlation matrix, as described in Section 4.


1. Remove records with missing student grades
2. Remove sections with low standard deviation of grades
3. Normalize grades in each section
4. Merge all sections of each course
5. Determine common students for each course pair
6. Remove course pairs with too few common students
7. Compute paired correlations
8. Remove course pairs below threshold
9. Compute cliques and modularity clusters


Figure 1: Overview of data processing steps


3.2 Initial Data Cleaning (Steps 1 and 2) 

The first step removes records that do not have numerical
grades, such as courses taken pass/fail. Some instructors
assign students very similar grades, which
makes it difficult to assess the similarity of courses based
on grades. For this reason, Step 2 removes course sections
where the standard deviation (σ) of student grades is below
a specified threshold. This requires aggregation of the student
course records to the section level, which yields 4,811
sections. Fig. 2 provides the distribution of standard deviation
values across these sections, and also provides a curve
that shows the number of records and percentages of sections
that are kept for each standard deviation threshold
value (for each value we discard the sections with a lower
standard deviation). Based on Fig. 2 we consider the values of 0.20,
0.30, and 0.40 to be reasonable candidates that maintain the
majority of course sections. We ultimately selected a threshold
of 0.30, which drops 6% of the sections and eliminates 6
courses (which are not left with any sections).


Figure 2: Distribution of grade standard deviation
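
Step 2 can be sketched as a filter over sections. The dictionary layout, the name `drop_low_variance_sections`, and the use of the population standard deviation are assumptions made for this illustration; only the 0.30 threshold comes from the text.

```python
from statistics import pstdev


def drop_low_variance_sections(sections, threshold=0.30):
    """Keep only sections whose grade standard deviation reaches the threshold.

    sections: {section_id: [grade, ...]} with grades on a 4 point scale.
    """
    kept = {}
    for section_id, grades in sections.items():
        # A single grade has no spread, so such sections are dropped as well.
        if len(grades) > 1 and pstdev(grades) >= threshold:
            kept[section_id] = grades
    return kept


sections = {
    "A-01": [4.0, 4.0, 3.7, 4.0],  # nearly uniform grades
    "A-02": [2.0, 3.0, 4.0, 1.0],  # widely spread grades
}
kept = drop_low_variance_sections(sections)
```

In this small example the first section has a standard deviation of about 0.13 and is removed, while the second is retained.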


3.3 Grade Normalization (Step 3)

Instructors may be easy or hard graders, and these differences
will cause problems with grade correlation when a
course is taught by multiple instructors. This issue is remedied
by applying z-score normalization to the grades in each
course section, which subtracts the mean section grade from
each grade and then divides the result by the standard deviation of
the section grades.


3.4 Generate Course-Pair Grades (Step 4-6) 


Step 4 aggregates the data from the section level to the 
course level, which may combine dozens of course sections, 
spanning many years. Step 5 then forms pairs of courses, 
keeping on the grade data from students common to both 
courses. Course pairs are formed from every course that 
remains after application of the o = 0.3 threshold in step 2. 
Step 6 then filters the course pairs that do not have at least 
20 students in common, to ensure that the grade correlation 
is meaningful. This results in the removal of 4,585 (25%) of 
the remaining course pairs. 
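
A small sketch of Steps 5 and 6 follows. It assumes the section-level data have already been merged to the course level as a dictionary from course to {student: normalized grade}, with one grade per student per course; the function name and layout are hypothetical.

```python
from itertools import combinations

def common_student_pairs(course_grades, min_common=20):
    """Pair up courses and keep only the grades of students common to both.

    Returns {(course_a, course_b): [(grade_a, grade_b), ...]} for pairs
    that share at least `min_common` students.
    """
    pairs = {}
    for a, b in combinations(sorted(course_grades), 2):
        shared = sorted(course_grades[a].keys() & course_grades[b].keys())
        if len(shared) >= min_common:
            pairs[(a, b)] = [
                (course_grades[a][s], course_grades[b][s]) for s in shared
            ]
    return pairs

# Toy data; the threshold is lowered to 2 so the example stays small.
course_grades = {
    "A": {1: 0.5, 2: -0.5, 3: 1.0},
    "B": {2: 0.1, 3: 0.9, 4: -1.0},
    "C": {4: 0.0},
}
pairs = common_student_pairs(course_grades, min_common=2)
```

In the toy data only the pair (A, B) survives, because students 2 and 3 took both courses while every other pair shares at most one student.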


3.5 Compute Paired Correlations (Step 7) 
The final preprocessing step computes the Pearson correla- 
tion between the remaining course pairs, which gener- 
ates the correlation matrix that is central to our analysis. 
A small sample of the correlation matrix is provided in Table 2. 
The complete correlation matrix is publicly available. Entries in 
the correlation matrix are not impacted by order, so values above 
the diagonal are omitted. Null values occur when a course pair does 
not have enough common students. In Table 2 we see that, as expected, there is a 
high correlation (0.94) between Discrete Structures and the 
associated lab. There is also a strong correlation (0.81) be- 
tween Computational Neuroscience and General Physics I, 
which may be due to the heavy use of mathematical model- 
ing of physical systems in both classes. Bioinformatics and 
General Physics I exhibit a low correlation (0.19), perhaps 
reflecting a heavier practical programming focus in the bioin- 
formatics course. It is surprising that Discrete Structures 
and Computer Algorithms have a relatively low correlation 
(0.37), since they both require similar mathematical reasoning 
skills. This suggests that Discrete Structures may 
not be preparing students sufficiently for future coursework. 

Table 2: Representative Course-Pair Correlations 

             Disc Struct  Disc Lab  Web Prog  Comp Neuro  Comp Alg  Bioinf  Gen Phys-I 
Disc Struct  1 
Disc Lab     0.94         1 
Web Prog     -            -         1 
Comp Neuro   -            -         -         1 
Comp Alg     0.37         0.33      0.41      -           1 
Bioinf       -            -         0.79      0.47        0.24      1 
Gen Phys-I   -            -         -         0.81        -         0.19    1 
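
The correlation in Step 7 is the standard Pearson coefficient computed over the paired grades of the common students. The sketch below is one way to compute it; the function name and the choice to return None when a course has no grade variance are illustrative assumptions.

```python
from math import sqrt

def pearson(pairs):
    """Pearson correlation of a list of (grade_a, grade_b) tuples.

    Returns None when either course has zero variance, since the
    coefficient is undefined in that case.
    """
    n = len(pairs)
    mean_a = sum(a for a, _ in pairs) / n
    mean_b = sum(b for _, b in pairs) / n
    cov = sum((a - mean_a) * (b - mean_b) for a, b in pairs)
    var_a = sum((a - mean_a) ** 2 for a, _ in pairs)
    var_b = sum((b - mean_b) ** 2 for _, b in pairs)
    if var_a == 0 or var_b == 0:
        return None
    return cov / sqrt(var_a * var_b)

rising = pearson([(1, 2), (2, 4), (3, 6)])
falling = pearson([(1, 3), (2, 2), (3, 1)])
flat = pearson([(1, 1), (1, 2)])
```

Because the coefficient is symmetric in its two arguments, only one triangle of the matrix needs to be stored, which matches the layout of Table 2.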

4. RESULTS 


This section describes the results derived from the course-pair 
correlation matrix. Section 4.1 covers the correlation 
results between individual course pairs, Section 4.2 covers 
the cliques within the course correlation graph, and Section 4.3 
analyzes the course correlation network graphs. 


4.1 Analysis of Course-Correlation Pairs 

The distribution of Pearson course-pair correlations is displayed 
in Fig. 3. The leftmost bar is due to correlations between 
a course and itself. The top 25% of course pairs have a 
correlation greater than 0.5. The course correlation network 
graphs in Section 4.3 are generated using a threshold of 0.5. 

Figure 3: Distribution of course-pair correlations 


Table 3 lists course pairs with correlations > 0.75. The 
top three entries cover matching lecture and lab courses, 
which is unsurprising since they cover complementary material. 
More than 80% of the entries are contained within an 
academic department, although there are interesting 
interdepartmental entries. The link between General Physics I 
and Computational Neuroscience was previously discussed 
and involves mathematical modeling. The link between General 
Chemistry Lab II and Computer Algorithms is not obvious, 
but both involve designing and applying a precise sequence 
of instructions. Philosophy of Human Nature shows 
an interesting connection with Infant and Child Development, 
potentially establishing a link between Philosophy and 
Psychology. The Philosophy class’s link to Scientific Computing 
is more difficult to explain, although it may be related 
to the interdisciplinary nature of Scientific Computing. 


Table 3: High Correlation (p) Course-Pairs 

Course 1               Course 2                 p 
Discrete Struct II     Discrete Struct II Lab   0.96 
Comp Sci II            Comp Sci II Lab          0.95 
Comp Sci I             Comp Sci I Lab           0.93 
Gen Phys I             Comp Neuro               0.81 
Intro Bio I            Intro Bio Lab I          0.79 
Web Program            Bioinformatics           0.79 
Learning               Health Psychology        0.78 
Perception Lab         Law and Psychology       0.78 
Gen Chem Lab II        Comp Algorithms          0.78 
Phil of Human Nature   Infant & Child Devel     0.78 
Phil of Human Nature   Scientific Computing     0.77 
Psych & Human Vals     Research Methods Lab     0.77 
Law and Psych          Clinical Child Psych     0.77 
Biopsych               Sens & Percep Lab        0.76 
Intro Robotics         DataComm & Networks      0.76 


4.2 Clique Results 

A k-clique is a set of k nodes in which every pair of nodes is 
directly connected by an edge. Table 4 shows the number of 
cliques of each size in the course correlation network graph 
for correlation thresholds (p) of 0.5, 0.55, and 0.6. The table 
shows that increasing the correlation threshold even slightly 
dramatically reduces the number of cliques, and hence we 
use 0.5 to retain a clear picture of course network structure. 
Each clique has many sub-cliques (e.g., each 7-clique has 
7 6-cliques and 21 5-cliques), which we view as redundant, 
and hence the table excludes all sub-cliques. Cliques may 
span different course categories or fall entirely within one 
category. Table 5 shows how the cliques from Table 4 are 
distributed across the five course categories using p = 0.5. 
Cliques that do not fall within one category are included in 
the “Span” field. 
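
Counting cliques while excluding sub-cliques amounts to enumerating maximal cliques. The sketch below uses the basic Bron-Kerbosch algorithm for this; the paper does not say which algorithm or tool was used, so this is only one possible approach, and treating the threshold as inclusive (correlation >= 0.5) is likewise an assumption. The last lines check the sub-clique counts quoted above.

```python
from collections import Counter
from math import comb

def build_graph(correlations, threshold=0.5):
    """Adjacency sets with an edge for each pair meeting the threshold."""
    graph = {}
    for (a, b), rho in correlations.items():
        graph.setdefault(a, set())
        graph.setdefault(b, set())
        if rho is not None and rho >= threshold:
            graph[a].add(b)
            graph[b].add(a)
    return graph

def maximal_cliques(graph):
    """Bron-Kerbosch without pivoting; yields each maximal clique once."""
    def expand(clique, candidates, excluded):
        if not candidates and not excluded:
            yield frozenset(clique)
            return
        for v in list(candidates):
            yield from expand(clique | {v}, candidates & graph[v], excluded & graph[v])
            candidates = candidates - {v}
            excluded = excluded | {v}
    yield from expand(set(), set(graph), set())

correlations = {
    ("A", "B"): 0.9, ("A", "C"): 0.8, ("B", "C"): 0.7,
    ("C", "D"): 0.6, ("A", "D"): 0.2,
}
graph = build_graph(correlations)
cliques = set(maximal_cliques(graph))
sizes = Counter(len(c) for c in cliques if len(c) >= 3)

# A 7-clique contains C(7, 6) = 7 six-cliques and C(7, 5) = 21 five-cliques.
sub_counts = (comb(7, 6), comb(7, 5))
```

In the toy graph, {A, B, C} is reported once as a 3-clique, and its three 2-node subsets are not reported separately, which mirrors how Table 4 omits sub-cliques.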


Table 4: Number of Cliques as p Threshold Varies 

Clique Size   p = 0.5   p = 0.55   p = 0.6 
3-cliques     172       66         29 
4-cliques     51        50         4 
5-cliques     56        2          0 
6-cliques     15        0          0 
7-cliques     4         0          0 
8-cliques     1         0          0 


Table 5: Number of Cliques in Each Category 

Clique Size   CS   Psych   Core   Pre-H   Span 
3-cliques     46   9       0      0       117 
4-cliques     11   32      0      0       8 
5-cliques     14   39      0      0       3 
6-cliques     0    15      0      0       0 
7-cliques     0    3       0      1       0 
8-cliques     0    1       0      0       0 


Psychology courses form most of the large cliques of size 6 
and greater. Psychology courses are more tightly grouped 
than Computer Science courses, which have many smaller 
cliques. The 7 pre-health courses form a single clique, 
which suggests that performance in these courses is based 
on similar abilities or knowledge. Core courses lack even 
small 3-cliques. Despite their shared mission of core liberal 
arts training, it appears that differences in subject matter 
prevent similarity in course performance. No large cliques 
span course categories, but when k = 3, spanning cliques 
outnumber the others, which suggests that cliques only 
become meaningful at larger sizes. The largest cliques 
associated with the Computer Science, Psychology, and 
Pre-health courses are described in Table 6 of the appendix. 
Most of those cliques cover related courses (e.g., a 5-clique 
in Computer Science covers programming courses). 

Figure 4: Network graph (all categories). 


4.3 Course Correlation Network Graphs 

The course correlation graphs generated with p = 0.5 were 
supplied to the Gephi social network analysis software [1]. 
Gephi partitions highly connected nodes into modularity 
classes and assigns each a different color [3]. The size of 
each node is determined by ranking the node’s “betweenness 
centrality,” which is based on how often a node appears on 
shortest paths between all nodes in the network [4]. 
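
To make the notion of betweenness centrality concrete, the sketch below implements Brandes' algorithm for an unweighted, undirected graph stored as adjacency sets. It is an illustration only: Gephi's own implementation is what produced the figures, and its scaling or normalization may differ from the raw counts returned here.

```python
from collections import deque

def betweenness_centrality(graph):
    """Raw betweenness: for each node, the sum over all other node pairs
    of the fraction of their shortest paths that pass through it."""
    centrality = {v: 0.0 for v in graph}
    for source in graph:
        # Breadth-first search counting shortest paths from the source.
        order = []
        preds = {v: [] for v in graph}
        paths = {v: 0 for v in graph}
        dist = {v: -1 for v in graph}
        paths[source] = 1
        dist[source] = 0
        queue = deque([source])
        while queue:
            v = queue.popleft()
            order.append(v)
            for w in graph[v]:
                if dist[w] < 0:
                    dist[w] = dist[v] + 1
                    queue.append(w)
                if dist[w] == dist[v] + 1:
                    paths[w] += paths[v]
                    preds[w].append(v)
        # Accumulate path dependencies from the farthest nodes inward.
        delta = {v: 0.0 for v in graph}
        for w in reversed(order):
            for v in preds[w]:
                delta[v] += paths[v] / paths[w] * (1 + delta[w])
            if w != source:
                centrality[w] += delta[w]
    # Each undirected pair was visited from both of its endpoints.
    return {v: c / 2 for v, c in centrality.items()}

path = {"A": {"B"}, "B": {"A", "C"}, "C": {"B"}}
star = {"H": {"L1", "L2", "L3"}, "L1": {"H"}, "L2": {"H"}, "L3": {"H"}}
path_scores = betweenness_centrality(path)
star_scores = betweenness_centrality(star)
```

In the star graph the center lies on the shortest path between every pair of leaves, so it receives the full score while the leaves receive none. This is the sense in which a node with few but diverse connections can still act as a bridge.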


Fig. 4 shows the Gephi network that includes all courses. 
Nodes are labeled with a department abbreviation (“Eng” for 
English and “CS” for Computer Science), and 4-digit course 
number. Course numbers are not informative so our anal- 
ysis refers to courses by title as needed. The figure shows 
a clear partitioning of courses between Computer Science 
(green, right) and Psychology (purple, left), with Pre-health 
courses (dark grey and below Computer Science) clustered 
together and forming a partial bridge between Computer 
Science and Psychology. While individual edges are dif- 
ficult to distinguish, the figure shows that courses within 
a category are much better connected to each other than 
to courses in other categories. The nodes for the first-year core 
curriculum courses English 1102, Theology 1000, and Philosophy 1000 
are very large, indicating high betweenness centrality. 
These courses therefore often occur on the shortest paths 
between other courses and act as bridges between parts of 
the network. While these core courses do not have many 
connections, they connect to a diverse set of courses. 
Philosophy 1000 is connected to well-connected courses from 
Economics, Psychology, and Computer Science, while Theology 1000 
is connected to classes in Psychology, Pre-health 
(Biology), and Philosophy 1000. These core classes appear 
to be an indirect indicator of performance for classes across 
the university. Both classes introduce and carefully study 
selected core concepts in their respective fields. 


Network graphs focusing on Computer Science courses and 
Psychology courses are provided in the appendix in Fig. 5 
and Fig. 6, respectively. The modularity classes in Fig. 5 
correspond to meaningful subdisciplines of Computer Science: 
the light-blue modularity class covers Information Science 
courses like Data Mining (4631); the magenta modularity 
class covers programming courses such as CS1 and 
Lab (1600, 1610), CS2 and Lab (2000, 2010), UNIX programming 
(3130), and Scientific Computing (4750); and the 
orange modularity class covers advanced courses like Algorithms 
(4080), Theory of Computation (4090), and Operating 
Systems (3595). The modularity class groupings differ 
from the cliques in Table 6 of the appendix, although 
both group the same programming courses together. Furthermore, 
the five largest nodes in Fig. 5 based on betweenness 
centrality (4631 Data Mining, 4615 Data Communications, 
8598 Computer Organization, 2200 Data Structures, 
and 83800 Web Programming) are well represented in 
the Computer Science cliques in Table 6. For Computer 
Science, high betweenness centrality reflects an abundance 
of both one-step and few-step connections to other courses. 
Within the department, it is known that a student with a 
poor grade in one of these classes will often struggle in the 
major. Most of these classes are designed to hone specialized 
skills within Computer Science. The key observation 
from the Gephi graph of Psychology courses in Fig. 6 is 
that the Research Methods Lab course is strongly connected 
with other psychology courses, while the introductory survey 
course is very poorly connected. This suggests that classes 
focused on specialized skills are more predictive of performance 
in advanced classes than a general survey class. 


5. CONCLUSION 


This descriptive data mining study defined an innovative 
notion of course similarity based on student performance, 
and then used this similarity metric to form course network 
graphs. These network graphs were then used to analyze the 
relationship between courses and course groupings. This 
methodology was applied to eight years of undergraduate 
student data at a large university. 


The study established that there are many course pairs for 
which student performance is highly correlated. When re- 
quiring at least 20 common students, 25% of course pairs ex- 
ceed the 0.5 correlation threshold used in this study, and 5% 
of pairs exceed 0.7 correlation. Courses with the highest cor- 
relations are often offered by the same department. In addi- 
tion, multi-course clusters naturally occur, especially within 
subdisciplines of an academic department, such as the pro- 
gramming courses within Computer Science. Course clus- 
ters were identified as cliques and modularity classes within 
the course correlation networks. As an extreme example, 
all pre-health courses formed a single clique. A small number 
of courses with high betweenness centrality were shown 
to link a diverse set of topics, within one discipline or 
between disciplines, and those courses connecting disciplines 
were much more likely to introduce specific skills than to 
provide a broad survey of an area. 


This paper also introduced a methodology for generating 
a course grade correlation matrix from student data, and 
included several steps to address confounding factors such 
as differing instructor grading policies. This methodology 
is available to other education researchers through our software 
and associated documentation [5]. Our work presents 
a new way of looking at course relationships through a novel 
measure of similarity. We plan to continue to investigate 
this notion of course similarity and to apply it to a larger 
set of courses. 

APPENDIX 


Table 6 lists the large cliques associated with Computer 
Science, Psychology, and Pre-health courses. Many of the 
cliques have a common theme. Computer Science’s second 
5-clique includes three internet-focused courses: Web Pro- 
gramming, Client Server Computing, and Data Communica- 
tions, while the third clique is dominated by programming 
courses (Operating Systems is an exception but includes pro- 
gramming projects). Psychology’s 7-clique links classes cov- 
ering complementary and overlapping elements of cognition; 
however, the 8-clique appears to span diverse topics. As 
mentioned earlier, the pre-health clique covers core science 
courses required by medical schools. 


Table 6: Large Cliques in Different Categories 

COMPUTER SCIENCE 
5-Clique             5-Clique             5-Clique 
Data Mining          Data Mining          Comp Sci II 
Web Programming      Web Programming      Comp Sci II Lab 
Data Struct.         Data Comm.           Data Struct. 
Client-server Comp   Client-server Comp   Operating Systems 
Comp. Org.           Comp. Org.           Scientific Comput. 

PSYCHOLOGY 
8-Clique and 7-Clique (only these entries are legible): Child Develop., 
Biopsy., Research Methods, Learning, Aging and Society 

The Gephi course correlation network graph for Computer 
Science is displayed in Fig. 5. The contents of Fig. 5 
were described in detail in Section 4.3, which highlighted how the 
different modularity classes correspond to different subdisciplines 
within Computer Science. The Gephi course correlation 
network graph for Psychology, which was only briefly 
described in Section 4.3, is displayed in Fig. 6. Meaningful 
subcategories are much harder to identify, but it is notable 
that Research Methods Lab (2010) is most strongly connected 
with other psychology courses, indicating a valuable 
skill shared across the category. This contrasts with the 
required introductory survey class, Psychology 1200, which 
has a much lower betweenness centrality. This indicates that 
a class focused on specialized skills is more predictive of 
performance in more advanced classes than a general overview 
class. The psychology courses with the largest betweenness 
centrality are all represented in the cliques in Table 6. The 
top four courses based on betweenness centrality are 2010 
Research Methods, 2900 Abnormal Psychology, 2800 Personality, 
and 2700 Child Development, all courses with specialized 
foci. As in Computer Science, high betweenness centrality 
in Psychology reflects an abundance of both one-step 
and few-step connections to other courses. 

Figure 6: Psychology network graph. 


ABSTRACT 


This study uses eight years of undergraduate course enrollment 
data from a major university to form networks of courses based on 
student co-enrollments. The networks are analyzed to identify 
"hub" courses often taken with many other courses. Two notions 
of hubs are considered: one based on raw popularity and another 
on proportional likelihoods of co-enrollment with other courses. 
Network metrics are calculated to describe the course networks. 
Academic departments and high-level academic categories (e.g., 
humanities) are studied for their influence over course groupings. 
The identification of hub courses has practical applications, since 
it can help better predict the impact of changes in course offerings 
and in course popularity, and in the case of interdisciplinary hub 
courses, can be used to increase or decrease interest and enroll- 
ments in specific academic departments and areas. 


Keywords 


Graph mining, network analysis, educational data mining. 


1. INTRODUCTION 


Universities typically offer thousands of different courses across 
dozens of departments. The interrelationships between courses 
that are taken together, especially those in different departments, 
are often not well understood. This paper addresses this deficiency 
by forming course networks, connecting courses often taken by 
the same students. Each course is represented as a node in the 
graph. Several network analyses are pursued. This work also stud- 
ies “hub” courses, defined as network nodes that are connected to 
many other nodes, resulting in a high degree count [2]. This study 
utilizes three popular centrality metrics to identify course hubs 
and compares the results when using each metric. 


Network analyses utilized in this paper have proven useful in 
other domains. Analysis of social networks like Facebook identifies 
hubs corresponding to influencers with an outsized impact on 
other users’ purchasing behaviors [3]. Network analysis metrics 
pursued in the present work have been applied to the World Wide 
Web, particularly for web searches [5, 8, 9]. 


Identifying and analyzing hub courses can provide concrete bene- 
fits. Courses heavily associated with other courses can be used for 
better resource planning, particularly when changes are made in 
the frequency or capacity of such courses. Furthermore, hub 
courses may be adjusted to drive (or diminish) student interest in 
an area or academic discipline. For example, there is a current 
need for more STEM (Science, Technology, Engineering, and 
Math) professionals. If a hub course is well connected to STEM 
courses, promoting this course may lead to increased STEM en- 
rollments—even if the hub course is not a STEM course. 


The course network analyzed in this study is based on eight years 
of undergraduate student course enrollment data from Fordham 
University. An edge connects two courses if the number of stu- 
dents taking both courses is above a threshold. Two types of 
thresholding mechanisms are considered: (1) a static threshold 
that is the same for all pairs of courses and (2) a dynamic thresh- 
old set to link together only courses taken together relatively 
frequently (i.e., relative to their popularity). We find the dynamic 
threshold shifts hub courses from humanities to STEM disci- 
plines. Also, tighter course groupings are found within STEM and 
looser groupings within the humanities and social sciences. An 
extended version of this paper is available [12]. 


2. DATASET DESCRIPTION 


Our study uses course enrollment data to generate a course-pair 
dataset, which is then used to form the course networks analyzed 
in this paper. This course enrollment data contains eight years of 
undergraduate data from Fordham University, where each record 
corresponds to one student in one course section. Student grades 
are also available and used in two of our other studies, one of 
which analyzes the impact of course sequencing on student 
grades [4], and the other that forms course networks based on the 
correlation of grades between courses, and then analyzes the 
networks [6]. This latter study performs a somewhat similar analysis 
to the one provided in this paper, but with a very different notion 
of course similarity/linkage. 


The course-pair dataset aggregates the course enrollment data to 
the course level and then extracts information about each course 
pair. Each course-pair record includes identifying information 
about two courses and the number of students that took each 
course and both courses (not necessarily at the same time). The 
department associated with each course is mapped to one of the 
six major course categories. The course-pair dataset contains 
78,173 records, which are formed from 1,763 distinct courses. 
The dataset does not contain all possible pairings because pairs 
with fewer than 20 common students are excluded. The course- 
pair dataset, and the network metrics provided later, are generated 
from the course enrollment data using a publicly available Py- 
thon-based software tool developed by our research group [10]. 


3. NETWORK ANALYSIS METRICS 


The course-pair dataset is used to form course networks by view- 
ing each course as a node and connecting nodes that have a suffi- 
cient number of common students. Table 1 provides the network 
analysis metrics used in this paper. The first three, density, diameter, 
and average clustering coefficient (ACC) [1], are computed 
using an entire network or subnetwork. Our data show that subnetworks 
of courses within single departments have a higher density, 
smaller diameter, and higher average clustering coefficient than 
the network based on all undergraduate courses, because courses 
within a discipline are more tightly connected (see Table 2). The 
last three metrics are defined for each node in the network and can 
be used to help identify hubs. These metrics consist of three cen- 
trality measures: degree centrality, eigenvector centrality [11], and 
betweenness centrality [7]. Each measure can be used to identify a 
different type of hub course. 


Table 1. Summary of network analysis metrics 

Metric                        Summary Description                                      Range 
Density                       Fraction of possible edges present.                      0-1 
Diameter                      Maximum distance between any pair of nodes in network.   Z+ 
Ave. Clustering Coefficient   Fraction of pairs of neighbor nodes that are connected   0-1 
                              to each other. 
Degree Centrality             Number of edges to node (degree).                        Z+ 
Eigenvector Centrality        Based on centrality of node’s neighbors.                 ≥0 
Betweenness Centrality        Measure of all shortest paths passing through node.      ≥0 
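
The three whole-network metrics, along with degree centrality, can be computed directly from an adjacency-set representation. The sketch below is illustrative and is not the authors' tool [10]; the function names are hypothetical, the diameter function assumes a connected graph, and nodes with fewer than two neighbors are assumed to contribute a clustering coefficient of 0.

```python
from collections import deque
from itertools import combinations

def density(graph):
    """Fraction of possible edges that are present."""
    n = len(graph)
    edges = sum(len(nbrs) for nbrs in graph.values()) / 2
    return edges / (n * (n - 1) / 2) if n > 1 else 0.0

def diameter(graph):
    """Longest shortest-path length (assumes a connected graph)."""
    longest = 0
    for source in graph:
        dist = {source: 0}
        queue = deque([source])
        while queue:
            v = queue.popleft()
            for w in graph[v]:
                if w not in dist:
                    dist[w] = dist[v] + 1
                    queue.append(w)
        longest = max(longest, max(dist.values()))
    return longest

def average_clustering(graph):
    """Mean, over nodes, of the fraction of neighbor pairs that are linked."""
    total = 0.0
    for nbrs in graph.values():
        k = len(nbrs)
        if k < 2:
            continue  # coefficient taken as 0 for such nodes
        linked = sum(1 for a, b in combinations(nbrs, 2) if b in graph[a])
        total += linked / (k * (k - 1) / 2)
    return total / len(graph)

def degree_centrality(graph):
    """Degree count of each node."""
    return {v: len(nbrs) for v, nbrs in graph.items()}

# Triangle A-B-C with a pendant node D attached to C.
graph = {"A": {"B", "C"}, "B": {"A", "C"}, "C": {"A", "B", "D"}, "D": {"C"}}
metrics = (density(graph), diameter(graph), average_clustering(graph))
degrees = degree_centrality(graph)
```

For the toy graph, 4 of the 6 possible edges exist, the farthest pair (A or B to D) is two steps apart, and only node C has neighbors that are not all connected to each other.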


4. EDGE INCLUSION METHODOLOGY 


To form a course network, each course is represented by a node, 
and an edge is added between two nodes if the courses, across all 
sections, have enough common students. Static and dynamic 
thresholds specify a minimum number of common students. 


The static threshold is based on the number of common students 
between two courses, independent of how many students take 
each course. The distribution of common students by course pair 
is provided in Figure 1 in the appendix. Most course-pairs have 
very few common students, since few students take upper-level 
courses in disparate disciplines. A threshold of 20 students 
maintains 11% of all course-pairs with at least one student in 
common, and this is the static threshold utilized in this study. The 
static threshold is heavily biased towards popular courses, taken 
very frequently, even if only a few students in the popular course 
take specific other courses. 


We also define a dynamic threshold relying primarily on the co- 
occurrence rate of courses. The dynamic threshold is determined 
by multiplying the co-occurrence threshold rate k by the number 
of students in the larger course within each course-pair. To ensure 
a minimum number of common students, a static threshold of 20 
students is used as the floor for the dynamic threshold. The dynamic 
threshold, d-thresh, associated with two courses, C1 and C2, 
is provided in Equation 1, where Cx.students represents the number 
of students who have taken class Cx. 


d-thresh(C1, C2) = max(20, k × max(C1.students, C2.students))    (1) 


The dynamic threshold is heavily dependent on the co-occurrence 
rate k, defined as the number of common students divided by the 
number of students in the larger course. The co-occurrence rate 
distribution is displayed in Figure 2 of the appendix, which shows 
that a co-occurrence rate threshold k = 0.017 discards 39% of the 
edges that satisfy the static threshold. This threshold is used because 
it leads to the most stable centrality measures while excluding 
the fewest edges. Table 5 in the appendix shows 
how this dynamic threshold impacts an Art History course. 
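
Equation 1 translates directly into code. In the sketch below the function and parameter names are hypothetical, and an edge is assumed to be kept when the number of common students is at least the threshold, which matches the statement that pairs with fewer than 20 common students are excluded.

```python
def d_thresh(students_c1, students_c2, k=0.017, floor=20):
    """Dynamic threshold from Equation 1."""
    return max(floor, k * max(students_c1, students_c2))

def has_edge(common, students_c1, students_c2, dynamic=True, k=0.017, floor=20):
    """Decide whether two courses are linked under either threshold."""
    threshold = d_thresh(students_c1, students_c2, k, floor) if dynamic else floor
    return common >= threshold

# A very popular course (3000 students) paired with a small one (100 students).
popular_threshold = d_thresh(3000, 100)   # roughly 51 common students required
small_threshold = d_thresh(200, 150)      # the floor of 20 applies
static_edge = has_edge(30, 3000, 100, dynamic=False)
dynamic_edge = has_edge(30, 3000, 100)
```

With 30 common students, the pair is linked under the static threshold but not under the dynamic one, which illustrates how the dynamic threshold removes edges that exist only because one course is very popular.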


5. RESULTS 


This section analyzes course networks using the metrics presented 
in Table 1 and through the identification of hub courses. Static 
and dynamic thresholds are considered. Hub results are analyzed 
within academic departments and broader course categories. This 
study utilizes six course categories: Arts, Communication and 
Media Studies, Humanities, Modern Languages, Social Sciences, 
and STEM. The mapping from academic department to course 
category is partially provided in Table 2. 


5.1 Network Metric Results 


Table 2 presents the values of the previously defined network 
metrics for the course network and subnetworks at the department 
and category levels. Course categories are denoted in bold, with 
two selected member departments listed below each 
(see [12] for the full table). Each category-level value is the 
median across the member departments. The first row of 
data provides the values over all courses in the course network. 
The color of the cells reflects the magnitude of the cell value, with 
red (green) used for the highest (lowest) values. The colors for the 
departments and categories are determined independently. 


The network covering all courses has a high diameter and low 
density compared to the subnetworks, since it includes many di- 
verse courses that are loosely connected. Courses associated with 
a specific department are typically associated with a major; stu- 
dents within the major will take many of these courses. The dy- 
namic threshold decreases the density, average clustering coeffi- 
cient, and number of edges, while increasing the diameter. 


Study of departmental subnetworks shows dynamic thresholding 
most dramatically decreases edges for Philosophy (52% decrease), 
English (44% decrease), and Theology (35% decrease), which are 
fields of study that include many core curriculum courses. This 
drop is mirrored by ACC. Conversely, the diameter maintains 
similar values for most departments, regardless of threshold. 
Overall, dynamic thresholding has a substantial impact on density 
and ACC of Humanities and Social Science courses, and only 
minimal impact on other categories, likely reflecting the core 
curriculum’s emphasis on humanities and social science courses. 


The STEM courses have much higher density and form much 
more dense clusters (based on ACC) than humanities courses, for 
both thresholds. This indicates that humanities students are less 
likely to take the same group of courses in their discipline. In our 
university, humanities majors have fewer required courses than 
STEM majors. Humanities departments have the highest number 
of nodes (distinct courses taken), closely followed by Social Sci- 
ence, suggesting that those disciplines allow more flexibility in 
course choices. The Modern Languages category also has a rela- 
tively high density and ACC. Language courses, like science 
courses, typically rely on prerequisite course requirements for 
proper student preparation. 


5.2 Hub Analysis 


Hubs play a special role in network structures and are important 
for understanding and utilizing the information in 
course co-enrollment networks. Table 3 identifies the top hubs 
using the median of the ranks of the three centrality metrics, 
the “Combined Rank”. The top half of the table provides the top-7 
hubs when using the static threshold, while the bottom half provides 
the top-7 for the dynamic threshold. Note that the best combined 
rank when using the dynamic threshold is 3; no course 
consistently ranks above third on all the centrality metrics. While 
only the combined rank for the static (dynamic) threshold is used 
to select the entries in the top (bottom) half of the table, both 
combined ranks are provided to help compare differences between 
the thresholding mechanisms. Courses exhibit very different ranks 
for the two thresholds. 


Table 2. Summary course network statistics based on category and selected departments 

                       Static Threshold                Dynamic Threshold 
Category/Department    Edges   Density  Diam.  ACC     Edges   Density  Diam.  ACC 
ALL                    39968   0.03     4              24323   0.02     6 

[Values for the remaining rows, and the ACC values for ALL, are not recoverable. The rows are:] 
Arts: Dance, Music 
Comm and Media Studies: Comm and Media Studies, New Media & Digital Design 
Humanities: African & African Amer Studies, English 
Modern Languages: Greek, Spanish 
STEM: Biological Sciences, Physics 
Social Science: Economics, Sociology 


The first few entries for the static threshold in Table 3 vary only 
slightly depending on which of the three centrality metrics is used. 
The first four entries cover core curriculum requirements that can 
only be satisfied by a single course. Most of the remaining top 
hub courses also satisfy a core requirement, but one that can be 
satisfied by several courses. The very few STEM courses listed are 
introductory and satisfy a core requirement (e.g., Finite Mathematics). Thus, 
we see that hubs identified using the static threshold are based on 
raw popularity. Most courses identified using the dynamic threshold 
also satisfy a core requirement, but often many courses can 
satisfy the requirement. There are no courses that appear in the 
top-7 lists for both thresholds. For static threshold hubs, most 
connections to other courses may be incidental, due to so many 
students taking the popular course. 


Table 3. Top-7 static and dynamic course hubs 

                            Combined Rank    Centrality Rank 
Courses                     Static   Dyn.    Deg.   Btw.   Eig. 
Static Threshold: Top Hubs 
Philosophical Ethics        1        45      1      1      2 
Faith & Critical Reason     2        76      2      2      1 
Philos. of Human Nature     3        715     3      3      3 
Composition II              4        78      4      5      4 
Banned Books                5        49      5      4      5 
Finite Mathematics          7        56      6      7      5 
Spanish Lang and Lit        7        ?       ?      ?      ? 
Dynamic Threshold: Top Hubs 
Biopsychology               31       3       3      20     3 
Phys. Sci.: Today's World   30       4       4      2      63 
Latin American History      44       ?       2      6      ? 
Intro World Art History     22       5       ?      26     2 
Intro Phys. Anthropol.      41       6       1      9      6 
Intro Cultural Anthropol.   18       6       6      33     1 
Films of Moral Struggle     55       8       8      7      54 


Table 3 allows further comparisons among the centrality metrics. 
When using the static threshold, course ranks are quite consistent 
across all the centrality metrics. This ensures that the combined 
rank is also highly correlated with each of the individual metrics, 
and that the degree centrality is usually equal to the combined 
rank. This correlation is weaker under the dynamic 
threshold, where degree centrality sometimes differs substantially from 
the combined dynamic rank; for example, Calculus II has degree rank 7 and 
combined rank 19. Nonetheless, degree centrality is still generally 
close to combined rank and is identical in 5 of the first 7 cases. 
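
The combined rank described above can be sketched as the median of a course's three per-metric ranks. The helper names and the tie-breaking rule (ties keep input order) are assumptions; the paper does not describe how ties are handled or whether the medians are themselves re-ranked.

```python
from statistics import median

def rank_by(scores):
    """Rank courses from 1 (highest score) downward; ties keep input order."""
    ordered = sorted(scores, key=scores.get, reverse=True)
    return {course: position for position, course in enumerate(ordered, start=1)}

def combined_rank(degree, betweenness, eigenvector):
    """Median of the degree, betweenness, and eigenvector ranks per course."""
    rankings = [rank_by(degree), rank_by(betweenness), rank_by(eigenvector)]
    return {course: median(r[course] for r in rankings) for course in degree}

# Toy centrality scores for three hypothetical courses.
degree = {"A": 300, "B": 250, "C": 100}
betweenness = {"A": 0.1, "B": 0.4, "C": 0.2}
eigenvector = {"A": 0.9, "B": 0.5, "C": 0.7}
combined = combined_rank(degree, betweenness, eigenvector)

# Legible rows of Table 3, given as (Deg., Btw., Eig.) ranks.
biopsychology = median([3, 20, 3])
physical_science = median([4, 2, 63])
```

The median makes the combined rank robust to a single outlying metric: Biopsychology ranks 20th on betweenness but keeps a combined rank of 3 because its other two ranks agree.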


We focus on degree centrality as our metric for identifying hubs 
under both thresholds. This is attractive since degree centrality is 
the simplest and most common metric for identifying hubs. We 
utilize a degree count threshold of 200 to identify hub courses. 
This retains all entries in Table 3, which have a degree count of at 
least 245 [12] (the underlying data ensure that a degree count of 
200 will retain the top fifty courses associated with each metric). 


Table 4 shows the distribution of hub edges between the six 
course categories using a degree centrality threshold of 200, help- 
ing to consider connections across categories. The table displays 
the percentage of total hub edges from one category (row) to both 
hub and non-hub courses in another category (column), for each 
threshold. The percentage of total edges, as well as the actual 
number of edges, associated with each category (row), are also 
provided. A color scale is applied to the rows to highlight where 
the hub connections are directed (red is high percentage and green 
low percentage). For example, the first row indicates that, using a 
static threshold, 5% of all Arts hub courses are connected to other 
Arts courses and 14% are connected to Communication courses. 
Furthermore, Arts courses have 1,520 edges, comprising 5% of all 
edges in the course network. 
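
The row percentages described above can be sketched as follows. This is an illustration under stated assumptions rather than the procedure used to build Table 4: the function name, the toy data, and the choice to count a hub-to-hub edge once from each hub endpoint are all hypothetical.

```python
from collections import Counter, defaultdict


def hub_edge_distribution(edges, hubs, category):
    """Percentage of each category's hub edges that lead to each other category."""
    counts = defaultdict(Counter)
    for a, b in edges:
        # Count the edge from every endpoint that is a hub (assumption:
        # an edge between two hubs contributes once in each direction).
        for src, dst in ((a, b), (b, a)):
            if src in hubs:
                counts[category[src]][category[dst]] += 1
    table = {}
    for row, row_counts in counts.items():
        total = sum(row_counts.values())
        table[row] = {col: 100.0 * n / total for col, n in row_counts.items()}
    return table


# Hypothetical toy data: course "A" is the only hub.
toy_edges = [("A", "B"), ("A", "C"), ("A", "D"), ("B", "C"), ("D", "E")]
toy_category = {"A": "STEM", "B": "STEM", "C": "Hum", "D": "Hum", "E": "Arts"}
table = hub_edge_distribution(toy_edges, {"A"}, toy_category)
for row, cols in table.items():
    print(row, {col: round(pct, 1) for col, pct in cols.items()})
```

Each row sums to 100%, which matches the reading of Table 4 in which percentages are taken over a category's own hub edges.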


Table 4 shows that for both thresholds, Humanities, STEM, and
Social Sciences have the most hub edges, while Arts, Communications,
and Modern Language have many fewer. Notably, the
static threshold associates more edges with humanities courses
than STEM courses (35% to 27%), whereas the dynamic threshold
reverses this trend (19% humanities to 39% STEM). Most core
curriculum requirements are associated with humanities and the
dynamic threshold has an outsized impact, removing courses that
are hubs simply due to their popularity.


Table 4. Percent distribution of hub edge linkage by course category (hubs with degree >200) with edge info

[Table body not recoverable from the source. Rows: Arts, Comm, Hum, Lang, STEM, SocSci. Columns, for each of the static and dynamic thresholds: Arts, Comm, Hum, Lang, STEM, SocSci, #Edges, %Edges.]


It is especially notable that more edges link humanities to STEM 
courses than to other humanities courses. Examining the under- 
lying data, we find that the humanities courses Introduction to 
Cultural Anthropology, Introduction to Physical Anthropology, 
and Introduction to Art History all connect to STEM hub cours- 
es. Similarly, most connections for courses in the Anthropology 
and Art History departments go towards the Biological Sciences 
and Natural Science departments. While Introduction to Physi- 
cal Anthropology is part of the Natural Science major require- 
ment, it also satisfies a science core curriculum requirement for 
non-Science majors. It is interesting to observe this course’s 
popularity with Science students. The course is a general survey 
of the biological focus of Anthropology. 


Also notable is that Communications and Social Sciences have 
more links to themselves than to any other category, for both 
static and dynamic thresholds, even though these categories do 
not have as many total links as other categories. The Languages 
category has mostly internal links, and an intermediate number
of edges overall for both thresholds. 


Social Science hubs in Table 4 have a significant number of 
connections to STEM courses, commensurate with connections 
back towards Social Science. Most of the connections to STEM 
refer to courses in Biological Sciences, particularly from the 
Psychology course Foundations of Psychology. This course is a 
requirement for the Psychology major but is not part of the core 
curriculum. This course also has a significant number of connec- 
tions with the Natural Science department. Overall, the number 
of connections from non-STEM to STEM courses when using 
the dynamic threshold is a bit of a surprise. Conversely, STEM 
hubs made many connections to the Social Science category in 
Table 4; these connections are largely directed towards the Eco- 
nomics department, which requires a strong mathematical base. 


6. CONCLUSIONS 


This study analyzed course network graphs using eight years of 
undergraduate course-grade data from Fordham University. 
General network statistics and course hub statistics were gener- 
ated using a publicly available Python-based tool created by our 
research group [10]. Network structure and hub identity are 
strongly influenced by the definition of edges between courses, 
and whether static or dynamic thresholds were applied to course
co-enrollments. We gain important insights on relations among 
courses, departments, and categories, and on metrics naturally 
applied to characterize these relations. 


All three common network centrality metrics (degree centrality,
betweenness centrality, and eigenvector centrality) identify a
similar set of hub courses using static thresholding to define
edges. However, the metrics behave much less similarly when
dynamic thresholding is used, requiring careful consideration in
future analyses. Nonetheless, degree centrality yields a reasonable
approximation of the other two metrics for both thresholds,
favoring its future use to study course co-enrollment networks.


The static and dynamic thresholds yield very different course 
networks and hubs. Static thresholds place more emphasis on 
course popularity, highlighting courses that uniquely satisfy a 
core requirement. The dynamic threshold reduces, but does not 
eliminate, popularity bias. Due to the many mandatory humani- 
ties core courses, and the variety of core options in STEM, the 
dynamic threshold substantially shifted apparent hub focus from 
Humanities to STEM. Future analyses of course relations and 
discipline relations must continue to carefully weigh the influ- 
ence of popularity or the mandatory nature of courses. For both 
thresholds, STEM courses have the highest density and form 
tightly connected clusters, while humanities courses have the 
opposite behavior; this is likely due to the more extensive use of 
prerequisites in STEM disciplines in our university. 


Our analysis also identified large numbers of edges between the 
different course categories. Edge distributions shifted between 
thresholds, favoring humanities for the static threshold and 
STEM for the dynamic threshold. Study of courses forming 
individual edges provided additional insights. The strong con- 
nection between humanities and STEM courses was driven by 
humanities courses like Introduction to Physical Anthropology, 
which has a strong STEM component; the connection between 
social sciences and STEM was driven by courses like Founda- 
tions of Psychology which is linked to STEM courses in Biology 
(Psychology students must take several biology courses). 


This study provides a better understanding of course co- 
enrollment patterns, suggesting directions for valuable practical 
applications. Strong models of co-enrollment patterns can help 
with course planning and ensuring that enough course sections are
offered. Our course networks reveal valuable details and quanti- 
tative relationships among courses. This work is a foundational 
step in better understanding course co-enrollments. 


There are many ways in which this work can be extended and 
improved. The dynamic threshold could incorporate underlying 
probabilities of each course being taken, so courses are linked 
only where their co-occurrence is much more likely than chance. 
We also can consider additional methods for clustering courses. 
Future analyses may extend to course ordering information. It 
may be useful to reduce the influence of popular departments in 
repeated analysis of category-level network patterns. More fun- 
damentally, our present results may be validated by partitioning 
the underlying student enrollment records into distinct subsets, 
to create training and testing data for our network models.


APPENDIX


Figure 1 shows the distribution of common students by course 
pair (each bin covers a range of common students). The orange 
curve is a cumulative curve that corresponds to the y-axis values 
listed to the right (varying between 0% and 80%) and represents 
the percentage of course-pairs that are maintained for each 
common student threshold value (e.g., a threshold of 20 main- 
tains 11% of all course-pairs with at least one student). 


[Figure: histogram of the number of course-pairs (left axis, 0 to 250000) by number of common students (bins from 2 to 14000), with a cumulative curve of % maintained course pairs (right axis, 0% to 80%)]
Figure 1. Distribution of common students by course-pair


The dynamic threshold is heavily dependent on the co- 
occurrence rate k. To help set this value appropriately, Figure 2 
shows the distribution of course-pairs for each co-occurrence 
rate, for the course pairs that satisfy the static threshold of 20. 
The co-occurrence rate is the number of common students di- 
vided by the number of students in the course with more stu- 
dents. The co-occurrence rate distribution is heavily skewed to 
the smaller values, just as the number of common students was 
skewed to the smaller values in Figure 1. The bar at the far right 
at x=1.0 is associated with course pairs with the same course in 
both positions and should be ignored. After some experimenta- 
tion we decided on a co-occurrence rate threshold k= 0.017, 
which is the value that leads to the most stable centrality 
measures while excluding the fewest number of edges. The or- 
ange curve, which shows the fraction of edges discarded, indi- 
cates that this value of k discards 39% of the edges that satisfy 
the static threshold. 


[Figure: histogram of the number of course-pairs by co-occurrence rate, with a cumulative curve]
Figure 2. Co-Occurrence Rate Distribution


To illustrate the dynamic threshold, we apply it to the course Art 
History Seminar, which has 123 students. There are 22 courses 
that share at least 20 students in common with this course, satis- 
fying the static threshold. However, 9 courses have fewer com- 
mon students than the computed dynamic threshold, and hence 
are pruned. Half of these 22 courses are displayed in Table 5, 
and five of these, denoted in bold, are pruned since the number 
of common students is less than the dynamic threshold. As an- 
ticipated, the courses affected by the dynamic threshold have a 
large number of students (third column). In this example, every 
course that satisfies the static threshold, but is pruned by the 
dynamic threshold, fulfills a core curriculum requirement. 
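
The pruning rule illustrated by this example can be sketched in a few lines. This is a sketch under an assumption that is not spelled out in this appendix: the dynamic threshold is taken to be the larger of the static threshold (20) and k times the enrollment of the larger course, which agrees with most rows of Table 5 up to rounding. The function names are hypothetical; the enrollment figures are taken from Table 5.

```python
def co_occurrence_rate(common, n_course1, n_course2):
    """Common students divided by the enrollment of the larger course."""
    return common / max(n_course1, n_course2)


def dynamic_threshold(n_course1, n_course2, k=0.017, static_threshold=20):
    """Assumed form: never below the static threshold, otherwise k * larger enrollment."""
    return max(static_threshold, k * max(n_course1, n_course2))


def keeps_edge(common, n_course1, n_course2, k=0.017, static_threshold=20):
    """True if the course pair survives both the static and dynamic thresholds."""
    return common >= dynamic_threshold(n_course1, n_course2, k, static_threshold)


ART_HISTORY_SEMINAR = 123  # students in the seminar

# (course, common students, students in that course), a subset of Table 5
rows = [
    ("Ancient American Art", 21, 34),
    ("20th Century Art", 43, 130),
    ("Finite Math", 42, 4976),
    ("Composition II", 58, 12446),
]
for name, common, enrolled in rows:
    status = "kept" if keeps_edge(common, ART_HISTORY_SEMINAR, enrolled) else "pruned"
    print(f"{name}: {status}")
```

Under this reading the two small art courses are kept, while the two large core courses are pruned because their enrollment pushes the threshold well above the number of shared students.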


Table 5. Dynamic threshold for Art History seminar course


Course2                    Common Students   Students Course2   Dynamic Threshold
Intro Cultural Anthro.     23                2514               43
Ancient American Art       21                34                 20
17th Century Art           22                47                 20
20th Century Art           43                130                20
Age of Cathedrals          20                39                 20
Aztec Art                  22                61                 20
Composition II             58                12446              211
Intermediate French II     20                1329               23
Finite Math                42                4976               85
Philosophical Ethics       58                11218              191
Faith & Critical Reason    56                13317              226


ABSTRACT


This study leverages natural language processing to assess 
dimensions of language and discourse in students’ discussion board 
posts and comments within an online learning platform, Math 
Nation. This study focusses on 1,035 students whose aggregated 
posts included more than 100 words. Students’ wall post discourse 
was assessed using two linguistic tools, Coh-Metrix and SEANCE, 
which report linguistic indices related to language sophistication, 
cohesion, and sentiment. A linear model including prior math 
scores (i.e., Mathematics Florida Standards Assessments), grade 
level, semantic overlap (i.e., LSA givenness), incidence of
pronouns, and noun hypernymy accounted for 64.48% of the 
variance for the Algebra I end of course scores (RMSE=13.73). 
Students with stronger course outcomes used more sophisticated 
language, across a wider range of topics, and with less personalized 
language. Overall, this study confirms the contributions of 
language and communication skills over and above prior math 
abilities to performance in mathematics courses such as Algebra. 


Keywords 


Student performance, performance prediction, discussion posts, 
linguistic features 


1. INTRODUCTION 


Discussion boards have emerged to be among the most beneficial 
features of online learning platforms. Some of the positive 
outcomes obtained include greater student involvement and 
improved academic performance [1-5]. Discussion boards have 
been implemented to achieve a number of educational goals, 
namely, to supplement course resources, evoke creativity and 
motivation, facilitate interaction between teachers and learners, and 
for class management or administrative purposes [6-9]. Student 
engagement and collaboration within discussion boards are critical 
towards their success. Indeed, students’ language used within these 
discussion boards has been linked to positive learning outcomes 
[10-12]. This creates a pressing need to further understand the
language used by students when collaborating with each other or
engaging with their teachers within informal online academic 
settings. 


1.1 Language and Math Success 

While empirical evidence shows mixed results in the correlations
between language proficiency and academic success (i.e., some
studies found significant correlations and some none), proponents have
argued that language proficiency and, more importantly,
communicative competence significantly influence success in math
[13]. The dimensions of language that have been found to
specifically influence math achievement include linguistic 
complexity, language control, and vocabulary usage [14]. 


A number of studies have demonstrated links between language 
and performance in math [10, 11, 15]. There are strong links 
between language skills and the ability to engage with math 
concepts and problems. For instance, success in math is partially 
based on the development of language that affords children the 
ability to participate in math instruction in the classroom as well as 
“engage quantitatively with the world outside the classroom” [16]. 
Similarly, strong math skills are presumed to interact with language 
ability to understand numbers and symbols [17]. Linguistic skills 
may be one of the key factors that relate to math ability. For 
instance, Cummins identified language difficulties in second 
language speakers as a key obstacle in solving math problems [18]. 
Articulating and representing cognitive processes in math domains 
is especially challenging for students with lower literacy skills. 
Successfully solving verbal analogies and mathematical word 
problems, in particular, demand certain levels of linguistic fluency 
and reading comprehension skills, which can be barriers to success. 


More specific to discourse in online discussion boards, substantial 
work has been done to characterize the language used within online 
discussion forums [19-24]. This research indicates that linguistic 
features distinguish subject matter experts from nonexperts and are 
predictive of student learning outcomes. In social questioning and 
answering sites (e.g., Quora), linguistic features such as word 
usage, average number of words, subjectivity of words, and word 
complexity have been found to be markers of expertise [19]. 
Discourse analyses conducted on online discussion boards show 
that linguistic characteristics are predictive of student learning 
performance [20]. To name a few, the complexity of syntactic 
structures, cohesion, emotion words, modal verbs, and words that 
provide additional information or make claims when elaborating 
are significant predictors of students’ performance [20-24].


1.2 Current Study

This study examines students’ discussion wall posts for an entire 
academic year within an online Algebra tutoring platform, Math 
Nation, developed by the University of Florida Lastinger Center for 
Learning [25-29]. Math Nation is an interactive and comprehensive 
math teaching and learning platform that provides video tutorials 
and online resources aligned to the Mathematics Florida Standards 
(MAFS). Most relevant to the current study, Math Nation also 
features an online discussion forum called Algebra Wall where 
students can collaborate with other students, teachers, and study 
experts. Wall posts (see Figure 1 for a sample discussion thread)
from 3,277 students, Math Nation study experts, and Algebra 
teachers were collected for the period August 1, 2018 to July 31, 
2019, including comments within more than 14,000 threads. 


Our objective in this study was to further examine the extent to
which the linguistic features of these posts were predictive of
End-of-Course (EOC) Algebra performance, over and above students’
Math scores from the previous year. Providing information
concerning students’ potential EOC performance is important 
because it has the potential to augment stealth assessment of 
students’ abilities such that the instructor or the tutoring system can 
intervene and provide scaffolding when necessary. 


[Opening line of the first post is illegible in the source]
xsquared+6x=1
What video should I watch? I was given the equation, and told to complete the
square.

Student 2: So half of the 6 is 3, this means add 9 to each side. Then do you know what to
do from there?

Student 1: Never mind, I found it.

Student 3: Section 5 topic 7 will help:)

[Speaker and opening words illegible in the source] to each side of the equation.


Figure 1. Sample Discussion Thread 


2. METHODS 


2.1 Participants 

The participants included 3,277 Algebra students from the different
Florida school districts in grade levels 7, 8, and 9 who participated
in the Math Nation discussion board for the academic year August
1, 2018 to July 31, 2019. The majority of these students were white
(n = 2,464, 75%). This study focusses on 1,035 students in this
larger sample whose aggregated posts included more than 100
words because NLP indices are not reliable with small language
samples, and many of our indices (e.g., lexical diversity) require a
minimum of 100 words [32]. Those who included more words in
their posts had significantly higher FSA scores, t(3275) = 5.79,
p < .001 (M≤100 words = 354.08, SD = 16.83; M>100 words = 357.75, SD
= 16.89), and EoC scores, t(3275) = 8.12, p < .001 (M≤100 words =
522.66, SD = 22.99; M>100 words = 529.67, SD = 22.93). As such,
the number of words in posts is a strong indicator of future and current
math performance; the purpose of this study is to examine language
beyond number of words.
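
The reported t statistics can be checked approximately from the summary statistics alone. The sketch below assumes a pooled-variance (Student's) two-sample t-test, which is consistent with the reported 3,275 degrees of freedom, and assumes that the group with 100 words or fewer contains the remaining 3,277 - 1,035 = 2,242 students; neither assumption is stated explicitly in the text.

```python
import math


def pooled_t(mean1, sd1, n1, mean2, sd2, n2):
    """Two-sample t statistic with pooled variance, computed from summary statistics."""
    df = n1 + n2 - 2
    pooled_var = ((n1 - 1) * sd1 ** 2 + (n2 - 1) * sd2 ** 2) / df
    standard_error = math.sqrt(pooled_var * (1 / n1 + 1 / n2))
    return (mean1 - mean2) / standard_error, df


N_MORE = 1035          # students with more than 100 words
N_FEWER = 3277 - 1035  # assumed size of the remaining group

t_fsa, df = pooled_t(357.75, 16.89, N_MORE, 354.08, 16.83, N_FEWER)
t_eoc, _ = pooled_t(529.67, 22.93, N_MORE, 522.66, 22.99, N_FEWER)
print(f"FSA: t({df}) = {t_fsa:.2f}")
print(f"EoC: t({df}) = {t_eoc:.2f}")
```

Both statistics come out close to the reported values; small differences are expected because the means and standard deviations are rounded.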


2.1.1 Prior Math and Algebra I EoC Scores 

Students’ mathematics performance was measured using Algebra I 
End-of-Course (EoC) assessment (M=524.88; SD=23.20; Range = 
425-575), which is a high-stakes exam required by the Florida 
Department of Education [30] for high school graduation. 
Mathematics Florida Standards Assessments (FSA) scores 
(M=355.24; SD=16.93; Range = 269-393), from the previous year 
were included as proxy baseline scores indicative of Math 
preparedness [31]. The FSA math scores are often used as a
baseline measure of Algebra skills or preparedness because they are
strongly related to the students’ Algebra I EoC scores. Indeed, the 
relation between these two tests was strong in the current study 
(r=0.76, p<0.01). Controlling for gender, grade level and district 
did not result in significant variations in the correlation between the 
Math FSA score and the Algebra I score (i.e., r= .73 - .76). Table 
1 shows the mean scores for both the FSA Math and Algebra I 
exams as a function of grade level and gender. 


Table 1. Algebra performance


                      Math (FSA) Score
                      from Previous Year   Algebra I Score
                      (Mean / SD)          (Mean / SD)
Grade 7 (n = 440)     362.89 (15.74)       539.31 (19.02)
Grade 8 (n = 520)     355.39 (16.04)       525.58 (20.75)
Grade 9 (n = 75)      343.92 (17.95)       501.49 (26.49)
Male (n = 429)        359.92 (16.79)       531.55 (23.25)
Female (n = 606)      356.21 (16.80)       528.33 (22.62)


2.2 Natural Language Processing Tools 

We assessed students’ Math Nation Wall discourse using two 
linguistic tools, namely Coh-Metrix [32] and SEANCE [33], which 
report linguistic indices related to language sophistication, 
cohesion, and sentiment. Use of these two tools was motivated by 
prior work relating academic performance in mathematics to these 
features of language in online forums and discussion boards [20- 
24]. 


2.2.1 Coh-Metrix 

Coh-Metrix provides multiple levels of linguistic analysis that 
include indices at word level and sentence level, indices related to 
connections between the sentences, and discourse relationships 
between the texts and their mental representations. Coh-Metrix has 
been used to analyze different forms of text in the English language 
that are written to communicate messages to readers, including 
those within tutoring sessions, chat rooms, email exchanges and 
other forms of informal conversation [34, 35]. In the current study 
the Coh-Metrix indices that estimate psycholinguistic measures, 
word information, syntactic patterns, syntactic complexity, 
situation model, lexical diversity and other descriptive indices were 
used to specifically investigate the linguistic profiles of discourse 
in the Math Nation Wall posts. 


2.2.2 SEANCE 

The Sentiment Analysis and Cognition Engine (SEANCE) 
calculates sentiment indices for a text using pre-developed word 
vectors that measure sentiment and pre-existing sentiment, social 
positioning and cognition dictionaries. One particular advantage of 
SEANCE is that it accounts for the presence of negations in the texts
(e.g., “not sad” would not be assessed as negative). Yoo and Kim
found positive emotions reported by SEANCE to be strong 
predictors of success [35]. SEANCE has also been previously used 
to model math identity and math success [12]. In another study, 
Crossley et al. demonstrated using SEANCE that math 
performance was related to the use of fewer words related to respect 
[11]. Similarly, we used SEANCE in the current study to assess the 
extent to which sentiment expressed within the discussion posts 
was related to math performance.


2.3 Data Preprocessing and Feature Selection

The dataset was checked for multicollinearity, as it reduces the
precision of the estimated coefficients and makes it difficult to assess
the relative importance of the independent variables in explaining
the variation in the dependent variable. Highly correlated
features (r >= 0.90) were removed from the analysis. When two or
more attributes were found to be highly correlated, the attributes
with the greater number of pairwise correlations were removed.
The dataset was further filtered such that features with more than
20% missing values and those with zero or near-zero
variance were also removed.
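
The filtering steps above can be sketched as follows. This is an illustrative sketch, not the pipeline used in the study: the function names and toy data are hypothetical, near-zero variance is approximated by a simple variance cutoff, and the correlation cutoff is applied to the absolute value of r. Features with the most high-correlation partners are dropped first, one at a time, until no pair reaches the cutoff.

```python
import math


def pearson(x, y):
    """Pearson correlation of two equal-length lists (0.0 if undefined)."""
    n = len(x)
    if n < 2:
        return 0.0
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        return 0.0
    return sxy / math.sqrt(sxx * syy)


def sample_variance(values):
    m = sum(values) / len(values)
    return sum((v - m) ** 2 for v in values) / (len(values) - 1)


def filter_features(columns, r_cutoff=0.90, max_missing=0.20, min_variance=1e-8):
    """columns maps a feature name to a list of floats, with None for missing."""
    kept = {}
    for name, values in columns.items():
        present = [v for v in values if v is not None]
        if 1 - len(present) / len(values) > max_missing:
            continue  # too many missing values
        if len(present) < 2 or sample_variance(present) <= min_variance:
            continue  # zero or near-zero variance
        kept[name] = values
    while True:
        names = list(kept)
        counts = dict.fromkeys(names, 0)
        for i, a in enumerate(names):
            for b in names[i + 1:]:
                pairs = [(u, v) for u, v in zip(kept[a], kept[b])
                         if u is not None and v is not None]
                r = pearson([p[0] for p in pairs], [p[1] for p in pairs])
                if abs(r) >= r_cutoff:
                    counts[a] += 1
                    counts[b] += 1
        if not counts or max(counts.values()) == 0:
            break
        # Drop the feature involved in the most high correlations.
        del kept[max(counts, key=counts.get)]
    return list(kept)


toy = {
    "a": [1.0, 2.0, 3.0, 4.0, 5.0],
    "b": [2.0, 4.0, 6.0, 8.0, 10.5],     # nearly collinear with a
    "c": [1.1, 2.0, 2.9, 4.2, 5.0],      # nearly collinear with a and b
    "d": [5.0, 3.0, 4.0, 1.0, 2.0],      # only moderately correlated
    "const": [7.0, 7.0, 7.0, 7.0, 7.0],  # zero variance
    "sparse": [1.0, None, None, 3.0, 2.0],  # 40% missing
}
selected = filter_features(toy)
print(selected)
```

When several features tie for the most high correlations, which one is dropped first depends on ordering; a real analysis would need an explicit tie-breaking rule.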


The analysis initially included 124 variables (92 Coh-Metrix 
features, 20 SEANCE Component Scores, and 12 variables related 
to student factors). After preprocessing and feature selection, 12 
linguistic indices, and 4 student variables were included in 
subsequent analyses. The 12 features represent cohesion and 
sentiment measures, whereas the 4 non-linguistic indices represent 
demographics and performance data. 


3. RESULTS 


The purpose of this study was to assess the degree to which 
linguistic features of students’ language within the Math Nation 
Wall posts predicted EOC performance compared to other more 
traditional measures such as demographics and prior math 
performance. To this end, three linear models were examined using 
different combinations of candidate predictors of math 
performance (i.e., non-linguistic features, linguistic features, and 
the combination of both the non-linguistic and the linguistic 
features). The necessary assumptions for testing the regression 
models were met by examining model residuals for all three 
models. Figure 2 provides the sample diagnostic plots for the 
residual analysis of the full model. The residuals versus fitted graph
reveals no pattern, shows a constant variation, and depicts linearity.
The normal Q-Q plot also shows normal distribution of the
residuals. The remaining 2 plots do not depict any non-linear
behavior nor any violation of homoscedasticity. These models
were also validated using 10-fold cross-validation which rendered 
the best fit models in terms of RMSE performance. 
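
For readers unfamiliar with the validation step, the following is a minimal sketch of 10-fold cross-validation scored by RMSE for a one-predictor linear model. The data are synthetic and the helper names are hypothetical; the slope, intercept, and noise level were chosen only to loosely resemble the scale of the FSA and EoC scores and do not reproduce the study's data or tooling.

```python
import math
import random


def fit_line(xs, ys):
    """Ordinary least squares for y = b0 + b1 * x."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    slope = (sum((x - mx) * (y - my) for x, y in zip(xs, ys))
             / sum((x - mx) ** 2 for x in xs))
    return my - slope * mx, slope


def kfold_rmse(xs, ys, k=10, seed=0):
    """Mean held-out RMSE over k folds."""
    idx = list(range(len(xs)))
    random.Random(seed).shuffle(idx)
    folds = [idx[i::k] for i in range(k)]
    rmses = []
    for held_out in folds:
        held = set(held_out)
        train = [i for i in idx if i not in held]
        b0, b1 = fit_line([xs[i] for i in train], [ys[i] for i in train])
        squared = [(ys[i] - (b0 + b1 * xs[i])) ** 2 for i in held_out]
        rmses.append(math.sqrt(sum(squared) / len(squared)))
    return sum(rmses) / k


# Synthetic stand-in data (not the study's data).
rng = random.Random(42)
xs = [rng.gauss(355, 17) for _ in range(500)]
ys = [160 + 1.04 * x + rng.gauss(0, 15) for x in xs]
cv_rmse = kfold_rmse(xs, ys)
print(f"10-fold CV RMSE: {cv_rmse:.2f}")
```

Because each observation is predicted by a model that never saw it, the cross-validated RMSE is a more honest estimate of out-of-sample error than the training RMSE.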


3.1 Non-linguistic Predictive Model 

The non-linguistic features included in this regression model were 
the students’ gender, grade level, and FSA math scores of the 
previous year (see Table 2). Using only the FSA math scores as a 
candidate predictor the resulting model accounted for 58.64% of 
the variance. Using the FSA math score, gender, and grade level, 
grade level also emerged as a significant predictor but gender did 
not. The model with the FSA and grade level as predictors 
accounted for 63.12% of the variance of the EoC scores. No 
significant interactions between grade and gender emerged. 


These results suggest that the Math FSA score, depicting prior
performance in mathematics, and the grade level significantly
contribute to Algebra EoC performance, providing adequate
proxies for students’ baseline performance prior to the course.
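
As a rough plausibility check, the non-linguistic model can be evaluated at the grade-level mean FSA scores from Table 1. The sketch below uses the coefficients reported in Table 2 for the model with FSA, grade, and gender, and simply omits the non-significant gender term; this is an approximation, since the coefficients of the final two-predictor model are not reported separately. The function name is hypothetical.

```python
# Coefficients from Table 2 (model with FSA, grade, and gender);
# the non-significant gender term is omitted here as a simplification.
INTERCEPT = 253.855
B_FSA = 0.949
B_GRADE = -8.395


def predict_eoc(fsa_score, grade):
    """Predicted Algebra I EoC score from prior FSA score and grade level."""
    return INTERCEPT + B_FSA * fsa_score + B_GRADE * grade


# grade: (mean FSA score, mean Algebra I EoC score), from Table 1
table1 = {7: (362.89, 539.31), 8: (355.39, 525.58), 9: (343.92, 501.49)}
for grade, (mean_fsa, mean_eoc) in table1.items():
    predicted = predict_eoc(mean_fsa, grade)
    print(f"Grade {grade}: predicted {predicted:.1f}, observed mean {mean_eoc:.2f}")
```

The predictions land within a few points of the observed grade-level means, which illustrates how much of the EoC score is already captured by prior math performance and grade.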


3.2 Linguistic Predictive Model 

The primary purpose of this study is to examine the degree to which 
features of the language used by students in the wall posts are 
predictive of students’ Algebra EoC scores over and above baseline 
proxies provided by FSA performance and demographic variables. 


We conducted a multiple linear regression analysis predicting 
Algebra I scores using the 12 linguistic indices discussed in Section 
2.3.


Figure 2 - Diagnostic Plots for linear regressions


Table 2. Linear model including Non-linguistic Features


                   Estimate   S.E.     t
Using Math FSA only as candidate predictor
Math FSA Score     1.040      0.027    38.30
Intercept          157.778    9.721    16.23
Using Math FSA, grade and gender as candidate predictors
Math FSA Score     0.949      0.027    35.052
Grade              -8.395     0.745    -11.262
Gender *           1.000      0.884    1.131
Intercept          253.855    12.691   20.003


Notes: * Gender is not significant (p = 0.258); all other p < 0.001. Random
effects were estimated with school district (level 1) in a nested mixed-effects
model, resulting in a negligible amount of variance accounted for by the school
district (3.67%). Hence, the final models were constructed without district.


Table 3. Linear model including Linguistic Features


                                                Estimate   S.E.     t
Semantic overlap (givenness) of each sentence   -75.698    15.150   -4.997
Incidence of Pronouns                           -0.159     0.024    -6.732
Hypernymy for nouns                             5.617      0.976    5.758
Intercept                                       527.486    7.343    71.838


Note: all p < 0.001


After the 10-fold cross validation, the best-tuned model accounted
for 10.64% of the EOC variance with an RMSE = 21.73. Table 3
reports the coefficients of the significant linguistic predictors. Students whose
posts include higher noun specificity (hypernymy) also tended to
have higher Algebra I EoC performance. Moreover, lower degrees
of sentence givenness and pronoun incidence also emerged as
indicators of Algebra I EoC performance. These results imply that
the posts by better performing students were structured in such a
way that they used more specific terms for concepts or topics
(higher noun hypernymy), were less personal (lower pronoun incidence),
and included queries or responses with a greater amount of
elaboration on topics that varied across posts (lower sentence
givenness/newness).


3.3 Combined Model


The combined model included the significantly predictive features 
from both the non-linguistic and linguistic models (i.e., FSA score, 
grade level, LSA givenness, incidence of pronouns, and hypernymy 
of nouns). This model accounted for 64.48% of the variance for 
the Algebra I EoC scores with an RMSE of 13.73. The results are 
summarized in Table 4. 


The findings revealed that the full model with the combined 
linguistic and non-linguistic features performed only slightly better 
than the baseline model in predicting Algebra I scores (i.e., 63.1% 
vs. 64.5% of the variance). An ANOVA was conducted to compare 
the fitness of both regression models, comparing the non-linguistic 
model to the model with the linguistic predictors. The results 
indicated that the more complex model with the additional 
linguistic predictors better captured the variance of the Algebra I 
EoC scores than the baseline model, F = 20.879, p< 0.001. We also 
used Akaike information criterion (AIC) model selection to select 
the best fit model between the non-linguistic model and the model 
with the linguistic predictors. The model with the linguistic 
predictors emerged as the best-fit model carrying 100% of the 
model weight (AICc weight = 1) and having lower AICc (AICc full
model = 8,357.60; AICc baseline model = 8,394.72) in predicting
Algebra I EOC performance.
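
The reported model weight can be reproduced from the two AICc values using the standard Akaike-weight formula, in which each model's relative likelihood is exp(-Δ/2) and Δ is its AICc difference from the best model. The sketch below is a hand calculation for illustration; the function name is hypothetical and the software actually used for model selection is not specified here.

```python
import math


def akaike_weights(aicc_by_model):
    """Akaike weights from a mapping of model name to AICc value."""
    best = min(aicc_by_model.values())
    relative = {m: math.exp(-0.5 * (v - best)) for m, v in aicc_by_model.items()}
    total = sum(relative.values())
    return {m: r / total for m, r in relative.items()}


weights = akaike_weights({"full": 8357.60, "baseline": 8394.72})
for model, weight in weights.items():
    print(f"{model}: {weight:.6f}")
```

An AICc difference of about 37 leaves the baseline model with a negligible share of the weight, which is why the full model's weight rounds to 1.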


These results replicate prior studies [10,15] suggesting students’ 
language fluency and use within Math discussion boards provide 
valuable information regarding students’ potential performance at 
the end of the year. Importantly, these features can be captured 
dynamically as the course progresses, and in the absence of other 
information, such as prior course scores and demographics. 


Table 4. Linear model for Combined Features


                                                Estimate   S.E.     t
Math FSA score                                  0.902      0.027    32.889
Grade                                           -8.432     0.731    -11.533
Hypernymy for nouns                             2.919      0.617    4.730
Semantic overlap (givenness) of each sentence   -27.529    9.594    -2.869
Incidence of Pronouns                           -0.044     0.015    -2.929
Intercept                                       264.302    13.502   19.575


Note: all p < 0.001


4. CONCLUSION 


In summary, the results reported in this study confirm prior studies 
that have suggested that the students’ math course scores, and in 
this case Algebra I EoC scores, can be significantly predicted by
language, in particular hypernymy, pronoun incidence, and lower 
semantic overlap between sentences. Students with stronger course 
outcomes used more sophisticated language, across a wider range 
of topics, and with less personalized language. 


Students’ math scores from the previous year served as a proxy for 
baseline math performance, or prior math skills. As expected, prior 
math skills provided the strongest predictors of the EoC Algebra I 
scores. Students’ grade level also emerged as a significant 
(negative) predictor of the Algebra I EoC performance. The 
students self-select as to when they would take the Algebra I 
course. As such, higher ability students tend to take Algebra I in
middle school whereas lower ability students tend to take the exam
later in high school, and thus grade was negatively related to scores. 


Hypernymy (specificity) of nouns, an indicator of language 
fluency, contributed to the prediction of EoC performance such that 
a higher degree of hypernymy or specificity of the words used in the
discourse was related to higher EoC scores. Further, the discourse 
of higher performing students can be characterized as less personal 
as depicted by lower pronoun incidence. In addition, higher 
performing students’ posts had lower overlap between posts, and 
more new information as depicted by the lower givenness/new LSA 
index. 


We assume that students’ engagement in online discussions reveals 
some aspects of their mental representations or understanding of 
the academic content. In turn, the linguistic features of their 
language can serve as proxies for underlying literacy and math 
skills. The linguistic features that pertain to language fluency 
suggest that students’ posts were reflective of their ability to 
communicate more effectively and use terms more specific to the 
academic content. 


Notably, linguistic features depicting sentiment did not emerge as 
significant predictors of Algebra I EoC performance. This could be 
attributed to the academic nature of the discussion such that 
students’ discourse tends to be more domain-related and less 
personal in nature. Yet, there is a strong tendency in the NLP 
literature to focus on sentiment in language. This study indicates 
that when other features related to language sophistication are 
considered, sentiment may not emerge as a significant predictor of 
performance. 


There are multiple implications from this work. The first is
relatively obvious: literacy and language skills contribute to
students’ math performance. Language skills aid in students’
comprehension of math and their ability to communicate regarding
math. In turn, they are more likely to succeed. As such, providing
literacy instruction is important not only to enhance students’ performance
in language courses (ELA), but also for performance in content
courses (science, history) and mathematics courses. Second, these
results suggest that it behooves educators to consider literacy and
communication skills, and provide instructions as concretely and
coherently as possible [36-38].


Third, within the context of online platforms, these results further 
confirm the potential of leveraging linguistic and semantic features 
of students’ posts as indicators of potential course performance. It 
might be assumed that language is not an important indicator of 
math; and yet, multiple studies have demonstrated that the 
linguistic features are powerful proxies for students’ underlying 
skills and knowledge across a variety of contexts. Our future studies 
will consider other features of language (e.g., rhetorical features, 
lexical features) as well as examining students’ language use across 
various times during the course. Whereas this study solely 
examined aggregated posts at the end of the course, our future work 
will examine the number of posts necessary to significantly predict 
performance. Dynamic, online predictions are necessary in order to 
intervene, provide appropriate scaffolding to the students, and 
supply usable information to the mathematics instructors. Linguistic 
dimensions of the production of online discourse and their 
association with academic performance form a promising field of 
research. As such, linguistic profiles of discourse have strong 
potential to inform instructional and pedagogical design of 
collaborative learning environments such as Math Nation. 

ABSTRACT 

How do students respond to feedback in a reading platform? In 
this study we examined students’ (n = 670) reading and SRL 
behaviors after receiving feedback from their teachers. First, we 
examined the extent to which students revised their responses 
after receiving feedback. Second, we examined the association of 
reading and SRL behaviors with student scores after feedback. 
Third, we examined relationships between the type of feedback 
received (i.e., teacher comments) and subsequent student 
behaviors. We found that students who revised their answers more 
had greater score improvements. Teacher feedback in writing 
conventions was shown to produce fewer reading and SRL 
behaviors when compared to other types of feedback. The number 
of reading events was correlated with improved scores, although 
the effect size was small. These findings suggest that teacher 
feedback can help students employ reading and SRL behaviors 
and improve their reading comprehension under the right 
conditions. We discuss recommendations and possible design 
implications for online reading platforms. 


Keywords 


Self-Regulated Learning, Feedback, Sequence mining, Reading 
comprehension, Natural Language Processing 


1. INTRODUCTION 

Feedback can improve students’ performance [38] and 
Self-Regulated Learning (SRL) behaviors [10]. However, students 
must understand feedback in order to apply it [48]. Feedback gaps 
occur when students receive but do not act upon feedback [24], 
and may be caused by lack of clarity [9], students’ 
misunderstanding of feedback application [55], and the feedback 
paradox [61] (i.e., students do not address feedback despite 
understanding its importance). Researchers have recently 
emphasized the actionability of feedback as one factor to change 
students’ actions and behaviors [12]. This concept remains largely 
underexplored [34]. 


To address the feedback actionability gap, researchers have 
analyzed how students act upon receiving feedback by examining 
students’ perceptions [37, 50] and analyzing student behavior, 
including timely response to feedback [34], the effect of different 
types of feedback on the same question [27], and students’ 
learning strategies usage [43]. 

We examine students’ feedback response behavior in science 
reading. Science reading skills are of critical importance, but 
challenging for students to master [63]. Science reading can be 
enhanced through the application of SRL skills [15, 47]. To 
investigate SRL and science reading, we conducted our analysis 
on middle school science readings and questions from an online 
learning platform, Actively Learn (AL). We answered three 
research questions: 


RQ1. How do students’ scores vary after receiving feedback? 
RQ1.1. To what extent do students change their answers 
after receiving feedback? 
RQ2. How do students’ reading and SRL usage vary upon 
receiving feedback? 
RQ3. Is feedback type associated with subsequent reading and 
SRL behaviors? 


2. RELATED WORK 
2.1 SRL and Reading 


SRL refers to four regulatory processes during learning: goal 
setting, self-monitoring, self-evaluating, and applying strategies 
[65]. Self-regulated learners use self-monitoring skills to monitor 
their tasks [69] and can judge their learning outcomes in light of 
their goals [68]. Self-regulation is associated with academic 
performance [49, 66]. SRL researchers have proposed theories 
and models (e.g., Pintrich’s SRL framework [49], Zimmerman’s 
Cyclic model [66]) to explain learners’ SRL behaviors. In this 
study, we adopt Winne and Hadwin’s model [56, 58] to measure 
students’ SRL behaviors from log trace data within AL, as it has 
proven a useful framework for similar research [4]. 


SRL-based reading interventions have been effective in improving 
middle school reading in experimental studies [54]. 
Computer-based learning environments (CBLEs) can integrate 
SRL instruction via features to help students foster SRL skills. 
Examples of CBLEs that are rooted in models of SRL and have 
been shown to support reading comprehension and SRL behaviors 
for reading include iSTART [44-45], nSTUDY [4], and 
ReaderBench [18-19]. In this study, we examine the web-based 
platform Actively Learn (AL), which uses platform features that 
promote SRL (Section 3.1). 


2.2 Sequence Mining 

Sequence mining techniques can identify students’ learning 
behaviors [2, 29]. For example, n-gram sequencing techniques 
have been applied to a game-based learning platform to identify 
students’ problem-solving behavior [2] and to study associations 
between students’ academic performance and transition behavior 
among multiple platforms [29]. 

In this study, we are focused on what SRL activities students 
engage in on the AL platform after they receive feedback on a 
prior submission. In this analysis we applied an approach used by 
Sheshadri et al. [52] to examine sequence behaviors across 
platforms. In this approach we aggregated distinct SRL and 
question submission actions within AL and then examined the 
frequency and sequence of the activities prior to a resubmission. 

2.3 Feedback 


Providing feedback and opportunities for students to respond to 
feedback is one way teachers can assist their students in reading to 
learn tasks in STEM domains [42]. However, feedback quality can 
influence students' responses [53]. Hattie and Timperley [31] 
characterized feedback at four levels: the task (i.e., how well the 
student accomplished a task), processing (i.e., the processes 
required to complete the task), self-regulation (i.e., how students 
choose and implement self-regulatory strategies to accomplish a 
task); and the self-level (i.e., personal evaluations). Feedback 
effectiveness is also moderated by the amount of information 
provided; different types of feedback should be considered as 
separate constructs [60]. Prior studies have also shown that timely 
engagement with personalized feedback was associated with 
academic success for undergraduate students [34] and may also 
prompt more engaged learning activities when compared to 
general feedback [43]. 


Corrective and self-regulatory feedback given to students after 
answering comprehension questions in response to texts in a 
digital environment can enhance SRL behaviors and performance 
[39, 40]. In an experimental study, students who received 
self-regulatory feedback made more text searches and included 
more textual information in their responses when compared to students 
who received less informative or no feedback [39]. A follow-up 
study replicated these findings and also demonstrated that 
requiring students to select relevant text information before 
re-submitting answers led to improved SRL behaviors [40]. Taken 
together, these studies suggest that corrective and self-regulatory 
feedback can improve SRL behaviors and reading performance 
when students are tasked with re-submitting answers to 
comprehension questions with digital texts. 


However, it can be challenging for teachers to provide timely and 
informative feedback at scale [31, 53]. A prior study on feedback 
comments of science assignments [8] indicated that feedback that 
did not provide a correct answer was only helpful if students knew 
where to find the correct answer; more informative feedback was 
required when students lacked background knowledge. Prior 
research also suggests that timely engagement with feedback, 
particularly personalized feedback, was associated with academic 
success for undergraduate students [34]. Written comments can 
provide an effective means for providing feedback on science 
content [8] and in digital reading comprehension tasks [39, 40]. In 
this study we examined how teacher feedback comments within 
the context of a digital science reading comprehension task related to 
students' SRL behaviors. 


3. Actively Learn (AL) Platform 


AL is an online K-12 reading platform for multiple disciplinary 
subjects. AL catalogs curriculum-integrated readings that teachers 
can assign as in-class or homework assignments. Teachers can 
also add their own content as assignments. AL assignments 
contain text-embedded questions that can be multiple choice 
(MCQ) and short-answer (SA), including open ended questions 
and fill-in-the blanks. Teachers can give feedback on students’ 
answers to questions by scoring questions on a scale from 0-4 and 
writing comments. 


We adopted Winne and Hadwin’s SRL model in our study. Winne 
and Hadwin’s model has four phases: task defining (Phase 1), 
goal setting (Phase 2), enacting tactics and strategies (Phase 3), 
and metacognitively adapting strategies (Phase 4). We primarily 
focus on students’ usage of SRL tactics/strategies within AL 
(Phase 3) and adapting reading and SRL (Phase 4) upon 
receiving feedback. Our study is grounded in the Winne and 
Hadwin model, as its focus on the events underlying SRL [57] fits 
the retrospective analysis of student interaction data within our 
study. Furthermore, we focus on three types of SRL events that 
are consistent with prior literature situated in Winne and Hadwin’s 
model: annotating [3, 41], highlighting [59], and vocabulary 
lookups [5]. 


3.1 Dataset Preparation 

The present study was conducted with middle school physical 
science data collected from AL in 2018. The initial dataset 
contained 17,886 student records from 1,033 classes. First, after 
data cleaning, we included classes containing 10-60 students (n = 
14,925 students). Second, we identified student submissions on 
which they received feedback. This reduced dataset included 
1,819 unique students, 3,867 questions, and 5,373 submissions. 
Third, we applied the following filtering criteria: 1) a student 
submitted a question multiple times, 2) received at least one 
instance of feedback, and 3) re-submitted after receiving 
feedback. The trimmed dataset, which included student empty 
submissions, contained 670 unique students in 113 classes, 58 
teachers, 156 assignments, 1,072 questions, and 2,502 
submissions. All questions in our dataset are SA questions. 


4. METHODOLOGY 


We describe our methodology for each RQ in this section. 


4.1 RQ1 Methodology 


To answer RQ1 we measured students’ score differences by 
calculating the difference between the first and last submission 
scores when students made multiple attempts after receiving 
feedback. We observed three categories of submissions: score 
increased in the last submission, score decreased in the last 
submission, and score was unchanged in the last submission. 


To assess whether students were addressing teachers’ feedback, 
we calculated similarity between subsequent answer submissions 
of a question. We hypothesized that changes in submitted answers 
would result in a greater score difference in a question. Thus, we 
measured the cosine similarity between subsequent submissions 
of a question. Specifically, we calculated cosine similarities 
between the i-th and (i-1)-th submissions for each attempt i ≥ 2 and 
took the average. We used all submissions because we wanted to assess 
how students changed their answers upon receiving feedback, 
and how those changes impacted their final scores. 


To encode students' responses into vector representations, we used 
the Universal Sentence Encoder (USE) [15]. The USE takes a 
word, a sentence, or a paragraph as input and encodes it into a 
fixed-length vector of 512 values. We used the Deep 
Averaging Network (DAN) variant of USE to encode 
questions and question-dependent texts into vectors [15]. DAN 
averages unigram and bigram embeddings to construct a 
sentence embedding. Moreover, to evaluate how the answer 
modifications were connected to score differences, we calculated 
Spearman correlation between mean cosine similarities and score 
difference. 
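
The similarity and correlation steps described above can be sketched as follows. This is a minimal illustration, not the authors' code: it assumes each submission has already been encoded into a numeric vector (for example by USE), and the function names are hypothetical. The Spearman coefficient is computed here as the Pearson correlation of average ranks.

```python
import math


def cosine_similarity(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def mean_consecutive_similarity(embeddings):
    """Average similarity between the i-th and (i-1)-th submissions, i >= 2."""
    if len(embeddings) < 2:
        raise ValueError("need at least two submissions")
    sims = [cosine_similarity(embeddings[i - 1], embeddings[i])
            for i in range(1, len(embeddings))]
    return sum(sims) / len(sims)


def average_ranks(values):
    """1-based ranks; tied values share the average of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1  # mean of ranks i+1 .. j+1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks


def spearman(x, y):
    """Spearman correlation: Pearson correlation of the average ranks."""
    rx, ry = average_ranks(x), average_ranks(y)
    mx, my = sum(rx) / len(rx), sum(ry) / len(ry)
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    sx = math.sqrt(sum((a - mx) ** 2 for a in rx))
    sy = math.sqrt(sum((b - my) ** 2 for b in ry))
    if sx == 0.0 or sy == 0.0:
        raise ValueError("correlation is undefined for constant input")
    return cov / (sx * sy)
```

Under these assumptions, one sim_score is computed per question and then correlated with that question's score difference (last score minus first score).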


4.2 RQ2 Methodology 


To answer RQ2, we coded student actions within the AL system 
as either an answer submission, reading event (R), or SRL event, 
such as annotating (A), highlighting (H), or vocabulary-lookup 
(V). The AL system does not define explicit student sessions. 
Therefore, we adopted a data-driven approach from prior research 
to define sessions [36, 52]. First, we aggregated students’ 
assignment actions and timestamps into a unified transaction log. 
We then plotted histograms of the time elapsed between pairs of 
consecutive actions to estimate typical intervals within an 
assignment. Based on our exploratory analysis we selected 30 
minutes as a “session” cutoff. Any gap of more than 30 minutes 
between consecutive actions marked the start of a new session. 
After defining session 
cutoffs, we split all student actions within an assignment by 
session. Next, we counted the SRL events that occurred before a 
student’s resubmission of a question that had received feedback. 
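
A minimal sketch of the session split and event count, assuming each action is a (timestamp, code) pair for one student and one assignment. The codes R, A, H, and V follow the coding above; the code "S" for an answer submission and both function names are assumptions made for illustration.

```python
from collections import Counter
from datetime import datetime, timedelta

SESSION_GAP = timedelta(minutes=30)  # cutoff reported in the text


def split_into_sessions(actions, gap=SESSION_GAP):
    """Group (timestamp, code) actions into sessions.

    A new session starts whenever the time since the previous action
    exceeds `gap`.
    """
    sessions = []
    for ts, code in sorted(actions, key=lambda a: a[0]):
        if sessions and ts - sessions[-1][-1][0] <= gap:
            sessions[-1].append((ts, code))
        else:
            sessions.append([(ts, code)])
    return sessions


def count_events_before_resubmission(actions, feedback_time):
    """Count reading/SRL events after feedback and before the next submission.

    "S" (submission) is a hypothetical code; R, A, H, V are the event codes
    named in the text.
    """
    counts = Counter()
    for ts, code in sorted(actions, key=lambda a: a[0]):
        if ts <= feedback_time:
            continue
        if code == "S":
            break
        counts[code] += 1
    return counts
```

For example, a log with actions at 0, 5, 50, and 55 minutes yields two sessions, because only the 45-minute gap exceeds the cutoff.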


We then applied a four-level hierarchical linear model (HLM) to 
predict the last score of a question. HLM is commonly used in 
educational research [24, 50] to account for nested data [62]. Our 
HLM model included questions at level-one, assignment ID at 
level-two, student ID at level-three, and teacher ID at level-four. 
Fixed effect variables included students’ first score on questions 
and features of SRL usage while attempting questions. All 
grouping variables were modelled as random intercepts. 
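
One way to write such a random-intercept model is sketched below. The notation is an assumption introduced for illustration and is not taken from the paper: y is the last score on question q in assignment a for student s of teacher t, "first" is the first-attempt score, x_k are the reading and SRL event counts, and the nested indexing of the random intercepts is one possible reading of the four levels.

```latex
\begin{aligned}
y_{qast} &= \beta_0 + \beta_1\,\mathrm{first}_{qast}
          + \sum_{k=2}^{K} \beta_k\, x_{k,qast}
          + u_t + v_{st} + w_{ast} + \varepsilon_{qast}, \\
u_t &\sim \mathcal{N}(0, \sigma_u^2), \quad
v_{st} \sim \mathcal{N}(0, \sigma_v^2), \quad
w_{ast} \sim \mathcal{N}(0, \sigma_w^2), \quad
\varepsilon_{qast} \sim \mathcal{N}(0, \sigma_\varepsilon^2).
\end{aligned}
```

Here u, v, and w are the teacher-, student-, and assignment-level random intercepts, and the fixed effects are shared across all groups.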


4.3 RQ3 Methodology 


To answer RQ3, we categorized teacher feedback comments using 
deductive analysis, which is a method for analyzing content using 
a predefined model based on prior research [23]. Our deductive 
analysis categories were adapted from Hattie and Timperley’s [31] 
feedback categories, which have been used in prior research [1, 
30]. Our model was also influenced by Shute’s [53] review of 
formative feedback, and Bruno and Santos’ [8] combined 
inductive and deductive coding scheme of teacher written 
comments in a science classroom context for task and 
processing-level feedback. We established five a-priori feedback 
categories using these models. The categories comprised two forms of 
task- and processing-level feedback [31], which asked a student to 
(i) provide a correction to a response or (ii) provide an 
explanation of a response [8]; (iii) self-level feedback; and (iv) 
feedback on SRL behaviors. We also created a category for feedback 
that only addressed (v) conventions (e.g., spelling, grammar). 


The SRL behavior category included teacher comments that 
referred to the SRL reading behaviors described in the previous 
section. SRL feedback has been defined as high-information 
feedback about task performance together with suggestions for 
employing self-regulation strategies to monitor cognitive processes, 
self-evaluate performance, and develop strategies to improve 
performance [31, 60]. We defined SRL feedback more broadly to 
include teacher comments that directed students to refer to 
the text to revise an answer, to annotate or highlight 
the text, or to look up a vocabulary term. This definition is more 
appropriate within the context of AL, in which teachers leave 
brief comments on comprehension questions. Prior research has 
also defined SRL feedback in this context as feedback that 
includes knowledge about when to refer to the text [39] and which 
text information is relevant for completing the task [40]. 


Two members of the research team trained on coding comments 
using a sample. All differences in training were resolved by 
discussion. One researcher then coded all teacher feedback 
comments (n = 1,441). A second researcher independently coded 
23% of this sample. Inter-rater reliability (IRR) was calculated 
using Cohen’s kappa and was found to be acceptable (κ = 0.74, p 
< 0.001). We then applied a nonparametric Kruskal-Wallis test to 
identify if reading and SRL behaviors varied significantly among 
feedback categories. 
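
For reference, Cohen's kappa compares the observed agreement between two coders with the agreement expected by chance from each coder's label frequencies. The sketch below is a generic illustration with made-up labels; it is not the authors' analysis script, and the function name is hypothetical.

```python
from collections import Counter


def cohens_kappa(rater_a, rater_b):
    """Cohen's kappa for two raters who labeled the same items."""
    if len(rater_a) != len(rater_b) or not rater_a:
        raise ValueError("raters must label the same non-empty set of items")
    n = len(rater_a)
    # Proportion of items on which the two raters agree.
    observed = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    # Chance agreement from each rater's marginal label frequencies.
    freq_a, freq_b = Counter(rater_a), Counter(rater_b)
    expected = sum(freq_a[c] * freq_b[c] for c in freq_a) / (n * n)
    if expected == 1.0:
        raise ValueError("kappa is undefined when chance agreement is 1")
    return (observed - expected) / (1 - expected)
```

A value of 1 indicates perfect agreement and 0 indicates agreement no better than chance.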


5. RESULTS 


In the following subsections we discuss our results for each RQ. 


5.1 RQ1 Results 


We calculated the average cosine similarities between subsequent 
submissions (sim_score) and score difference (d) with and without 
empty student submissions. A higher sim_score indicates that the 
submitted answers are more similar to each other. The frequencies 
of the seven score difference categories and question counts (n) 
are: -2 (n = 4), -1 (n = 53), 0 (n = 187), 1 (n = 474), 2 (n = 252), 3 
(n = 87), and 4 (n = 15). Total questions = 1,072. The Spearman 
correlation between score difference (d) and mean cosine 
similarity (sim_score) was negative (coefficient = -0.315, p < 0.001). 
The negative coefficient indicates that as the mean cosine similarity 
decreases, the score difference increases. In other words, the 
more students changed their subsequent answers, the 
greater the score difference. 

Score Increased. Descriptive statistics in this category are: 828 
unique questions, 1,963 submissions by 543 students. First 
attempt scores ranged from 0 to 3 with a mean of 1.41. Last attempt 
scores ranged from 1 to 4 with a mean of 2.98. Scores in this 
category increased by 1, 2, 3, or 4 points. In these four groups, 
sim_score has a lower median value 
compared to the rest. This observation indicates that students with 
greater score increases had submissions that differed more from 
their original answers, as represented by a lower sim_score. We 
also examined submissions with identical responses (sim_score 
= 1) but an increase in final score (n = 40). 

Score Decreased. This group includes 57 unique questions with 
124 submissions by 49 students. First attempt scores ranged from 
1 to 4 with a mean of 1.58. Last attempt scores ranged from 0 to 3 
with a mean of 0.51. 


Score Unchanged. Descriptive statistics for this category are: 187 
unique questions, 415 submissions by 165 students. First and last 
attempt scores have the same statistics in this category: they 
ranged from 0 to 4 with a mean of 1.69. 


5.2 RQ2 Results 

Standardized effect sizes were calculated using the formula β = 
(B × SDx) / SDy [50]. First attempt score had the highest 
predictive power (B = 0.32, β = 0.28, p < 0.001). Among the reading 
and SRL event counts, only reading 
was a statistically significant positive predictor. Highlighting 
behavior was negatively associated with the last score. 
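
The standardization formula can be illustrated with a short sketch: the unstandardized coefficient is rescaled by the ratio of the predictor's standard deviation to the outcome's standard deviation. The function name and the toy data are assumptions, not values from the study.

```python
import statistics


def standardized_coefficient(b, x, y):
    """Standardized effect size: b * SD(x) / SD(y), using sample SDs.

    b is the unstandardized coefficient for predictor x on outcome y.
    """
    return b * statistics.stdev(x) / statistics.stdev(y)
```

The result is read as the change in the outcome, in standard deviations, per one standard deviation change in the predictor.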


5.3 RQ3 Results 


Feedback comments were coded as requesting a correction (n 
= 654), requesting an explanation (n = 565), prompting an SRL 
behavior (n = 134), 
addressing conventions (n = 77), or giving self-level feedback (n = 11). 
SRL events after receiving feedback and before resubmission 
were identified. Kruskal-Wallis test results indicated statistically 
significant differences across the five feedback categories for 
reading (p < 0.001), highlighting (p = 0.007), and vocabulary 
lookup events (p = 0.006). Annotating text was not significant. 


We conducted post-hoc analyses using Dunn’s pairwise tests with 
Benjamini-Hochberg correction for features with statistically 
significant results. Effect size (r) is reported using a 
nonparametric measure, Cliff’s delta. Results indicated that students 
were less likely to engage in reading after conventions feedback 
when compared to SRL behavior feedback (p < 0.001, r = 0.25), 
corrective (p < 0.001, r = 0.28), and explanation feedback (p < 
0.001, r = 0.29). For highlighting, the Kruskal-Wallis test assessed 
whether the group with non-zero entries (i.e., SRL Behavior) was 
statistically different from the groups with all-zero entries. We 
found statistically 
significant differences between SRL Behavior and corrective 
feedback (p = 0.002, r = 0.009) and explanation feedback (p = 
0.001, r = 0.009). Students were more likely to look up 
vocabulary words after corrective than after explanation feedback (p = 
0.003, r = 0.021). 
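
Cliff's delta, the effect size reported above, is the proportion of cross-group pairs in which the first group's value is larger minus the proportion in which it is smaller. The sketch below is a generic illustration with a hypothetical function name; the inputs would be per-submission event counts for two feedback categories.

```python
def cliffs_delta(xs, ys):
    """Cliff's delta between two samples, in the range [-1, 1]."""
    if not xs or not ys:
        raise ValueError("both samples must be non-empty")
    greater = sum(1 for x in xs for y in ys if x > y)
    less = sum(1 for x in xs for y in ys if x < y)
    return (greater - less) / (len(xs) * len(ys))
```

Values near 0 indicate heavily overlapping distributions, which is consistent with the small effect sizes reported when most entries are zero in both groups.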


6. DISCUSSION AND CONTRIBUTIONS 


Scholarly Implication: Student Response to Feedback 

RQ1 results show that students who modified their answers had 
greater score differences, which is consistent with prior findings 
on automated feedback [39, 40, 64]. We found that teachers at 
times scored revised responses lower than students’ initial score. 
When examining students’ responses, we found students 
sometimes submitted the identical answer or an empty answer 
(“No response”) despite the teacher asking for explanation or 
suggesting additional correction. This phenomenon in which 
students do not address teacher feedback is known as the 
“feedback gap” [24]. Students might not respond to feedback if 
they find it difficult to decipher [9], lack study habits [20], or 
erroneously believe it does not apply to them [28]. One limitation 
of the present study is that it is not equipped to determine the 
reason for lack of student response. 


Our HLM analysis from RQ2 shows that reading events and initial 
scores were statistically significant predictors of last scores. 
However, SRL variables such as annotation, highlighting, and 
vocabulary lookups were not statistically significant predictors. 
We also found that highlighting was underutilized by students and 
that self-level feedback was not commonly employed by teachers. 


Feedback comments categorized as focusing on correction, 
explanation, and SRL behaviors were associated with more 
reading events during student revisions when compared to 
feedback about conventions. We expected SRL feedback to 
produce more reading events and SRL behaviors than other 
categories based on prior research with automated feedback, 
because these comments directed students to revisit the text to 
revise their answers [39, 40]. However, SRL feedback did not 
produce statistically significant differences in student behaviors 
compared to correction and explanation feedback. One reason for 
this finding might be that these feedback categories had similar 
amounts of information; the level of feedback informativeness 
may have a greater impact on student performance and behavior 
[60]. Corrective (e.g., “Protons cannot be gained or lost”) and 
explanation feedback did not explicitly direct students to 
revisit the text or use an AL feature, but perhaps these behaviors 
were implied during a 
task-oriented reading assignment with explanation feedback 
comments such as: “Great definitions but you need to explain why 
phase changes are considered physical changes.” This might 
explain why vocabulary look-ups were more common in 
corrective and explanation comments when compared to SRL 
(e.g., “Go back and reread paragraph 9 and reanswer. Might help 
you to plug some numbers into the equation to see how the 
inverse relationship works.”) and conventions feedback 
(“Capitalize the first word in a sentence.”). Conversely, perhaps 
the SRL feedback could more effectively influence reading events 
and SRL behaviors if teachers provided more explicit information 
that helped students decide when and how to revisit the text to 
revise an answer [39] or required students to select relevant 
information from the text to support their answer [40]. SRL 
feedback may have directed students to relevant portions of the 
text based on relatively greater highlighting behavior after SRL 
feedback, but this effect size was small, and highlighting was not 
positively related to score change, calling into question the value 
of this behavior. 

For Teachers: Feedback Quality 

Our analysis also showed that teacher comments were generally 
short and contained limited information. It may be possible for 
teachers to improve the quality and effectiveness of their feedback 
by providing more SRL feedback [60, 31], and by avoiding 
self-level feedback and comments about conventions, which were 
shown to not support student performance in the present study. To 
optimize feedback from teacher comments and increase student 
feedback uptake, teachers should support students in 
understanding feedback comments and evaluation criteria [13], 
which may require greater elaboration within comments and 
potentially instruction outside of AL. 

Design Implications: Automated Feedback Affordances 
Feedback can improve performance [38], but poor feedback can 
hinder student learning [31]. Middle school teachers may not have 
enough time to provide quality and timely feedback to all 
students, particularly when providing feedback to open-ended 
questions that require source-based explanations [6]. Although 
automated feedback may assist teachers in providing quality and 
timely feedback without increasing their workload [6, 14], 
challenges remain in building platforms to provide such feedback 
within the context of source-based science questions. For 
example, AL science questions are often constructed responses 
that require connecting information from different paragraphs. 
State-of-the-art NLP research on automatically inferring 
information from paragraphs for reading comprehension is still at an 
early stage [22, 35]. One possible design for automated support 
would be to collect other teachers’ feedback on the same question 
in the AL platform and provide suggestions to the teacher. 


Design Implication: Supporting Feedback Actionability 

Some students were not responsive to feedback as indicated by 
their submission of an empty answer or re-submission of the same 
answer. One solution to increase actionability could be pointing to 
additional learning materials in an automated feedback setting 
[33]. For example, Broos and colleagues [7] designed a button 
“Okay, what now?” in a dashboard to provide actionable 
feedback. Students could click the button to view extra reading 
content. Similarly, a nudge can be implemented in AL—“Are you 
sure you want to submit that empty answer?” 


7. CONCLUSIONS 


This study has two main contributions to reading and SRL 
research: (i) empirically evaluating students' response changes to 
short answer questions upon receiving feedback and (ii) 
measuring the association of students’ reading and SRL with five 
feedback categories. Our findings show that students who revised 
their answers demonstrated statistically significant differences in 
their scores. We also observed that teachers mainly provided 
corrective feedback, followed by explanatory feedback and feedback 
related to SRL behavior. Students exhibited more reading behavior upon 
receiving these types of feedback than convention-related 
feedback. These results may aid educators in writing feedback 
comments to students for maximal impact. 

ABSTRACT 

Active involvement of new community members is essential 
for Q&A platforms such as Stack Overflow, to make the platform 
efficient and more inclusive. However, more than half 
of Stack Overflow users contribute only once and disappear. 
This decreases the diversity of viewpoints and experience on 
the platform. This paper aims to identify factors that can 
discourage users from active participation after their first 
or second post. We collected a dataset of the responses to 
questions posted by new users (answers, comments, upvotes, 
downvotes) and analysed the tone of the feedback and its impact 
on the users’ ongoing participation. We considered as 
new users those who registered on Stack Overflow within the 
two years before the data collection and classified them 
into three groups based on the number of their posts (low, 
medium and high number of posts) on Stack Overflow. The 
differences in the responses between the three user groups 
were validated by performing one-way ANOVA and 
Pearson’s chi-square test. Based on these results we trained 
a machine learning model using an SVM classifier which predicts 
whether a user is likely to post or not with an accuracy 
of 88.69%. Our work contributes to identifying and quantifying 
the potential underlying factors behind the decline in 
participation and dropout of new users on Stack Overflow. 


Keywords 
Stack Overflow, User Analysis, One-way ANOVA, SVM Classifier, 
User Prediction Model 


1. INTRODUCTION 

Stack Overflow (SO) is one of the most popular Q&A-based 
platforms for programmers, having over 50 million monthly 
visitors [7], over 16 million questions [19] and 19 million 
answers [11]. SO has detailed guidelines for posting questions 
and some fundamental standards for providing feedback to 
the questions. Expert users who have been using Stack 
Overflow for a long time are rewarded with badges and 
reputation [14]. In order to garner a high reputation on the site, 
the user must be active on a regular basis on the SO platform 
and their questions must have many positive responses. 
However, novice users may not have the correct vocabulary 
or expertise to formulate a technical question. As a result, 
they may end up getting negative responses such as “stupid 
question” or “This is such common issue, just google”. Such 
negative responses to posts may lead users to limit or 
cease their contributions to the platform. Users of this kind 
make up almost half of the platform’s users [18]. Unsolved 
questions in SO have seen an exponential growth over 
the years [16] and they continue to be an issue for new users 
who seek help. Studies have shown that online trolling and 
negative responses worsen over the active time of a user in a 
community [8] and 77% of users tend to ask for help on the 
SO platform only once [13]. In this paper, we identify and 
validate the factors which impact the state of participation 
of the new users on Stack Overflow. “New users” are those 
who had been registered on SO for less than 2 years as of 
August 2019 (when the dataset was created). We selected 
and analysed five features from the SO post responses to 
understand how getting little to no response and negative 
responses is related to users’ posting behaviour. Based on 
our findings, we built a machine learning classifier to predict 
the posting status of users in SO. 


2. RELATED WORK 


The Stack Overflow (SO) platform has turned into a valuable
resource for both skilled and amateur programmers for
glitches, bugs and any code-related problems. Managing such
a large user base has been a challenge and an ongoing topic
for investigation and research from different perspectives.
Anderson et al. [3] explore the correlation between user
reputation and quality of answers and its impact on the design
of the site. Asaduzzaman et al. [4] mine the unanswered
questions in SO to reveal the underlying factors that lead to
questions remaining unanswered, such as title length, askers’
score, post length etc. Alharthi et al. [2] investigate several
factors that impact the quality of questions in SO and
predict the score of a question, which indicates its overall
quality. A similar idea of prediction has been explored by Shao
et al. [17], who developed a prediction model which analyses the
latent context of a question and recommends an answer for
the user. Calefato et al. [6] developed a framework based
on successful questions on SO to provide an evidence-based
guideline for programmers to write better questions in SO.
Grant et al. [12] explore the use of badges to motivate
users. When a question has better wording and quality, it
attracts more users and the user gets upvotes, which in turn
helps the score. Adaji et al. [1] investigate specific social
support strategies that influence users to contribute to SO.
More recent studies have focused on the behavioural and
personality traits of SO users as well, in order to target
the emotional aspects of the users who ask questions [15,
5]. The novice or infrequent users who have just started out face
some level of criticism or neglect by the more experienced
users on SO, a phenomenon related to maintaining community
boundaries by hazing. Hazing is a psycho-social phenomenon
where the newcomers in a tightly knit group face
backlash and an elitist attitude (which is sometimes borderline
abusive) [9]. Slag et al. [18] discuss the difficulties encountered
by “one day flies”, users who post only once in their
profile’s lifetime and do not contribute to the platform
afterwards. Our work further investigates the effect of the factors
identified in [18] by providing statistical and empirical
validation of the hypotheses proposed there.


3. RESEARCH QUESTIONS AND DATASET 


Since we wanted to compare the responses to the posts, not
the nature of the posts themselves, we eliminated two of the
factors of Slag et al. [18]: duplicate questions and uncommon tags.
Instead, to further investigate the features of questions asked
by such inactive users, we added five new factors: the number
of upvotes on a question (Up Votes), the number of downvotes
(Down Votes), the number of comments on a post (Comment
Count), the reputation of users (Reputation), and the
types of comments on a post (Comment Texts). We chose to
add reputation since it affects how a user’s posts are perceived
by other users. We aim to answer two research questions:
“1. Do these factors have any quantifiable relation to the
frequency of posts of users in Stack Overflow?”

“2. Can we predict whether a user will drop out and stop
posting?”


3.1 Data Collection 


We collected data from Stack Exchange Data Explorer¹,
an open source tool for querying publicly available data from
Stack Overflow. We used it to
collect information about users who created their profile on
Stack Overflow in 2017.

For our work, we chose to consider only the questions
posted by users as their contribution. We collected the number
of answers, comments, upvotes, downvotes and views
received by each post of a user, as well as the user’s
reputation. We decided to analyze the mean values of these
features for each user so that we can consider all of them in
a normalized form, since the distribution of responses is not
equal for all users. In order to determine the overall tone of
a comment, we inferred the polarity of the comments from
their text through sentiment analysis, using TextBlob on all
the comments received by a user. Polarity falls within the
range of -1 to 1, where -1 refers to a very negative sentiment,
0 refers to a neutral sentiment and 1 refers to a very positive
sentiment. We categorized this range into five different groups
as shown in Table 1.


¹ https://data.stackexchange.com/stackoverflow/query/new


Figure 1: Distribution of users population (in thousands)
based on the number of posts they made on
Stack Overflow


Table 1: Group based on overall polarity of the comment


Group                     Description
1 (Highly negative)       If -1 <= polarity < -0.5
2 (Moderately negative)   If -0.5 <= polarity < 0
3 (Neutral)               If polarity = 0
4 (Moderately positive)   If 0 < polarity <= 0.5
5 (Highly positive)       If 0.5 < polarity <= 1
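
As an illustration of the grouping in Table 1, the following is a minimal sketch of how a polarity score could be mapped to the five groups. The function name `polarity_group` is hypothetical and not taken from the paper. The TextBlob call is shown only in a comment so that the sketch runs without third-party packages.

```python
def polarity_group(polarity: float) -> int:
    """Map a sentiment polarity in [-1, 1] to the five groups of Table 1."""
    if not -1.0 <= polarity <= 1.0:
        raise ValueError("polarity must lie in [-1, 1]")
    if polarity < -0.5:
        return 1  # highly negative: -1 <= polarity < -0.5
    if polarity < 0:
        return 2  # moderately negative: -0.5 <= polarity < 0
    if polarity == 0:
        return 3  # neutral
    if polarity <= 0.5:
        return 4  # moderately positive: 0 < polarity <= 0.5
    return 5      # highly positive: 0.5 < polarity <= 1


# With TextBlob installed, the polarity of one comment could be obtained as:
#   from textblob import TextBlob
#   polarity = TextBlob(comment_text).sentiment.polarity
groups = [polarity_group(p) for p in (-0.8, -0.5, 0.0, 0.3, 0.9)]
print(groups)
```

Note that the boundary value -0.5 falls into group 2 and 0.5 into group 4, following the inequalities in Table 1.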


3.2 Target user groups 

The users were categorized into the following groups based 
on the number of questions they posted from January 2017 
to June 2019: group 1 (users who posted a question once), 
group 2 (users who have posted from 2 to 5 times), group 
3 (users who have posted more than 5 times). The distribution
of the user population according to the number of posts
they made in their lifetime on SO in our collected data is
depicted in Figure 1. We aim to identify users who are likely
to stop posting by analyzing their past experiences on SO. In our
prediction model, described in Section 5, for predicting the future
behavior of a user, group 1 is labelled as the negative class
whereas groups 2 and 3 are combined into a single category
as the positive class. Therefore, the labeled classes considered
for this study are:


1. Negative class: Users who will discontinue making any 
contribution to Stack Overflow after their first post. 


2. Positive class: Users who will continue contributing to 
the platform. 
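
The grouping and labelling rules above can be sketched as follows. This is an illustrative sketch only; the function names `user_group` and `class_label` are hypothetical, and the input is assumed to be the number of questions a user posted in the observation window.

```python
def user_group(num_questions: int) -> int:
    """Return the user group (1, 2 or 3) for a given number of questions."""
    if num_questions < 1:
        raise ValueError("a user in the dataset has posted at least once")
    if num_questions == 1:
        return 1  # posted a question once
    if num_questions <= 5:
        return 2  # posted 2 to 5 times
    return 3      # posted more than 5 times


def class_label(num_questions: int) -> str:
    """Group 1 is the negative class; groups 2 and 3 form the positive class."""
    return "negative" if user_group(num_questions) == 1 else "positive"


print([(n, user_group(n), class_label(n)) for n in (1, 2, 5, 6)])
```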


4. DATA ANALYSIS

Figure 2: Distribution of users population (in thousands)
based on the number of posts they made on
Stack Overflow


The data analysis of feature selection, validation of metrics 
and prediction model is provided below. 


4.1 Feature Selection 

From the data collected from SO, we performed Pearson
correlation analysis to find out how strongly each feature is
related to the others. The results of the correlation analysis for
the features are shown in Figure 2. Following Cohen’s
classification system [10], only the largest relationships, i.e. those
where the correlation coefficient r > 0.5, have been considered to be
strongly correlated. From Figure 2, it is evident that
the correlation coefficients of comment count, answer count,
downvote, upvote, and polarity are greater
than 0.5. Therefore, these five features
are selected as the final metrics for the next stage.
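
The selection rule can be illustrated with a minimal sketch. It computes Pearson's r in plain Python and keeps the features whose coefficient exceeds the 0.5 threshold. The paper does not state which pairs of variables were compared, so correlating each feature with a single reference series, using the magnitude of r, and the toy data are all assumptions made for illustration.

```python
from math import sqrt


def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equally long sequences."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)


def strongly_correlated(features, reference, threshold=0.5):
    """Keep feature names with |r| above the threshold (assumed rule)."""
    return [name for name, values in features.items()
            if abs(pearson_r(values, reference)) > threshold]


# Toy data, not taken from the paper.
toy_features = {"mean_answer": [1, 2, 3, 4, 5], "noise": [2, 1, 2, 1, 2]}
toy_reference = [1, 2, 3, 4, 6]
selected = strongly_correlated(toy_features, toy_reference)
print(selected)
```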


4.2 Feature Distribution in User Groups 

To answer the first research question (Do these features have
any quantifiable relation to the frequency of posts of users
in Stack Overflow?), we analyzed their statistical differences
among the three user groups. We used a one-way ANOVA
test and Pearson’s chi-square test to establish statistical
evidence of the differences in these features among
the three user groups.

Each of the five features is plotted against the number of
posts from users, separately for each of the
three target user groups, so that differences between the
groups can be observed. The sections below describe
each of the features across all three user groups in depth.


4.2.1 Number of Answers against Number of Posts 
Figure 3 depicts the distribution of the average number of
answers against the number of posts from each user from the
target group of users on SO. The mean numbers of answers
in the three groups are 1.10 (group 1), 3.31 (group 2) and
15.70 (group 3), which indicates that users in group 3 receive
significantly more responses to their posts compared to users
in groups 1 and 2. The one-way ANOVA test indicates a
statistically significant difference among the three groups
in terms of the mean number of answers they receive against
their posts (F(2,375196) = 678.8, p < .001).
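
For readers unfamiliar with the notation, F(2,375196) reports the between-group and within-group degrees of freedom: k - 1 = 2 for three groups and N - k = 375196 for the total number of observations N. The following minimal sketch computes a one-way ANOVA F statistic in plain Python. The function name and toy data are illustrative assumptions and do not reproduce the paper's analysis.

```python
def one_way_anova(groups):
    """Return (F statistic, df_between, df_within) for a list of samples."""
    k = len(groups)
    n_total = sum(len(g) for g in groups)
    grand_mean = sum(sum(g) for g in groups) / n_total
    means = [sum(g) / len(g) for g in groups]
    # Variation of the group means around the grand mean.
    ss_between = sum(len(g) * (m - grand_mean) ** 2
                     for g, m in zip(groups, means))
    # Variation of the observations around their own group mean.
    ss_within = sum((x - m) ** 2 for g, m in zip(groups, means) for x in g)
    df_between = k - 1
    df_within = n_total - k
    f_stat = (ss_between / df_between) / (ss_within / df_within)
    return f_stat, df_between, df_within


# Toy samples, not taken from the paper.
f_stat, df_b, df_w = one_way_anova([[1, 2, 3], [2, 3, 4], [3, 4, 5]])
print(f_stat, df_b, df_w)
```

In practice a library routine such as `scipy.stats.f_oneway` would also return the p-value.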


4.2.2 Number of Comments against Number of Posts


The mean numbers of comments in the three groups are
2.22 (group 1), 6.60 (group 2) and 30.00 (group 3), which
indicates that the mean number of comments increases
substantially with the number of posts in each group.
Moreover, it is also evident from Figure 4 that users who
posted less (no more than five times) exhibited a higher
tendency of receiving no comments from other users. The
result of the one-way ANOVA test indicates a statistically
significant difference among the groups
in terms of the number of comments they receive against their
posts (F(2,375196) = 187.3, p < .001).


4.2.3 Number of Upvotes against Number of Posts 
Figure 5 shows the relation between the average number
of upvotes and the number of posts. The mean
upvotes in groups 1 and 2 are significantly lower than in group
3 (0.696, 1.853 and 8.877 respectively), which indicates that
the posts made by the users of group 3 are more appreciated
and receive a higher number of upvotes than the posts made
by the users who are less active. The one-way ANOVA result
indicates a statistically significant difference among the three
groups in terms of the number of upvotes they receive against
their posts (F(2,375196) = 17.15, p < .001).


4.2.4 Number of Downvotes against Number of Posts 

From Figure 6, it can be observed that the mean numbers of
downvotes in group 1 and group 2 (0.548 and 1.309 respectively)
are lower than that of group 3 (4.17). However, group 3
users also post far more questions, and their downvotes grow
more slowly than their upvotes: the ratio of mean downvotes to
mean upvotes falls from about 0.79 in group 1 to about 0.47 in
group 3. In this relative sense, the users who are posting more
questions are getting fewer downvotes. This is an important and
surprising observation since a higher number of posts could have
also led to a proportionally increased number of downvotes, which
turns out not to be the case. The one-way ANOVA result indicates
a statistically significant difference among the groups
in terms of the number of downvotes they receive
against their posts (F(2,375196) = 473.04, p < .001).


4.2.5 Comment Polarity against Number of Posts 


Table 2: Percentage of each comment polarity cate- 
gory received by the user groups 


                          User Group
Polarity Group            1        2        3
1 (Highly negative)       0.2%     0.1%     0%
2 (Moderately negative)   20.1%    20.7%    13.6%
3 (Neutral)               24.4%    11.7%    1.3%
4 (Moderately positive)   48.4%    67.2%    85.1%
5 (Highly positive)       11%      0.3%     0%


The result of cross tabulation in Table 2 revealed that users
from group 3 received no highly negative comments (0%), and
the smallest proportion of moderately negative comments. They
also received the highest proportion of moderately positive
comments (85.1%). On the other hand, the users of group
1 received the lowest proportion of moderately positive
comments (49.5%) among the user groups. It can be concluded
from Table 2 that with the increase in the number of posts
made by the users, there is an increase in the positive
comments and a decline in the negative remarks received by the
post owners. Lastly, the result of Pearson’s chi-square test
establishes a statistically significant relationship between
the polarity of comments and the user groups (χ²(8, 297447) =
20310.29, p < .001).
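
The 8 degrees of freedom reported above are consistent with a table of 5 polarity groups by 3 user groups, since (5 - 1) x (3 - 1) = 8. The following minimal sketch computes Pearson's chi-square statistic for a contingency table of observed counts. The function name and the toy counts are illustrative assumptions; the paper reports percentages only, so the actual counts are not reproduced here.

```python
def chi_square(table):
    """Return (chi-square statistic, degrees of freedom) for a table of counts."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    total = sum(row_totals)
    stat = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            # Expected count under independence of rows and columns.
            expected = row_totals[i] * col_totals[j] / total
            stat += (observed - expected) ** 2 / expected
    dof = (len(table) - 1) * (len(table[0]) - 1)
    return stat, dof


# Toy 2x2 table, not taken from the paper.
stat_2x2, dof_2x2 = chi_square([[10, 20], [20, 10]])
# A 5x3 table has 8 degrees of freedom, matching the reported test.
_, dof_5x3 = chi_square([[10, 10, 10] for _ in range(5)])
print(stat_2x2, dof_2x2, dof_5x3)
```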
Figure 3: Distribution of average number of answers against number of posts among three user groups
Figure 4: Distribution of average number of comments against number of posts among three user groups


            Name            Measure
Features    Mean answer     Scale
            Mean comment    Scale
            Mean upvote     Scale
            Mean downvote   Scale
            Mean polarity   Scale
Target      User Class      Nominal


Table 3: Attributes of the dataset employed in the 
SVM classifier 


Classifier | Accuracy | Precision | Recall
SVM        | 0.8869   | 0.9875    | 0.7635


Table 4: Prediction model performance per evalua- 
tion metric 


5. PREDICTING USER PARTICIPATION 


Based on the findings from correlation analysis and statistical
testing of factors influencing users’ post frequency, we
developed a prediction model to answer the second
research question. To develop and train our model,
we used a popular supervised machine learning
algorithm, the support vector machine (SVM). Table
3 describes the features and target groups employed in our
model. We divided the original data set into a training set
representing 80% of the data and a testing set representing the
remaining 20%. Using the five features, the model assigns
users to two classes: likely to post and not likely to post.
The performance of our prediction model, i.e. how well it
predicts the user class, is evaluated using three metrics:
accuracy, precision and recall. As Table 4 shows, our model
achieves high accuracy (0.8869) and precision (0.9875), with a
lower recall (0.7635).
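
To make the three evaluation metrics concrete, the following minimal sketch computes accuracy, precision and recall from true and predicted class labels. The function name and the toy labels are illustrative assumptions and do not reproduce the paper's results. Label 1 stands for the positive class (likely to post) and 0 for the negative class.

```python
def classification_metrics(y_true, y_pred, positive=1):
    """Return (accuracy, precision, recall) for binary labels."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    tn = sum(1 for t, p in pairs if t != positive and p != positive)
    accuracy = (tp + tn) / len(pairs)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return accuracy, precision, recall


# Toy labels: one positive user is missed, no negative user is mislabelled.
y_true = [1, 1, 1, 1, 0, 0]
y_pred = [1, 1, 1, 0, 0, 0]
accuracy, precision, recall = classification_metrics(y_true, y_pred)
print(accuracy, precision, recall)
```

The toy case mirrors the pattern in Table 4: precision stays high when few negative users are predicted as positive, while recall drops when some positive users are missed.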


6. IMPLICATIONS AND FUTURE WORK 


From our data analysis, we observed that a low number of
answers and comments, a relatively high number of downvotes and
negative comments, and a low number of upvotes are more
prevalent in the posts of users who have posted fewer times
compared to the users who have a higher number of posts.
Since these users are receiving negative remarks and downvotes
even with fewer posts, this may play a role in discouraging
them from seeking help again from SO. In light of
this finding, we trained an SVM classifier with the
five selected features and divided the users into two classes:
users who will post in the future and users who will not.
The model has shown good performance, with high accuracy
and precision.

Previous works mostly focused on how users can ask
better questions or build a better profile to attract more
answers to their questions. The novelty of our study is to
identify infrequent users and find possible factors underlying
their withdrawal, so that the community owner/moderator
can make the platform more welcoming and less hostile
for them. Our study has some limitations as well. We could
not consider the number of deleted questions of a user as one
of the factors that could contribute to users’ decline in posts
since Stack Exchange Data Explorer does not provide that
data. The research also lacks a qualitative analysis of
feedback from infrequent or absent users. Therefore, as part
of our future plan, we will attempt to explore the user
modelling of infrequent posting through a targeted qualitative
user study of SO users.


7. CONCLUSIONS 

More than half of the users in Stack Overflow tend to ask for
help on the platform only once and never post again. In this
paper, we identified five main features which we
hypothesized to be related to the inactive status of users.
We collected the responses to posts in SO for users who
have had their SO profiles for 2 years (2017 to 2019) and selected
five factors with strong correlation. Our statistical
analysis supports our hypotheses and shows that
these factors have a significant correspondence to users’
posting frequency. Using these factors as selected features,
we trained a machine learning model that predicts whether
or not a user will post again on the Stack Overflow platform, based
on the responses their posts have received so far. This prediction
can identify users who have reduced their posting in SO
and face a lack of encouragement, and thus can benefit from a
positive nudge, help or mentorship. The significance of the
contribution of our analysis and prediction model is that it
can help to provide more equitable treatment of newcomers,
and thus increase the diversity of the SO community.


Figure 6: Distribution of average number of downvotes against number of posts among three user groups


ABSTRACT


While the use of programming problems on exams is a com- 
mon form of summative assessment in CS courses, grad- 
ing such exam problems can be a difficult and inconsis- 
tent process. Through an analysis of historical grading pat- 
terns we show that inaccurate and inconsistent grading of 
free-response programming problems is widespread in CS1 
courses. These inconsistencies necessitate the development
of methods to ensure fairer and more accurate grading.
In subsequent analysis of this historical exam data we
demonstrate that graders are able to more accurately assign 
a score to a student submission when they have previously 
seen another submission similar to it. As a result, we hy- 
pothesize that we can improve exam grading accuracy by 
ensuring that each submission that a grader sees is similar 
to at least one submission they have previously seen. We 
propose several algorithms for (1) assigning student submis- 
sions to graders, and (2) ordering submissions to maximize 
the probability that a grader has previously seen a similar 
solution, leveraging distributed representations of student 
code in order to measure similarity between submissions. 
Finally, we demonstrate in simulation that these algorithms 
achieve higher grading accuracy than the current standard 
random assignment process used for grading. 


Keywords 
similarity, code embeddings, embeddings, assessment, grad- 
ing, human, simgrade, grade 


1. INTRODUCTION 


Free-response coding questions are a common component 
of many exams and assessments in programming courses. 
These questions are popular because they give students the 
opportunity to show their understanding of course mate- 
rial and demonstrate their coding and problem-solving skills 
[16]. However, the flexible nature of these problems intro- 
duces unique challenges when it comes to grading student 
responses, which are compounded in situations where the
scale of the course necessitates a team of graders working
together (“group grading”). The difficulty of consistent ap- 
plication of grading criteria by a group of graders stems from 
the incredible diversity of student submissions that are gen- 
erated for free-response coding questions. In particular, it 
has been previously shown that the space of different student 
solutions to free-response programming problems follows a 
long-tailed Zipf distribution [18]. For this reason, it is chal- 
lenging to develop automated systems for grading and pro- 
viding feedback and thus human grading remains the gold 
standard for grading such free-response problems. However, 
even a team of human graders with extensive experience can 
struggle to consistently and accurately apply a single, unified
set of criteria when grading. This is problematic as it can
result in negative impacts on students in the form of incorrectly
assigned grades and inaccurate feedback. Our goal in
this paper is to explore the frontier of techniques for improving
the process and outcomes of the exam grading experience.


Our main insight in developing improved approaches for 
grading is that it is easier for graders to grade in a consis- 
tent manner if they are able to grade similar submissions one 
after another. First, we examine historical data to provide 
concrete evidence of a relationship between grader accuracy 
and the similarity of previously graded submissions to the 
current submissions a grader is grading. Then, we propose 
algorithms that group and order similar submissions in dif- 
ferent ways to minimize grader error. Finally, we show that 
these algorithms perform better than current baseline meth- 
ods for grading. This work’s primary contributions are: 


1. Reporting of grader errors in a CS1 course 


2. Using historical data to demonstrate the potential ben- 
efits of similarity-based grading 


3. Three algorithms for grading using code similarity 


1.1. Related Work 


Autograding One commonly used approach to scale grading
is the use of autograders [6]. While useful for comparing
program output for correctness or matching short snippets
of code, autograders are more problematic for free-response
questions in exam settings. In such contexts, the subtlety of
understanding that human graders provide is often essential
to providing appropriate feedback to students and properly
assessing the (partial) correctness of their solutions. While
promising, fully autonomous AI solutions are not ready for
grading CS1 midterms, especially in contexts
with only hundreds of available student submissions [17].


Grading by Similarity The idea of grouping and organizing
student submissions in order to improve grading outcomes
has been previously proposed for a variety of problem types.
Merceron and Yacef [9] use vectors that encode students’
mistakes in order to group together students who make similar
mistakes when working on formal proofs in propositional
logic. Gradescope, designed by Singh et al., offers
functionality for grading similar solutions, which is currently
most effective on multiple-choice-type questions. This
approach has also been applied to short answer questions, as
explored by Basu et al. [2], as well as math problems, as
demonstrated by Mathematical Language Processing [8]. In
this paper, we identify “similar” student responses on
free-response programming questions to improve grading quality.


Code Similarity In order to define similarity metrics for student
code submissions, we apply techniques for generating
numerical embeddings for student programs. Henkel
et al. created abstracted symbolic traces, a higher-level,
light-syntax summary of the programs, and embedded
them using the GloVe algorithm [13]. Alon et al.
pioneered code2vec, an attention-based embedding model
specifically used to represent code. Recently, further advances
have been made to improve code embeddings by
training contextual AI models on large datasets from GitHub
[7]. For this application, we favor simpler unsupervised embedding
strategies that do not require human-generated labels
by adapting the popular NLP technique Word2vec [10],
in which “word” representations are derived from surrounding
context.


1.2 Dataset 


Our analysis focuses on the student submissions and grader 
logs from four exams for an introductory programming (CS1) 
course taught in Python. The breakdown of summary statistics
across the four exams is presented in Table 1. As a note,
a “submission” is defined as one student’s written answer to 
one free-response problem — thus, the total number of sub- 
missions for a given exam is roughly the number of students 
times the number of coding problems on the exam. In to- 
tal, we analyze 11,171 student submissions across 1,490 stu- 
dents. Additionally, we have grading logs for every student 
submission, which consist of information about the grader,
the criteria items applied, the final score, and the amount of 
time that the grader spent on the submission. 199 graders 
contributed to grading these four exams. As discussed be- 
low, the same student submission is sometimes graded by 
more than one grader for validation purposes. Thus, our 
dataset contains 14,597 individual grading log entries. 


Our grading data comes from a grading software system 
that randomly distributes student submissions to graders. 
Among the standard student submissions for grading, this 
software also inserts “validation” submissions that have al- 
ready been graded by senior teaching assistants. Every grader 
assigned to a specific problem will grade all “validation” sub- 
missions for that problem. The presence of these special 
submissions creates opportunities for assessing grader per- 
formance, both relative to their peers and relative to “ex- 
pert” performance. 


Exam #   # Students   # Submissions   # Graders
1        533          3,731           53
2        259          1,813           52
3        247          2,470           51
4        451          3,157           43
Total    1,490        11,171          199


Table 1: Exam Grading Dataset Summary Statistics 


2. NATURAL GRADING ERROR 


While grading inconsistency is a common anecdotal
observation in our experience as educators, our first focus
is to quantify the inconsistencies present in historical
grading sessions in a rigorous manner. In particular, our 
analysis focuses on the aforementioned “validation” submis- 
sions that were specially handled by the grading software 
and assigned to every grader working on a specific problem. 
As a result, we had a subset of the grading logs for which 
we knew both the true grade (as defined by an expert) and 
the “validation” grade assigned by each grader. These
values are plotted against one another in Figure 1, which
reveals troubling inconsistencies in the grades assigned by
graders. With an RMSE of 7.5 (i.e., a typical error of roughly 7.5
percentage points per problem), we see that grading error is
significant, nearly on the order of what would translate to
a full letter grade. Linear regression on this plot yields an
R-squared coefficient of 0.947, indicating that while the error
may be high, the direction of errors is generally unbiased.
In other words, there is no systematic over/under-grading.
Rather, the grading errors tend to be randomly distributed
around the true grade. Thus, the rest of this paper focuses
on methods for decreasing this demonstrated inconsistency
(absolute error) in human grading.
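
The two quantities discussed above can be sketched as follows: the root mean squared error (RMSE) measures the size of grading errors, while the mean signed error indicates whether graders systematically over- or under-grade. The function names and the toy grades are illustrative assumptions, not data from the study.

```python
from math import sqrt


def rmse(true_grades, assigned_grades):
    """Root mean squared error between expert and grader scores."""
    errors = [a - t for t, a in zip(true_grades, assigned_grades)]
    return sqrt(sum(e * e for e in errors) / len(errors))


def mean_signed_error(true_grades, assigned_grades):
    """Average of (assigned - true); close to zero when errors are unbiased."""
    errors = [a - t for t, a in zip(true_grades, assigned_grades)]
    return sum(errors) / len(errors)


# Toy grades in percentage points: one over-grade and one under-grade.
true_grades = [80.0, 80.0]
assigned_grades = [83.0, 77.0]
print(rmse(true_grades, assigned_grades))
print(mean_signed_error(true_grades, assigned_grades))
```

In the toy case the errors cancel in the signed mean but not in the RMSE, which is the situation described in the text: sizeable but unbiased error.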


Figure 1: True grade assigned by expert vs. validation grade
assigned by human grader


3. METHODS 


In this section, we will first outline methods for answering 
key questions about the problem of improving human grad- 
ing using similarity scores. Then, we will present three novel 
algorithms for improving human grading. 


3.1 Can code similarity be accurately captured? 
We generate program embeddings for all student submissions
in our corpus. Word embeddings are an established
method of encoding semantics in human language
[3], and these same techniques applied to code accomplish
similar results. Algorithms for generating embeddings are
constantly evolving and improving; to avoid over-optimization
at the embedding generation stage, we chose to employ the
simple baseline Word2Vec algorithm. We then demonstrated
that our embeddings are semantically significant using
zero-shot rubric sampling [18]. For details, see the Appendix¹.


Figure 2: Submission assignment via three algorithms: Cluster, Snake, Petal


3.2 Does similarity influence grader accuracy? 
We hypothesize that graders score submissions more accu- 
rately when they have recently seen a submission similar to 
the current submission. To test this hypothesis, we ana- 
lyze grading data for four exams. First, for each grader, we 
generate a “percentage grading error,” which is an average 
of their absolute percent deviation from the correct answer 
on all validation submissions that they graded. Then, for 
each of the validation submissions that a grader evaluated, 
we sort their personal grading logs by time and look at the 
window of three submissions leading up to each validation 
submission they graded. To quantify similarity of the valida- 
tion submission to recently graded submissions, we take the 
maximum of the cosine similarity between the current vali- 
dation submission and the three previous submissions. We 
plot the maximum similarity between a validation submis- 
sion and the previous submissions against a grader’s per- 
centage grading error in order to identify the relationship 
between a grader’s history and accuracy. Then we can infer 
a formula that approximates the relationship between pre- 
vious submission similarity and percentage grading error. 
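
The similarity measure described above can be sketched as follows: for a validation submission, take the maximum cosine similarity to the (up to) three submissions graded immediately before it. This is a minimal sketch assuming embeddings are plain lists of floats ordered by grading time; the function names and toy vectors are hypothetical.

```python
from math import sqrt


def cosine_similarity(u, v):
    """Cosine similarity of two non-zero vectors of equal length."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = sqrt(sum(a * a for a in u))
    norm_v = sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        raise ValueError("zero vector has no direction")
    return dot / (norm_u * norm_v)


def max_recent_similarity(history, current, window=3):
    """Max similarity between `current` and the last `window` graded items.

    Returns None when the grader has no earlier submissions.
    """
    recent = history[-window:]
    if not recent:
        return None
    return max(cosine_similarity(prev, current) for prev in recent)


# Toy 2-d embeddings; the oldest entry lies outside the window of three.
history = [[1.0, 0.0], [0.0, 1.0], [0.0, 1.0], [1.0, 1.0]]
current = [1.0, 0.0]
print(max_recent_similarity(history, current))
```

In the toy case the identical first submission is ignored because it falls outside the three-submission window, so the maximum comes from the diagonal vector.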


3.3 Algorithms to assist human grading 

We compare four algorithms for assigning submissions to 
graders: (1) Random, in which submissions are randomly 
assigned to graders, with five “validation” submissions in- 
terspersed for assessing grader bias. This is the status quo 
and serves as the baseline. (2) Cluster, in which each grader 
is assigned to a cluster of highly similar submissions. (3) 
Snake, in which each grader is randomly assigned a set of 
submissions and is shown the submissions greedily by near- 
est neighbor. (4) Petal, in which the dataset is divided into 
“petals” and all graders begin in the same place. Figure 2
provides a visualization of (2), (3), and (4). Detailed
explanations of the algorithms are in the Appendix¹.
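
The ordering step of the Snake algorithm, showing submissions "greedily by nearest neighbor", might look like the following sketch. The full algorithm is specified in the Appendix, which is not reproduced here, so the starting point, the use of cosine similarity, and the tie-breaking by index are all assumptions made for illustration.

```python
from math import sqrt


def cosine_similarity(u, v):
    """Cosine similarity of two non-zero vectors of equal length."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (sqrt(sum(a * a for a in u)) * sqrt(sum(b * b for b in v)))


def greedy_nearest_neighbor_order(embeddings, start=0):
    """Order submission indices so each one is followed by its most
    similar not-yet-shown submission (assumed reading of 'Snake')."""
    remaining = set(range(len(embeddings))) - {start}
    order = [start]
    while remaining:
        last = embeddings[order[-1]]
        # sorted() makes tie-breaking deterministic (lowest index wins).
        nxt = max(sorted(remaining),
                  key=lambda i: cosine_similarity(last, embeddings[i]))
        order.append(nxt)
        remaining.remove(nxt)
    return order


# Toy embeddings for one grader's randomly assigned submissions.
toy = [[1.0, 0.0], [0.0, 1.0], [0.9, 0.1], [0.2, 0.8]]
order = greedy_nearest_neighbor_order(toy)
print(order)
```

A greedy ordering of this kind keeps consecutive submissions similar but does not guarantee a globally optimal path.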


¹ https://compedu.stanford.edu/papers/appendices/


Figure 3: Grader error vs. prior submission similarity. Relationship
between grader accuracy and similarity in 3-submission window prior
to validation submission


3.4 Algorithm evaluation 

To evaluate the performance of the different algorithms, we 
simulate grading for a 444-person six-problem exam and ten 
graders, using real student programs from an actual exam. 
Details about the selection of validation submissions are in 
the Appendix¹. When running the simulation, we infer percentage
grading error by examining the similarity of the previous
three submissions to the current submission. While
we emphasize grader error as the most important metric 
for assessing an algorithm, a secondary consideration is how 
naturally validation submissions integrate with the rest of 
a grader’s assigned submissions. Ideally, a validation sub- 
mission is not “out-of-distribution” with respect to the other 
submissions that a grader is assigned. Otherwise, a grader 
will be able to tell when they are being evaluated for grad- 
ing accuracy. To assess how “out-of-distribution” the vali- 
dation submissions are, we examine how dissimilar the vali- 
dation submissions are from the non-validation submissions 
assigned to a grader. Specifically, for each validation submis- 
sion, we measure the distance between the validation sub- 
mission and the nearest non-validation submission assigned 
to that grader. We average over the five validation submis- 
sions in order to get the mean minimum distance from val- 
idation to non-validation for a grader, which will be higher 
if one of the validation submissions is out-of-distribution. 
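
The "mean minimum distance" metric described above can be sketched as follows. The text does not name the distance function, so cosine distance (one minus cosine similarity) is assumed here; the function names and toy vectors are hypothetical.

```python
from math import sqrt


def cosine_distance(u, v):
    """1 - cosine similarity (assumed distance between embeddings)."""
    dot = sum(a * b for a, b in zip(u, v))
    norms = sqrt(sum(a * a for a in u)) * sqrt(sum(b * b for b in v))
    return 1.0 - dot / norms


def mean_min_validation_distance(validation, assigned):
    """Average, over validation submissions, of the distance to the
    nearest non-validation submission assigned to the same grader."""
    minima = [min(cosine_distance(v, a) for a in assigned)
              for v in validation]
    return sum(minima) / len(minima)


# Toy case: the first validation item has an identical neighbour,
# the second is further from everything the grader was assigned.
validation = [[1.0, 0.0], [0.0, 1.0]]
assigned = [[1.0, 0.0], [1.0, 1.0]]
result = mean_min_validation_distance(validation, assigned)
print(result)
```

A single out-of-distribution validation submission raises the average, which is why the metric is higher when validation items stand out.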


4. EXPERIMENTAL RESULTS 


4.1 Similarity scores are meaningful 
Embeddings are semantically significant because similarity
between embeddings corresponds to similarity between submission
feedback labels, as described in the Appendix¹.


Figure 4: Left: Average per-submission grading error for each algorithm, Center: Distance of validation submissions from
normally assigned submissions, Right: Summary performance statistics, including comparison to random baseline.


4.2 Similarity influences grading 

Graders score assignments more accurately when they have 
recently seen a submission similar to the current submission 
they are grading. From our analysis of historical data, we 
find that when there is a high similarity between the cur- 
rent submission and at least one of the previous three sub- 
missions, the percentage grading error is low. Conversely, 
when the similarity to the previous submissions is low, the
percentage grading error is high. We find a linear relationship
between the maximum similarity of the previous three
submissions and the percentage grading error, as shown in
Fig. 3, with R² = 0.605. Given that the grading process
involves the numerous uncertainties that come along with
human involvement, we believe this R² value
shows a statistically significant relationship between historical
similarity and grader accuracy. While the linear relationship
between historical submission similarity and percentage
grading error is a simplifying assumption, it is the best assumption
we can make given the evidence provided in Fig. 3.
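
As an illustration of the kind of linear fit discussed here, the sketch below computes an ordinary least-squares line and its R² from paired observations of maximum recent similarity and percentage grading error. The function name `fit_line` is hypothetical, and no coefficients from the paper are reproduced.

```python
def fit_line(xs, ys):
    """Ordinary least-squares fit y = slope * x + intercept, with R^2."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    # R^2 = 1 - (residual sum of squares) / (total sum of squares)
    ss_res = sum((y - (slope * x + intercept)) ** 2 for x, y in zip(xs, ys))
    ss_tot = sum((y - mean_y) ** 2 for y in ys)
    r_squared = 1.0 - ss_res / ss_tot
    return slope, intercept, r_squared
```

Once fitted, the line can be evaluated at the maximum similarity between a submission and the three submissions graded before it to obtain a predicted grading error.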


4.3 Improved accuracy by algorithm 

We compare six algorithms for assigning submissions to graders
and selecting an order in which a grader will view a submission
in Figure 4. We apply the equation of the linear relationship
shown in Figure 3 to the similarity of submissions as
ordered for evaluation by different algorithms in our experiments.
This equation allows us to predict grader accuracy
when using the orderings provided by different algorithms.
We find that implementing a path ordering on a clustered
assignment of graders to submissions yields the lowest mean
error of 2.7% (in bold in Fig. 4), while the other algorithms
all show an improvement over the baseline 10.2% grading
error. We use bootstrapping [4] over 100,000 trials to
obtain the p-values that indicate the significance of the
difference in means between the baseline algorithm and the
other algorithms (see table in Fig. 4).
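
One common way to obtain such a p-value is a pooled bootstrap test of the difference in means. The sketch below is an assumption about the general technique, not a reproduction of the authors' procedure: the resampling scheme, the one-sided alternative, and the function name are all hypothetical choices.

```python
import random


def bootstrap_p_value(baseline, candidate, trials=10_000, seed=0):
    """One-sided bootstrap p-value for mean(baseline) > mean(candidate).

    Both groups are resampled with replacement from the pooled data,
    which simulates the null hypothesis of no difference in means.
    """
    rng = random.Random(seed)

    def mean(values):
        return sum(values) / len(values)

    observed = mean(baseline) - mean(candidate)
    pooled = list(baseline) + list(candidate)
    at_least_as_extreme = 0
    for _ in range(trials):
        resampled_base = [rng.choice(pooled) for _ in baseline]
        resampled_cand = [rng.choice(pooled) for _ in candidate]
        if mean(resampled_base) - mean(resampled_cand) >= observed:
            at_least_as_extreme += 1
    # Add-one correction keeps the estimate strictly positive.
    return (at_least_as_extreme + 1) / (trials + 1)
```

Here `baseline` and `candidate` would hold per-submission grading errors under the random assignment and under one of the proposed algorithms, respectively.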


4.4 Validation viability by algorithm 

When comparing the cluster, snake, and petal algorithms, 
we observe that the cluster-based algorithms are most likely 
to have validation submissions that are “out-of-distribution,” 
with a mean validation distance of 0.0277. All other algo- 
rithms have substantially lower mean minimum distances. 


5. DISCUSSION


Overall, we saw that all of our novel proposed algorithms 
for assignment of submissions to graders provided improve- 
ments over the random baseline in simulation. In general, 
we saw that path-based algorithms (petal-path and cluster- 
path) had lower grading error than their non-path counter- 
parts because they are designed to optimize for maximum 
similarity between consecutive submissions that a grader 
grades. In particular, the cluster-path algorithm yielded the 
lowest grader error in simulation due to its strong tendency 
to assign very similar submissions to graders. On the other 
hand, the snake algorithm provided the best average
distance to validation submissions, which may be important
for a smooth experience for a real-life grader. Finally,
we saw that the petal algorithm offered a balanced trade-off 
between these two extremes — while not optimal in either 
metric, it can be a good choice when both metrics (grading 
error and validation submission distance) are equally impor- 
tant for designing a grading experience. For a more in-depth 
discussion of our observed results, see the Appendix.


6. CONCLUSION 


Through analysis of historical exams, we demonstrated that 
there is inconsistency between true scores and grader-assigned 
scores. In doing so, we introduce a new task and associated 
measure, grading correctness. Moreover, we found experi- 
mental support for our hypothesis that graders are able to 
assign scores to exam problems more accurately when they 
have previously seen similar submissions. In turn, we pro- 
posed the use of code embeddings to capture semantic in- 
formation about the structure and output of programs and 
identify similarity between submissions. Using similarity 
of code embeddings in conjunction with historical grading 
data, we demonstrate in simulation that graders are indeed 
able to score submissions more accurately when they have 
previously seen another submission similar to it. We propose 
and compare several algorithms for this task, showing that it 
is possible to achieve a significant increase in grading accu- 
racy over simple random assignment of submissions. Future 
extensions of this work include (i) improvements on code 
embeddings and (ii) deployment of the grading algorithms 
in an operational system to allow more direct experimental 
comparison of grading accuracy. The use of such algorithms
shows promise for improving accuracy, and in turn fairness,
in evaluations of student performance.


ABSTRACT


In distance education and some computer-assisted learning
scenarios, asking for help when needed is important. Some
students do not ask for help even when they do not know 
how to proceed. In situations where a teacher is not present, 
this can be a serious setback. We aim to find an approach 
to learn about students’ help-seeking behaviour by studying 
sequences of actions that end with the student asking for 
help. The goal is to be able to recognize those students who 
need help but fail to ask for it and offer them assistance. We 
propose to include the temporal context of user-platform 
interaction and suggest an ensemble model to learn from 
both general and personal tendencies. 


Categories and Subject Descriptors 

K.3.1 [Computers and Education]: Computer Uses in Education;
I.2.6 [Artificial Intelligence]: Learning; I.5.4 [Pattern
Recognition]: Applications; G.3 [Probability and Statistics]:
Markov processes, Time series analysis 


Keywords 
Adaptive systems, time series, educational data mining, per- 
sonalized education 


1. INTRODUCTION 


Researchers have found help-seeking to be important in learn- 
ing scenarios and observed that some students do not reach 
for help when they need it [4, 25]. When a teacher is not 
always present, and the student needs some level of self- 
discipline, not asking for help might be problematic as the 
student could end up wheel-spinning [18] or abandoning the 
task. The longitudinal nature of student-platform interac- 
tions leads us to think that taking into account the temporal 
context could be useful for analysing help-seeking behaviour. 
There is literature on help-seeking including temporal data, 
and some works have focused on performance prediction or 
have centred on specific knowledge topics. However, we have 
not found work that focuses on the behaviour around help-seeking
actions, including temporal data and independent of
student knowledge and task content. In this Master’s thesis,
we propose to represent student-platform interactions as sequences
of actions, study whether sequential patterns exist
in students’ help-seeking behaviour and explore whether a
prediction model could identify students that need help but
do not ask for it.


2. RELATED RESEARCH 


Time series studies are very common in natural sciences and 
some social sciences. Studies that make use of time series 
data can also be found in the field of educational sciences [5, 
8, 9, 13, 14, 15, 17, 29]. We have reviewed existing works on 
both help-seeking behaviour and time series data analysis. 
In Section 3 we highlight the specific differences between the
works reviewed here and what we propose to do.


2.1 Help-seeking behaviour 

Knowing when to ask for help is important [4, 11]. [10, 11] 
agreed that it could improve resilience and efficacy. Accord- 
ing to [10], help-seeking has been studied for years but the 
rise of new technologies opens new research opportunities on 
help-seeking in these new contexts. 


Some works have focused on detecting specific situations 
that are known to be problematic. For instance, in classes 
where the teacher has more students than desired, it might 
be difficult for them to identify students who need help or 
are stuck. [18] developed a method using machine learn- 
ing (ML) models to automatically predict wheel-spinning 
and decide how to intervene. [4] addressed both problems
of asking for help too much and not enough by negotiat- 
ing with the student. Rather than using ML models, they 
predefined a set of heuristics. A slightly different situation 
was studied by [32]. Their goal was to find a connection 
between student procrastination (i.e. intentionally delaying 
work) and their activities within different learning materials. 
They used data from a massive open online course (MOOC) 
platform and found two main study strategies: students who 
delayed work worked intensively for short periods followed 
by long pauses, while students who did not delay usually 
split the tasks into subtasks and worked more constantly 
but less intensively. 


Other works have focused on knowledge tracing [6, 7, 24];
however, we will not be considering student knowledge but
their behaviour and interaction with the educational system.


2.2 Pattern recognition and sequence prediction

To cluster categorical sequences, one needs to define the 
function or method used to compare the sequences pair- 
wise, that is, how to measure the distance between them. 
[5] analyzed activity frequency through the length of dif- 
ferent online courses to study if different activity patterns 
were related to student performance. They used agglom- 
erative hierarchical clustering (AHC) with the Levenshtein 
distance. [17] used a similar approach and found patterns in 
group problem-solving strategies analysing group behaviour 
of students working on interactive tabletops. [8] also used 
AHC with the Levenshtein distance and found 3 groups of 
similar study state sequences using data from a drill-and- 
practice learning environment in college mathematics. 


[13] found that a large subgroup of MOOC participants 
might have been engaging by watching video lectures with- 
out doing the assignments. Their methodology consisted of
constructing, for each student, vectors of states represent- 
ing their engagement trajectories through the course. They 
computed the distance between trajectories by assigning a 
numerical value to each label and calculating the L1 norm. 


[14] proposed a method that would capture the clusters’ 
number and size evolution over time. They transformed 
log data sequences into Markov chain models. Then, they 
computed the pairwise similarities by computing the ex- 
pected transition probabilities using the stationary distri- 
bution over the actions. They used the Jensen-Shannon di- 
vergence and the Hellinger distance between the expected 
transition frequencies of the Markov chains (more details in 
[21], as cited in [14]). They used k-means with an evolu- 
tionary clustering method that tracks the evolution of the 
similarities over time by smoothing the similarity matrices 
([31] as cited in [14]). [9] also modelled student behaviour 
using Markov chains. They randomly generated Markov 
chain priors and assigned each sequence to the prior most 
likely to generate it. Then, each prior would be updated to 
the Markov chain generated using its associated sequences. 
These last two steps were repeated until less than 5% of the 
sequences would change their prior. As they stated, this 
method is similar to k-means but with the clustering being 
dependent on the Markov chains instead of on a similarity 
measure performed directly on the sequences. 
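
For reference, the two dissimilarity measures mentioned above have standard definitions for discrete probability distributions. The sketch below implements those textbook definitions on plain probability vectors; it is not the pipeline of [14], which applies the measures to expected transition frequencies of Markov chains, and the function names are hypothetical.

```python
import math


def hellinger_distance(p, q):
    """Hellinger distance between two discrete distributions (range [0, 1])."""
    squared = sum((math.sqrt(a) - math.sqrt(b)) ** 2 for a, b in zip(p, q))
    return math.sqrt(squared / 2.0)


def jensen_shannon_divergence(p, q):
    """Jensen-Shannon divergence in bits (range [0, 1])."""
    mixture = [(a + b) / 2.0 for a, b in zip(p, q)]

    def kl(x, y):
        # Terms with x_i = 0 contribute nothing; y_i > 0 whenever x_i > 0
        # because y is the mixture of x and the other distribution.
        return sum(a * math.log2(a / b) for a, b in zip(x, y) if a > 0)

    return 0.5 * kl(p, mixture) + 0.5 * kl(q, mixture)
```

Both measures are symmetric and bounded, which makes them convenient inputs for a pairwise similarity matrix.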


[29] were able to detect unprofitable learning experiences and
predict student performance by using time series data. They
used dynamic time warping (DTW) to measure the distance
between sequences and performed hierarchical clustering to
find clusters. DTW proved to be useful; however, to
the best of our knowledge, it is not suitable for categorical 
sequences but only for numerical ones and has therefore been 
ruled out as a possible approach to our specific problem. 


Artificial neural networks (ANNs) are known to be useful
for a wide variety of tasks. When it comes to time series, the
most used ones seem to be recurrent neural networks (RNNs)
and long short-term memory networks (LSTMs). RNNs’ main
limitation is their difficulty in working with long sequences due
to the vanishing gradient problem [12]. While LSTMs solve
this issue, they usually take quite a long time to train and have
difficulties in capturing long-term dependencies in long sequences
[12, 30]. Finally, a novel approach called transformer
networks was introduced by [30]. This approach introduces
what the authors called an attention mechanism,
which solves the long-term dependency problem in LSTMs.
[2] used LSTMs in a multi-module system to analyse the 
relationship between intent and user actions in interactive 
systems. [16] used time-aware LSTMs (T-LSTMs), a special
type of LSTM that can handle time irregularities, to
model student knowledge state in continuous time. They
conducted an empirical experiment and found that T-LSTMs
outperformed regular LSTMs, logistic regression and recent
temporal pattern (RTP) mining. [15] used RTPs along with
support vector machine (SVM) and logistic regression to 
predict student performance and detect the need for inter- 
vention using students’ answers to programming exercises. 
They were able to classify students within only 1 minute 
into the exercise. 


3. EXPECTED CONTRIBUTION 


To our knowledge, this would be the first work to use student-platform
interaction data in the form of sequences of actions to
predict help-seeking behaviour while being independent of
the topic being taught. Our approach differs from existing
work by joining three main aspects. First, help-seeking
behaviour: we have found works that linked the need for in- 
tervention with performance or student knowledge [6, 15, 16] 
instead of analyzing the behaviour surrounding actual help- 
seeking actions. Second, time-awareness: we have found 
works that have used cumulative data (e.g. number of at- 
tempts) to predict the need for intervention [18] but did 
not take into account the temporal context. Third, topic- 
independence: we have found works that did take into ac- 
count the temporal context but focused on the content of 
student answers for specific topics [15, 23]. 


The approach we propose would include the temporal con- 
text, would not be dependent on the nature of the content 
being taught and would focus on user-platform interaction 
(e.g. clicking, typing, deleting, consulting theory).


While this research is still in the early stages, we believe 
in the importance of students’ affective state [26, 27, 28] 
and might consider including the affective context if possi- 
ble. Finally, if we were to find successful results, we believe 
that richer predictions could be obtained by joining student 
knowledge information [6, 15, 16] along with the information 
learnt from help-seeking behaviour. However, this is out of 
the scope of this research. 


4. RESEARCH QUESTIONS 


We aim to study whether our proposal would be feasible and 
for that we present two research questions: 


Q1 Are there temporal patterns in students’ help-seeking
behaviour?


Q2 Can temporal student-platform interaction data be used 
to detect students who need help but do not ask for it? 


5. PROPOSED METHODOLOGY 


We will be dealing with both supervised and unsupervised 
problems: we will be using clustering algorithms towards 
answering Q1 and prediction algorithms towards answering 
Q2. We propose to perform clustering (Q1) as a preliminary
step to a more complex system (Q2). Clustering can lead
to interpretable results and reveal information that could be
useful to pedagogical experts, while some prediction methods
are more powerful but may act as a black box. As well as
considering less interpretable methods, the system proposed
in Q2 addresses personalization. We describe the methodology
we intend to follow and the methods we have considered
so far.


5.1 Data 


The dataset to be used is yet to be found or constructed. 
Efforts are being made to find a suitable dataset. Some 
promising options are being considered but are yet to be 
confirmed. Nevertheless, the characteristics that we look for
in a dataset have been defined. The dataset should contain
action logs that originated from the interaction between a
student and a learning platform that has some kind of help
tool that the student can choose to use. Each log should
include, at least: (1) action type, (2) action start time, (3)
action end time, (4) student identification (anonymized) and
(5) exercise identification.


Given that some actions are continuous rather than instan- 
taneous, we will need to decide how to represent this charac- 
teristic. As an example, a student might consult the theory 
section of the system just for 10 seconds, or they could spend 
5 minutes consulting the content. It would be desirable that 
those two cases were not represented in the same way and 
that duration was taken into account. When using Markov
chains, if each action is treated as a single state regardless of
its duration, the probability of staying in the same state will
always be 0, and the duration will not be taken into account.
To solve this, we could consider splitting the actions into time
slots. We will need to take into account that some other actions
might be instant actions, with practically no duration, e.g.
submitting an exercise. We will need to make sure that the model
we use does not underrepresent these actions. Finally, if possible,
we might consider including idle actions, that is, time in 
which the student does nothing. 
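
The time-slot idea can be sketched as follows. This is only an illustration under stated assumptions: the log format (tuples of action type, start time and end time in seconds), the slot length, and the function name are hypothetical, and idle time is not modelled.

```python
import math


def to_time_slots(actions, slot_seconds=10.0):
    """Expand timed actions into a sequence with one symbol per time slot.

    actions: iterable of (action_type, start, end) tuples, times in seconds.
    Long actions repeat their symbol once per slot, so a Markov chain sees
    self-transitions; instantaneous actions still occupy exactly one slot.
    """
    sequence = []
    for action_type, start, end in actions:
        duration = max(0.0, end - start)
        n_slots = max(1, math.ceil(duration / slot_seconds))
        sequence.extend([action_type] * n_slots)
    return sequence
```

With this encoding, a ten-second and a five-minute consultation of the theory section produce sequences of different lengths, while a submission is never dropped for having no duration.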


5.2 Clustering 

To answer Q1, we encounter two main decisions: how to 
determine the distance between sequences and which clus- 
tering algorithm to use. 


5.2.1 Distance between the sequences 

The main challenge of dealing with sequential data is that 
they cannot be directly fed to traditional clustering algo- 
rithms. First, one needs to decide how to represent the 
sequences and define how to compare them. We have de- 
cided to try two methods for representing the distance be- 
tween sequences: Markov chains and the Levenshtein dis- 
tance. Markov chains represent a sequence by considering 
the probability of going from one state (i.e. action) to an- 
other. The basic form of a Markov chain only considers 
the current state to predict the next one. This could be 
a limitation and therefore n-order Markov chains could be 
considered. In a Markov chain of order n, n previous steps 
are taken into account. On the other hand, the Levenshtein
distance is a type of edit distance, that is, the minimum
number of changes required to transform one sequence into
another. The Levenshtein distance considers insertions, deletions
and substitutions.
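
Both representations can be illustrated on sequences of action labels. The sketch below is a minimal example with hypothetical function names: a standard dynamic-programming Levenshtein distance, and maximum-likelihood estimates of first-order Markov transition probabilities.

```python
from collections import Counter, defaultdict


def levenshtein(a, b):
    """Minimum number of insertions, deletions and substitutions
    needed to turn sequence a into sequence b."""
    previous = list(range(len(b) + 1))
    for i, x in enumerate(a, start=1):
        current = [i]
        for j, y in enumerate(b, start=1):
            current.append(min(
                previous[j] + 1,             # delete x
                current[j - 1] + 1,          # insert y
                previous[j - 1] + (x != y),  # substitute x by y (free if equal)
            ))
        previous = current
    return previous[-1]


def transition_probabilities(sequence):
    """First-order Markov transition probabilities estimated from one sequence."""
    counts = defaultdict(Counter)
    for state, next_state in zip(sequence, sequence[1:]):
        counts[state][next_state] += 1
    return {
        state: {nxt: c / sum(nexts.values()) for nxt, c in nexts.items()}
        for state, nexts in counts.items()
    }
```

The functions accept any sequence of hashable labels, so lists of action names such as `["type", "delete", "help"]` work as well as strings.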


5.2.2 Clustering algorithms 

Taking into account existing work, we have narrowed the 
search for a clustering method down to two: hierarchical 
clustering and k-means. The main drawback of k-means is 
the requirement of a predefined number of clusters, which 
in our case is unknown. Hierarchical clustering has the advantage
that the number of clusters can be chosen a posteriori;
however, it can be expensive when dealing with large
datasets. K-means is usually a fast algorithm, although its
speed might depend on the chosen distance metric [19].
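
To make the hierarchical option concrete, the sketch below runs a naive agglomerative clustering on a precomputed pairwise distance matrix, such as one filled with Levenshtein distances. Average linkage and the function name are assumptions made for illustration; the naive search is slow, which reflects the cost concern noted above, and a library implementation would be preferable in practice.

```python
def agglomerative_clusters(dist, n_clusters):
    """Naive average-linkage agglomerative clustering.

    dist: symmetric matrix (list of lists) of pairwise distances.
    Returns a list of clusters, each a list of item indices.
    """
    clusters = [[i] for i in range(len(dist))]

    def linkage(a, b):
        # Average pairwise distance between the members of two clusters.
        return sum(dist[i][j] for i in a for j in b) / (len(a) * len(b))

    while len(clusters) > n_clusters:
        # Find the closest pair of clusters and merge them.
        _, x, y = min(
            (linkage(clusters[x], clusters[y]), x, y)
            for x in range(len(clusters))
            for y in range(x + 1, len(clusters))
        )
        clusters[x] = clusters[x] + clusters[y]
        del clusters[y]
    return clusters
```

Because merging proceeds one step at a time, the same run can be stopped at any number of clusters, which is what allows that number to be chosen after inspecting the result.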


5.3 Prediction


Towards answering Q2, we propose a prediction system; its 
characteristics are presented in this section. 


5.3.1 System structure 

While we want to take advantage of how students in general 
behave, we want to provide a personalized learning experi- 
ence. To do so, a student’s personal traits and tendencies 
must be taken into account. Therefore, we aim to take ad- 
vantage of the general traits of student behaviour while pre- 
serving the personal study tendencies of each student, thus 
combining an inter-subject with an intra-subject approach. 
To achieve this goal, we propose an ensemble system composed
of three blocks. We name the system SHEmblE (Sequence
analysis of Help-seeking behaviour with an ensEMBLE
model for Educational systems).


Firstly, we will have a prediction model that will be trained 
with all the available data. We will refer to this model as 
the common model as it will be shared among all students. 
We expect it to be able to learn the general patterns of help- 
seeking behaviour if those exist. 


Secondly, we will have what we call a personal model. Each 
student will have their own personal model trained with their
own data, if any. We expect this model to be able to learn 
the personal tendencies and preferences of a student. 


Finally, a third model will combine the predictions of com- 
mon and personal models. We call this model the ensemble 
model and we expect it to learn how to combine the predic- 
tions the best way possible. 


We will focus on students who regularly ask for help for the
common model. However, from those students, we will take
into account interactions that exhibit help-seeking behaviour
as well as those in which the student does not need help to
successfully reach their goal. Sequences from students who
never ask for help will not be included, as we cannot know
whether the student really did not need assistance or simply
never asks for it.


We are aware that data size will be a concern regarding 
the personal model. Its goal is to provide individualization, 
and thus, we believe it is an important part of the system 
[1, 7, 22]. Therefore, the ensemble model could take into 
account the amount of data with which the personal model 
was trained in order to weigh the predictions properly. 
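
One simple way to picture such weighting is a fixed shrinkage rule, shown below. This is purely a hypothetical stand-in for the learned ensemble model: the function name, the constant `k`, and the formula are assumptions, not part of the proposed system.

```python
def combine_predictions(p_common, p_personal, n_personal, k=20.0):
    """Blend two help-need probabilities, trusting the personal model
    more as the number of personal training sequences grows.

    k is a hypothetical constant: with n_personal == k, both models
    receive equal weight.
    """
    if p_personal is None or n_personal == 0:
        # No personal data yet: fall back to the common model.
        return p_common
    weight = n_personal / (n_personal + k)
    return weight * p_personal + (1.0 - weight) * p_common
```

A trained ensemble model could learn a similar dependence on the personal training-set size directly from data instead of fixing it in advance.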


Figure 1: General schema of SHEmblE, the proposed system.


The ultimate goal, if this system were to achieve good results,
would be to implement it in a real educational system. Fig- 
ure 1 represents the overall structure of the proposed sys- 
tem. The idea would be to be able to detect, in real-time, 
students that need help and offer it to them. Apart from 
collecting logs from students that ask for help themselves, 
whenever we offer help we would save their response as well. 
The scope of this work comprises the common, personal and
ensemble models; the rest could be the object of study of
future research.


5.3.2 Prediction algorithms 

Time series prediction has been the challenge of many works
in the literature. This work deals with categorical time series,
in other words, categorical sequences. We have narrowed
the candidate methods down to three: artificial neural networks,
hidden Markov models (HMMs), and recent temporal patterns
(RTPs). As discussed in Section 2, ANNs have been used in
problems involving time series data and showed promising
results; different types found in the literature will be considered
(e.g. LSTM, T-LSTM, transformers). HMMs have also been useful
for predicting and classifying action sequences. [20] found
that HMMs needed fewer training samples and less CPU
time while performing similarly to LSTMs. Finally, RTPs [3]
have been successful at similar tasks. [15] used them and 
managed to detect students that needed intervention only 
one minute after starting an exercise. While their data con- 
sisted of attributes of the students’ answers’ content and 
ours will consist of interaction data, we believe that a simi- 
lar approach could be applied to our particular task. 


5.3.3 System training and evaluation 

The system we propose is going to be composed of three 
different models. These models will be evaluated indepen- 
dently and altogether. We intend to evaluate the common 
model by performing a variation of the leave-one-out cross- 
validation (LOOCV) in which in each iteration the whole 
data of one student is left out for validation. Regarding 
the personal model, the dataset will be split by the se- 
quences’ student id and for each student, a LOOCV will be 
performed. The performance of the personal model will be 
assessed by combining all the performances (e.g. mean and
standard deviation) and special attention will be paid to possible
outliers. Finally, the ensemble model block will need to
be fed the predictions of the other two models. Therefore, a
whole new dataset $E$ will need to be constructed such that:


• Consider the set of $n$ students $S = \{s_i \mid i \in \{1, \dots, n\}\}$.


• Each student $s_i$ has $m_i$ sequences
$Q_i = \{q_{ij} \mid j \in \{1, \dots, m_i\}\}$.


• The instance $e_{ij}$ will correspond to the sequence $j$ of
the student $i$ and will contain at least 2 features:


- The output of the common model trained using
the sequences from students other than $s_i$.


- The output of the personal model trained using
the sequences of student $s_i$ other than $q_{ij}$.


Moreover, additional features could be added, such as
the size of the dataset used to train the personal model,
given that some students might have few or no data.


• The dataset $E$ will then have $\sum_{i=1}^{n} m_i$ rows.


Figure 2: Graphical representation of the evaluation scheme
and the generation of the dataset for the ensemble model.


Once the dataset has been constructed, k-fold cross-validation 
can be performed. Figure 2 shows a graphical representation 
of the proposed evaluation method. 
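
The construction of the ensemble dataset can be sketched as below. The trainer interfaces are hypothetical: `train_common` and `train_personal` are assumed to take a list of sequences and return a callable that maps a sequence to a prediction. The field names of each row are likewise illustrative.

```python
def build_ensemble_dataset(sequences_by_student, train_common, train_personal):
    """Build one row e_ij per sequence j of student i.

    sequences_by_student: dict mapping a student id to a list of sequences.
    train_common, train_personal: hypothetical trainers returning a
    callable model(sequence) -> prediction.
    """
    rows = []
    for student, own in sequences_by_student.items():
        # Common model: trained on every student except the current one.
        others = [q for s, qs in sequences_by_student.items() if s != student for q in qs]
        common_model = train_common(others)
        for j, sequence in enumerate(own):
            # Personal model: trained on the student's other sequences.
            rest = own[:j] + own[j + 1:]
            personal_model = train_personal(rest) if rest else None
            rows.append({
                "student": student,
                "sequence_index": j,
                "common_output": common_model(sequence),
                "personal_output": personal_model(sequence) if personal_model is not None else None,
                "personal_train_size": len(rest),
            })
    return rows
```

The common model is retrained once per student and the personal model once per sequence, so the resulting dataset has one row per sequence and no prediction comes from a model that saw the sequence it is scoring.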


6. CONCLUSIONS 


We have not found works that aim to detect students who 
need help by analysing behaviour around help-seeking ac- 
tions using time-aware user-platform interaction data. In 
this Master’s thesis, we aim to study whether such data can
be useful to predict help-request actions and propose an en- 
semble system that combines a shared model and a personal 
model so as to achieve individualization. 


This work is still at a very early stage. Any feedback and
ideas on this proposal are very welcome. Specifically,
comments on the sequence representation and the clustering
and predictive model choices will be appreciated.


7. ACKNOWLEDGEMENTS 

The work is partially supported by UNED’s master’s de- 
gree in Research in AI and the project INT2AFF funded 
under Grant PGC2018-102279-B-I00 (MCIU/AEI/FEDER, 
UE) by the Spanish Ministry of Science, Innovation and Universities,
the Spanish Agency of Research and the European
Regional Development Fund (ERDF).


Proceedings of The 14th International Conference on Educational Data Mining (EDM 2021) 841 


8. 
[1] 


[2 


[3 


[6] 


[9] 


10 


11 


12 


13 


[14] 


842 


REFERENCES 


M. Abell. Individualizing learning using intelligent 
technology and universally designed curriculum. The 
journal of technology, learning and assessment, 5(3), 
2006. 

R. Agrawal, A. Habeeb, and C. Hsueh. Learning user 
intent from action sequences on interactive systems. In 
Thirty-Second AAAI Conference on Artificial 
Intelligence, page 59-64, New Orleans, Louisiana, 
USA, February 2-7 2018. 

I. Batal, D. Fradkin, J. Harrison, F. Moerchen, and 
M. Hauskrecht. Mining recent temporal patterns for 
event detection in multivariate time series data. In 
Proceedings of the 18th ACM SIGKDD international 
conference on Knowledge discovery and data mining, 
pages 280-288, 2012. 

C.-Y. Chou, K. R. Lai, P.-Y. Chao, $.-F. Tseng, and 
T.-Y. Liao. A negotiation-based adaptive learning 
system for regulating help-seeking behaviors. 
Computers & Education, 126:115-128, 2018. ID: 
271849. 

R. Conijn and M. V. Zaanen. Trends in student 
behavior in online courses. In 3rd International 
Conference on Higher Education Advances, pages 
649-657, 2017. 

A. T. Corbett and J. R. Anderson. Knowledge tracing: 
Modeling the acquisition of procedural knowledge. 
User modeling and user-adapted interaction, 
4(4):253-278, 1994. 

A. T. Corbett, J. R. Anderson, V. H. Carver, and 

S. A. Brancolini. Individual differences and predictive 
validity in student modeling. In Proceedings of the 
Sixteenth Annual Conference of the Cognitive Science 
Society, 1994. 

M. Desmarais and F. Lemieux. Clustering and 
visualizing study state sequences. In Proceedings of the 
6th International Conference on Educational Data 
Mining (EDM 2013), pages 224-227, 2013. 

C. Hansen, C. Hansen, N. O. D. Hjuler, $. Alstrup, 
and C. Lioma. Sequence modeling for analysing 
student interaction with educational systems. In 
Proceedings of the 10th International Conference on 
Educational Data Mining, EDM 2017, pages 232-237. 
International Educational Data Mining Society 
(IEDMS), 2017. 

S. Jarveléi. How does help seeking help?—new prospects 
in a variety of contexts. Learning and Instruction, 
21(2):297-299, 2011. 

S. A. Karabenick and R. S. Newman. Help Seeking in 
Academic Settings: Goals, Groups, and Contezts. 
Lawrence Erlbaum Associates, Inc, 2006. 

F. Karim, S$. Majumdar, H. Darabi, and S. Chen. 
Lstm fully convolutional networks for time series 
classification. JEEE access, 6:1662—1669, 2017. 

R. Kizilcec, C. Piech, and E. Schneider. 
Deconstructing disengagement. In Proceedings of the 
Third International Conference on Learning Analytics 
and Knowledge, LAK ’13, pages 170-179. ACM, 2013. 
S. Klingler, T. Kaser, B. Solenthaler, and M. Gross. 
Temporally coherent clustering of student data. In 
Proceedings of the 9th International Conference on 
Educational Data Mining (EDM 2016), pages 


[15] 


[16] 


[17] 


[18] 


[19] 


[20] 


[21] 


[22] 


[23] 


[24] 


[25] 


[26] 


[27] 


202-209, 2016. 

Y. Mao. One minute is enough: Early prediction of 
student success and event-level difficulty during novice 
programming tasks. In Proceedings of the 12th 
International Conference on Educational Data Mining 
(EDM 2019), pages 119-128, 2019. 

Y. Mao, S. Marwan, T. W. Price, T. Barnes, and 

M. Chi. What time is it? student modeling needs to 
know. In Proceedings of The 13th International 
Conference on Educational Data Mining (EDM 2020), 
pages 171-182, 2020. 

R. Martinez, K. Yacef, J. Kay, A. Al-Qaraghuli, and 
A. Kharrufa. Analysing frequent sequential patterns of 
collaborative learning activity around an interactive 
tabletop. In 4th International Conference on 
Educational Data Mining, EDM 2011, pages 111-120. 
CEUR-WS, 2011. 

T. Mu, A. Jetten, and E. Brunskill. Towards 
suggesting actionable interventions for wheel-spinning 
students. In 13th International Conference on 
Educational Data Mining (EDM 2020), pages 
183-193, 2020. 

E. Ofitserov, V. Tsvetkov, and V. Nazarov. Soft edit 
distance for differentiable comparison of symbolic 
sequences. 2019. Preprint available at 
https://arxiv.org/abs/1904.12562. 

M. Panzner and P. Cimiano. Comparing hidden 
markov models and long short term memory neural 
networks for learning action representations. In 
International Workshop on Machine Learning, 
Optimization, and Big Data, pages 94-105. Springer, 
2016. 

L. Pardo. Statistical inference based on divergence 
measures. Chapman and Hall / CRC Press, 2018. 

Z. A. Pardos and N. T. Heffernan. Modeling 
individualization in a bayesian networks 
implementation of knowledge tracing. In International 
Conference on User Modeling, Adaptation, and 
Personalization, pages 255-266. Springer, 2010. 

C. Piech, J. Huang, A. Nguyen, M. Phulsuksombati, 
M. Sahami, and L. Guibas. Learning program 
embeddings to propagate feedback on student code. In 
International conference on machine Learning, pages 
1093-1102. PMLR, 2015. 

C. Piech, J. Spencer, J. Huang, S. Ganguli, 

M. Sahami, L. Guibas, and J. Sohl-Dickstein. Deep 
knowledge tracing. In Proceedings of Advances in 
Neural Information Processing Systems 28 (NIPS 
2015), 2015. 

I. Roll, R. S. d. Baker, V. Aleven, and K. R. 
Koedinger. On the benefits of seeking (and avoiding) 
help in online problem-solving environments. Journal 
of the Learning Sciences, 23(4):537—-560, 2014. 

S. Salmeron-Majadas, R. S. Baker, O. C. Santos, and 
J. G. Boticario. A machine learning approach to 
leverage individual keyboard and mouse interaction 
behavior from multiple users in real-world learning 
scenarios. IEEE Access, 6:39154-39179, 2018. 

S. Salmeron-Majadas, O. C. Santos, and J. G. 
Boticario. An evaluation of mouse and keyboard 
interaction indicators towards non-intrusive and low 
cost affective modeling in an educational context.
Procedia Computer Science, 35:691-700, 2014.

O. C. Santos. Emotions and personality in adaptive 
e-learning systems: an affective computing perspective, 
pages 263-285. Emotions and personality in 
personalized services. Springer, 2016. 

S. Shen and M. Chi. Clustering student sequential 
trajectories using dynamic time warping. In 
Proceedings of the 10th International Conference on 
Educational Data Mining (EDM 2017), pages 
266-271, 2017. 

A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit,
L. Jones, A. N. Gomez, L. Kaiser, and I. Polosukhin.
Attention is all you need. In Proceedings of the 31st 
Conference on Neural Information Processing Systems 
(NIPS 2017), 2017. 

K. S. Xu, M. Kliger, and A. O. Hero III. Adaptive
evolutionary clustering. Data Mining and Knowledge 
Discovery, 28(2):304-336, 2014. 

M. Yao, S. Sahebi, and R. F. Behnagh. Analyzing 
student procrastination in moocs: A multivariate 
hawkes approach. In 13th International Conference on 
Educational Data Mining (EDM 2020), pages 
280-291, 2020. 




Mixed Data Sampling in Learning Analytics 


Julian Langenhagen 
Goethe University Frankfurt, Germany 


langenhagen@econ.uni-frankfurt.de


ABSTRACT 


Technical progress facilitates collecting large amounts and new 
kinds of data in a wide range of areas. That enables versatile new 
possibilities in empirical research, especially with high-frequency 
data. However, researchers are confronted with the problem that 
not all available data have the same (high) frequency. For many 
common methods, it is necessary to adjust the high-frequency to 
the low-frequency data, resulting in a significant loss of 
information. In accounting research, this kind of problem exists 
due to the low-frequency reporting data of companies on the one 
hand and the high-frequency financial market data on the other. A 
promising solution to this problem is the innovative approach of 
mixed data sampling (MIDAS). Since the coexistence of low- 
frequency data (e.g., exam grades) and high-frequency data (e.g., 
learning management system usage data) is also prevalent in 
educational settings, this paper will discuss the first application of 
MIDAS in the field of learning analytics. 


Keywords 


time series, mixed data sampling, regression, prediction models 


1. INTRODUCTION 


Educational Data Mining (EDM) is a comparatively young field 
of research. Even though certain research methods within this area 
are already established, the field regularly benefits from valuable 
contributions from interdisciplinary research approaches [e.g., 
14]. The methods used in Educational Data Mining can be divided 
into four different areas: prediction models, structure discovery, 
relationship mining, and discovery with models [2]. The focus of 
this paper lies on prediction models, especially on those with 
time-series data. Methods in this area include classifications, 
regressions, and latent knowledge estimation [2]. In this area, 
EDM researchers are confronted with a specific problem. The 
variety of available data sources is growing due to the use of 
complex learning management systems (LMS), game-based 
learning applications, or other digital educational tools. These 
sources often contain high-frequency data and therefore offer a lot 
of potential information. However, in the most commonly used 
methods in prediction models, it is usually the case that data from 
different samples have to be brought to the same frequency to be 
analyzed. This can lead to a significant loss of information. For 
example, data from an LMS can be gathered for every possible
usage second to be used as an explanatory variable. However, 
there is usually a low-frequency variable on the other side of the 
equation, such as the exam grade or score. Accounting researchers 
face a similar problem. Companies usually only publish reports at 
a pre-defined low frequency, as financial statements or other 
reports are often made available only annually or quarterly. An 
independent variable available in corresponding research settings 
is, for example, the share price, which, like the data from the 
LMS, can be collected at a very high frequency. However, the 
information contained here cannot be fully included in the 
analysis if the variables on both sides of the equation have to be 
adjusted to the lowest frequency available. A solution to this 
problem in accounting is the method of mixed data sampling 
(MIDAS). As the underlying problem is comparable to typical 
educational research settings, this method will be examined in 
more detail in the following section, and then a possible 
application in Educational Data Mining will be discussed using 
the example of a concrete implementation in the context of a 
learning app in higher education. 


2. MIDAS IN ACCOUNTING RESEARCH 


The basic MIDAS model builds on a regression equation where 
the dependent variable is measured in a lower frequency than one 
or more of the independent variables [5]. The problem of the 
different frequencies on both sides is solved with two separate 
components. In the first component, each time disaggregated 
observation of the higher frequency variable is included separately 
as an independent variable. In other words, if the higher frequency 
data is observable N times within a period, then N separate 
independent variables are included in the regression. This allows 
the independent variable’s effect on the dependent variable to 
evolve over the course of the examined period, even though the 
dependent variable was only measured once. The second 
component of MIDAS is the requirement of each of the N 
regression coefficients to follow a specific function of time that is 
shaped by few estimated parameters. For example, if the temporal 
distribution is assumed to be linear, this condition only requires 
two parameters, namely an intercept and a slope. This condition is 
the key feature of MIDAS to establish a balance between model 
flexibility and parsimony to be able to reasonably interpret the 
results. This basic model can be extended in many different
directions, e.g., to test whether certain events within the observed
period relate to the outcome differently than non-event data [5] or to
evaluate unequally spaced temporal data [11]. MIDAS has so
far been used mainly in accounting and macroeconomics research 
[e.g., 4, 5, 8]. The method is particularly well suited to accounting 
research as companies are legally obliged to publish certain 
economic data such as revenues and costs on a regular low- 
frequency cycle (e.g., quarterly or annually). Share prices, on the 
other hand, can generally be retrieved every second and are 
therefore high-frequency. The job of professional analysts is to 
use this information, among other things, to predict the
companies’ disclosures. For forecasts based on regressions, it is 
usually necessary to adjust the frequency of the different data 
samples to the lowest frequency available. For a share price, this 
could result in the quarterly mean, for example. The information
in the high-frequency stock market data that is lost in this
process therefore makes it harder for analysts to optimize
their forecasts. A previous study has shown that MIDAS can
significantly help analysts with this challenge and improve the 
forecasts accordingly [6]. Building on these findings, the next 
section will discuss whether MIDAS can also help lecturers and 
researchers in education with specific predictions. 
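
To make the two components concrete, the following minimal Python sketch estimates a MIDAS-style regression under the assumption of a linear weight function, i.e., the N coefficients are restricted to beta_j = theta0 + theta1 * j. Under this restriction, the N regressors collapse into two aggregates, so the model can be fit by ordinary least squares. The function names, the toy data and the parameter values are illustrative assumptions and are not taken from [5]; other weight functions would generally require a different estimation procedure.

```python
# Sketch of a MIDAS-style regression with a linear weight function.
# Assumption (for illustration only): beta_j = theta0 + theta1 * j, j = 1..N.


def solve_linear_system(matrix, rhs):
    """Solve matrix @ x = rhs by Gaussian elimination with partial pivoting."""
    n = len(rhs)
    # Augment the matrix so that row operations update both sides at once.
    m = [row[:] + [value] for row, value in zip(matrix, rhs)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        tail = sum(m[r][c] * x[c] for c in range(r + 1, n))
        x[r] = (m[r][n] - tail) / m[r][r]
    return x


def fit_linear_midas(high_freq, low_freq):
    """Estimate (intercept, theta0, theta1) by ordinary least squares.

    high_freq: one list of N high-frequency observations per unit.
    low_freq:  one low-frequency outcome per unit.
    """
    # The linear restriction collapses the N regressors into two aggregates:
    # sum_j x_ij and sum_j j * x_ij.
    design = [
        [1.0, sum(row), sum(j * x for j, x in enumerate(row, start=1))]
        for row in high_freq
    ]
    size = 3
    # Normal equations: (X'X) params = X'y.
    xtx = [
        [sum(r[p] * r[q] for r in design) for q in range(size)]
        for p in range(size)
    ]
    xty = [sum(r[p] * y for r, y in zip(design, low_freq)) for p in range(size)]
    return solve_linear_system(xtx, xty)


def implied_coefficients(theta0, theta1, n_periods):
    """Recover the N per-period coefficients beta_j = theta0 + theta1 * j."""
    return [theta0 + theta1 * j for j in range(1, n_periods + 1)]


# Toy data: 6 units, 4 periods, generated without noise from
# intercept = 2.0, theta0 = 0.5, theta1 = 0.25.
usage = [
    [1, 0, 2, 3],
    [0, 1, 1, 0],
    [4, 2, 0, 1],
    [2, 2, 2, 2],
    [0, 0, 5, 1],
    [3, 1, 0, 0],
]
true_betas = implied_coefficients(0.5, 0.25, 4)
scores = [2.0 + sum(b * x for b, x in zip(true_betas, row)) for row in usage]

intercept, theta0, theta1 = fit_linear_midas(usage, scores)
betas = implied_coefficients(theta0, theta1, 4)
```

Although four per-period coefficients are implied, only three parameters are estimated, which illustrates the balance between flexibility and parsimony described above.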


3. MIDAS IN LEARNING ANALYTICS 


Prediction models are among the most frequently used methods in
Educational Data Mining [1, 7]. Usually, the corresponding
analyses are carried out with classifications or regressions [3]. The 
prediction of exam grades or scores is one of the most frequently 
investigated research questions within prediction models [13]. The 
dependent variables are usually the exam points (regression), the 
exam grade (classification or regression), or the fact of whether a 
student has passed or not (classification). In many cases, usage 
data from learning management systems are used as independent 
variables. This data is usually high-frequency and must be 
adjusted to the low-frequency dependent variables for the methods 
mentioned above. For example, an LMS may record all clicks 
within the system with precise temporal data. Still, in the above 
analyses, this information needs to be restricted, for example, to 
the total number of clicks over the entire period of use [9]. These 
limitations could be overcome with more sophisticated methods. 
For instance, it was shown that GARCH, a method from finance 
research, outperformed the other common methods in multi-modal 
learning analytics [14]. As of this writing, there is no publication 
in the field of Educational Data Mining or Learning Analytics that 
has used the MIDAS approach. The present project aims to fill this
research gap. In a subsequent step, MIDAS could even
be linked to GARCH to further enrich the research setting [10]. 
Thus, the research question of this project is whether MIDAS is 
suitable for predicting exam results in an educational context and 
how the results compare to those of already known methods in 
Educational Data Mining. Previous studies have shown that it is 
important to look not only at aggregate usage data for a given time 
period but also at the distribution and sequence of the 
corresponding data points [e.g., 9, 12]. MIDAS could make a 
valuable contribution to the range of methods already available, as 
it takes into account the high frequency of independent variables 
while still providing well-interpretable and thus actionable results 
for instructors. These results could, for example, be used to build 
an early warning system for students at risk of academic failure. 
The good interpretability of MIDAS results could make it easier 
for instructors to take appropriate measures compared to when 
using complex machine learning algorithms, whose results might 
be much more challenging to interpret. Such an early warning 
system is especially beneficial in lectures where there is little 
performance feedback between students and teachers in general 
(e.g., because there is only one final exam at the end of the 
semester) or due to special conditions (e.g., COVID-19). In such 
cases, all actors involved see the result of learning and teaching 
behavior only at the end of the semester through the exam result. 
Since it is already too late for countermeasures at this point, an 
early warning system with clear recommendations for action 
would be beneficial in such a context. Therefore, MIDAS is a
promising addition to the current variety of methods and should
be considered in future studies.


4. NEXT STEPS 


A first application of the basic MIDAS model will be carried out 
in the following setting as soon as the project’s data collection is 
completed. We developed a mobile learning app for an 
undergraduate accounting course at a large public university in 
Europe. The course is compulsory and ought to be taken in the 
third semester of the bachelor’s program. The course is taken by 
approximately 600 students per semester and consists of a weekly 
lecture, a biweekly exercise, and biweekly tutorials (five meetings 
in small groups). The content of this course includes the basics of 
cost accounting as well as a summary of their significance and 
classification in the management accounting context. The primary 
learning material consists of a slide deck, a collection of exercises 
(with solutions), and a trial exam (all available as PDF files). In 
the evaluations of earlier semesters, students often complained 
that there were no contemporary possibilities to learn the subject 
matter. Therefore, we decided to develop an additional learning 
tool in the form of a smartphone app, which was launched in the 
summer semester of 2019. The use of the app is voluntary, and no 
extra credits or advantages for the final exam can be earned by 
collecting points in the app. The tool is available via a web 
version and as an app in the Google Play Store and the Apple App 
Store. The app’s core element is a database with over 550 
questions that covers all nine chapters of the course. In addition to 
the question types single and multiple-choice, there are also 
sorting and cloze text tasks. The app can be used in three modes: 
The chapter mode can be used to answer specific questions about 
a single chapter. As soon as a student has mastered the problems 
of one chapter, the next chapter is unlocked. In random mode, 
questions are randomly selected from the chapters that have 
already been unlocked in chapter mode. In the third mode, the so- 
called Weekly Challenge, users can compare themselves with 
other students. Once a week, they have the opportunity to answer 
25 questions randomly selected from the chapters already covered 
in the lecture. The results are subsequently displayed in a weekly 
and a semester ranking. For good performances in the Weekly 
Challenge and other learning achievements, students can earn so- 
called badges, which are then displayed in their account under 
their self-chosen username. By answering questions (regardless of 
the mode), students also earn learning points and thus increase 
their learning level. The progress display of the individual 
chapters shows students how well they currently master a 
particular topic. The app has been specifically designed to 
complement the existing course and is not intended to replace 
other learning materials such as the slides or the collection of 
exercises. The app contains an individual explanation for each 
question which is displayed if a wrong answer is given. Thus, 
students can work their way through the catalog of questions 
independently of time and place and eliminate any gaps in their 
understanding without having to rely on the presence of the 
lecturers. This is an essential added value for the students,
especially in such a large course with approximately 600 students 
per semester. The collected app data consists of details about the 
usage behavior of each student (e.g., time of use, performance 
(history) regarding every question, and earned badges). At this 
stage, we already have four semesters of app usage, and the data 
set is growing as the research project is still ongoing. This is 
especially promising as the situation regarding COVID-19 led to
an exogenous shock. While in the years 2018 and 2019, the course
was held face-to-face, in the summer semester 2020, it was
converted into a purely online lecture. Apart from the launch of
the app and the switch to an online lecture, there were no teaching 
design changes over the course of the semesters. The lecturer and 
the learning materials remained constant, as well as the design and 
the grading of the final exam. This unique setting could provide 
valuable insights into the impact of COVID-19 on higher
education. The starting point of the corresponding analysis would 
be a basic linear regression with the exam score as dependent and 
usage data from the app as independent variable. The exam score 
is measured once, while the usage data from the app could be
evaluated for every second of the semester. In this setting, we face
the challenge of unequal frequencies on both sides of the equation
that was described before. If we only took the sum of total
questions answered by a student as the independent variable, we
would miss a lot of information. The type and especially the time
of usage can be decisive for the effect on the exam score. We
would lose all this information by reducing the app usage to
measures like total questions answered. Therefore, the MIDAS
approach offers a promising possibility to extract more insights 
from the data set. A comparative analysis with other already 
known methods in Educational Data Mining or Learning 
Analytics, which take into account the high frequency of data, 
could highlight the additional benefits of MIDAS for this research 
area. 
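
As a hedged illustration of why timing matters, the sketch below bins a hypothetical event log (one timestamp per answered question) into weekly counts per student. The log format, the student identifiers and the semester start date are assumptions made for the example and do not describe the actual app data. Two students with the same total number of answered questions end up with different weekly profiles, which is the kind of information a MIDAS regression could use.

```python
from datetime import datetime


def weekly_counts(events, semester_start, n_weeks):
    """Count answered questions per student and semester week.

    events: iterable of (student_id, timestamp) pairs, one per answer.
    Returns a dict mapping each student_id to a list of n_weeks counts.
    """
    counts = {}
    for student_id, timestamp in events:
        week = (timestamp - semester_start).days // 7
        if 0 <= week < n_weeks:  # ignore events outside the semester window
            counts.setdefault(student_id, [0] * n_weeks)[week] += 1
    return counts


# Hypothetical log; identifiers and dates are invented for illustration.
start = datetime(2020, 4, 20)
log = [
    ("s1", datetime(2020, 4, 21, 9, 30)),
    ("s1", datetime(2020, 4, 21, 9, 31)),
    ("s1", datetime(2020, 5, 6, 18, 0)),
    ("s2", datetime(2020, 5, 9, 10, 0)),
    ("s2", datetime(2020, 5, 10, 8, 15)),
    ("s2", datetime(2020, 5, 10, 23, 59)),
    ("s2", datetime(2020, 4, 19, 12, 0)),  # before the semester start, dropped
]

per_week = weekly_counts(log, start, n_weeks=3)
totals = {student: sum(weeks) for student, weeks in per_week.items()}
```

Here both students have the same total, but one spreads the usage over the semester while the other concentrates it in the last week.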


5. CONCLUSION AND FUTURE WORK 


This paper argued that the innovative MIDAS approach
could be a promising extension of the variety of methods in
Educational Data Mining and Learning Analytics. Further insights
will be gained by testing the approach with the usage data from
our gamified learning app. Based on the findings, it will be further
discussed whether the basic model of MIDAS should be extended
for use in an educational setting or whether other novel
methods should be applied in this setting. In addition, it could be
promising to apply the MIDAS approach to already published
analyses in order to test the corresponding added value. If the
results are as insightful as those in accounting research,
MIDAS could find numerous use cases in Educational Data
Mining and Learning Analytics.


ABSTRACT


Clustering is an important technique in learning analytics
for partitioning students into groups of similar instances.
Application examples include group assignments and student-class
allocation. However, traditional clustering does
not ensure a fair representation in terms of protected
attributes like gender or race, and as a result, the resulting
clusters might be biased. Moreover, traditional clustering
might result in clusters of varying cardinalities, reducing
their actionability for the end user. In many applications,
like group assignment, the capacity of the resulting clusters
should be controllable to allow direct applicability of the
resulting clusters. Furthermore, it is important to be able to
explain why an instance/student is clustered into a specific
cluster and/or which attributes play a crucial role in the
clustering process. We believe that the aforementioned aspects
of fairness, capacity and explainability are important
for the successful application of clustering in the learning
analytics domain.


Keywords 
learning analytics, clustering, fairness, bias, explainability, 
capacity, actionability 


1. INTRODUCTION 


In education, machine learning (ML) has been used in a
wide variety of decision-making tasks, for example, student
dropout prediction [11], education admission decisions [25]
or forecasting on-time graduation of students [16]. Recently,
incidents of discrimination in ML-based decision-making
systems in education, such as grade prediction [4, 15], have
drawn increasing attention from researchers to bias
and fairness in ML [32]. Such discrimination occurs when the
decisions made by ML-based systems work against groups or
individuals on the basis of protected attributes like gender
or race. Bias in education has been studied from many
perspectives, including different sources of bias in education [27],
students’ data analysis [3], racial bias [39] and gender bias [26].
However, ML-based decision-making systems have the potential
to amplify prevalent biases or create new ones and
therefore, fairness-aware ML approaches are also required
for learning environments.


In our research, we are focusing on the fairness of clustering 
methods in learning analytics since clustering is an effective 
method to analyze student data [8, 17, 28, 36]. Cluster- 
ing algorithms are useful tools for partitioning students into 
groups of similar instances [3, 31]. Results from cluster- 
ing methods are applicable in educational activities such as 
group assignments [10] and student team achievement divisions
[37]. However, traditional clustering algorithms do
not take into account fairness w.r.t. protected attributes
like gender or race, because they focus only on the
similarity objective. Moreover, the cardinality of the resulting
clusters is typically not part of the objective function, and
as a result, clusters of very different cardinalities might be
extracted, reducing the usefulness of the results. Finally,
understanding the instance-to-cluster assignments, the important
features for clustering and what characterizes each
cluster (the so-called cluster labels) is not always easy [33].


The aim of this research is to study the fairness, capacity 
and explainability requirements and challenges in the learn- 
ing analytics domain and propose effective solutions that can 
be used by the domain experts. In this direction, we pro- 
pose the concept of fair-capacitated clustering which extends 
traditional clustering focusing on clustering quality to also 
ensure fairness of representation in terms of some protected 
attribute(s) and the applicability of the resulting clusters by 
ensuring balanced cluster cardinalities. Such clusters can be 
exploited by different stakeholders in the learning environ- 
ment: educators can better organize the learning activities, 
e.g., group assignments; students can learn better in a more 
inclusive and equitable environment. 


In another direction, we plan to extend the fair capacitated 
clustering with explainability to give insights to the end users 
about how certain assignment decisions are made, what fea- 
tures are important for clustering and what the extracted 
clusters represent. Such information will allow educators to
customize teaching activities to each group and improve the
learning trajectory of each student, each group and the class
overall.


We believe that the results of our research are useful in 
other domains as well, for example business (clustering customers
in marketing studies, salesmen areas distribution),
traffic (vehicle routing) and communication (network de- 
sign). Moreover, our research contributes to the further 
development of the domain of fairness and responsible AI 
with new methods (for the unsupervised learning problem) 
and application domain (learning analytics). 


The rest of our paper is structured as follows: Section 2 
overviews the related work. Research questions are pre- 
sented in Section 3. Section 4 describes our ongoing work on 
fair-capacitated clustering and preliminary results. Finally, 
conclusions and outlook are presented in Section 5. 


2. RELATED WORK 


Chierichetti et al. [7] first introduced the fair clustering problem
and presented a balance measure for computing fairness
in the resulting clusters. They defined a “fairlet” as a small
cluster that preserves the fairness measure, and then applied the
k-Center clustering algorithm to these fairlets to obtain the
final clusters. In a later study, Backurs et al. [1] described
an algorithm for computing fairlets in nearly
linear time. The problem of fair clustering with multiple
protected attributes was investigated by Rösner
and Schmidt [34] and Bera et al. [2].


The capacitated clustering problem (CCP) was first introduced
by Mulvey and Beck [30] with heuristic and subgradient
algorithms. Later, researchers proposed approaches to
solve the problem for different clustering methods. For
instance, Khuller and Sussmann [19] introduced an approxi- 
mation algorithm for the capacitated k-Center problem. An 
improved version of k-Means algorithm for CCP was pre- 
sented by Geetha et al. [12] with the use of a priority mea- 
sure to assign points to their centroid. Lam and Mittenthal 
[20] proposed a heuristic hierarchical clustering method for 
CCP. 


Recently, quite a few researchers have become interested in
explainable and interpretable clustering models.
Chen et al. [6] proposed a probabilistic discriminative model 
with the ability to learn rectangular decision rules for each 
cluster. Saisubramanian et al. [35] offered a voting method 
to consider which features are meaningful for the end user. 
Moshkovitz et al. [29] used an unsupervised decision tree to 
explain k-Means and k-Medians methods. 


3. RESEARCH QUESTIONS 


We organize the challenges into the research questions Q1 to
Q3, explained hereafter:


Q1: What is fairness in learning analytics and how to mitigate
discrimination in clustering? Fairness in education is
an interesting topic for researchers [5, 9, 13]. We investigate
the fairness terminology in student analytics w.r.t. protected
attributes such as gender and race. Student performance can
also be considered a protected attribute, because in some
cases withholding knowledge of a student’s performance can help
to prevent bias in the grading procedure [23, 24]. Related
work in the fairness-aware ML area depicts a large variety of
approaches that can be categorized into: i) pre-processing
approaches that intervene at the input data [22]; ii) in-processing
approaches that directly tweak the clustering algorithm
to account for fairness [7] and iii) post-processing
approaches that adjust the clustering results to ensure fairness
[38]. We will mainly follow the in-processing approaches
that directly incorporate fairness in the clustering process.
However, such approaches depend on the clustering algorithm
per se; our current work focuses on hierarchical and
partitioning algorithms, and density-based clustering
will be investigated in the future.


Q2: How to satisfy multiple objectives, namely capacity of
clusters and fairness of representation on top of the (standard)
cluster similarity objective? As already mentioned, the
actionability of the results is important. As a concrete example,
consider group assignments: groups should be comparable
to allow for a fair allocation of work among students.
Work on the capacitated clustering problem [30] considers
neither fairness nor explainability. Approaches for fair
clustering also exist [7]. However, approaches that jointly
consider the different objectives do not exist.


Q3: What is the explanation of a (fair-capacitated) cluster- 
ing model and how to find it? The importance of explain- 
able clustering results for the end users has been already dis- 
cussed. Explainability does not only allow for understanding 
how certain decisions are made but also allows for debugging 
of algorithmic decisions and corrections in case of decisions 
based on protected attributes like gender or race. There are 
different aspects to explainability in clustering: understand- 
ing how a certain assignment of an instance to a cluster was 
made, understanding what attributes contributed to clus- 
tering and explaining what each cluster is about (or cluster 
labeling). We will investigate the different aspects to allow 
educators to better understand the groups that are formed 
and to allow both educators and single users/students to 
understand how they fit into a particular cluster. 


4. PRELIMINARY RESULTS ON FAIR CA- 
PACITATED CLUSTERING 


In this section, we present the preliminary results of our
work on the fair-capacitated clustering problem [21]. The
goal is to cluster students into fair groups w.r.t. a single protected
attribute. Gender is typically chosen as the protected
attribute. In other words, we would like to balance
the number of males and females in the resulting clusters,
and our proposed methods should satisfy the group size
constraint in order to make the results more actionable.


We define the problem of (t, k, q)-fair-capacitated clustering
as finding a clustering C = {C_1, ..., C_k} that partitions the
data X into k clusters such that the cardinality of each cluster
C_i ∈ C does not exceed a threshold q, i.e., |C_i| ≤ q (the
capacity constraint), the balance of each cluster is at least t,
i.e., balance(C) ≥ t (the fairness constraint), and the objective
function is minimized. Parameters k, t, q are user-defined and
refer to the number of clusters, the minimum balance threshold
and the maximum cluster capacity, respectively.
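
The two constraints of this definition can be checked with a few lines of Python. The sketch below assumes a binary protected attribute encoded as 0/1 and the ratio-based balance measure of [7]; the function names and the toy data are hypothetical, and the sketch only verifies a given clustering rather than computing one.

```python
def balance(cluster, protected):
    """Balance of one cluster: the smaller of the two group-size ratios."""
    n0 = sum(1 for x in cluster if protected[x] == 0)
    n1 = len(cluster) - n0
    if n0 == 0 or n1 == 0:
        return 0.0  # a cluster that misses one group is fully unbalanced
    return min(n0 / n1, n1 / n0)


def is_fair_capacitated(clusters, protected, t, k, q):
    """Check the (t, k, q) constraints; the objective is not evaluated here."""
    has_k_clusters = len(clusters) == k
    within_capacity = all(len(c) <= q for c in clusters)
    # The balance of a clustering is the minimum balance over its clusters.
    fair = min(balance(c, protected) for c in clusters) >= t
    return has_k_clusters and within_capacity and fair


# Hypothetical toy data: six students with a binary protected attribute.
protected = {"a": 0, "b": 1, "c": 0, "d": 1, "e": 0, "f": 1}
clusters = [["a", "b", "c"], ["d", "e", "f"]]

ok = is_fair_capacitated(clusters, protected, t=0.5, k=2, q=3)
too_small_capacity = is_fair_capacitated(clusters, protected, t=0.5, k=2, q=2)
too_strict_balance = is_fair_capacitated(clusters, protected, t=0.6, k=2, q=3)
```

In this toy example, each cluster holds two members of one group and one of the other, so the clustering is feasible for t = 0.5 and q = 3 but violates a tighter capacity or balance requirement.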


We present a two-step solution to the problem: i) we rely on
fairlets [7] to generate minimal sets that satisfy the fairness
constraint and ii) we propose two approaches, namely hierarchical
clustering (denoted by hierarchical fair-capacitated) and
partitioning-based clustering (denoted by k-Medoids fair-capacitated),
to obtain the fair-capacitated clustering. The
hierarchical approach embeds the additional cardinality requirements
during the merging step while the partitioning-based
one alters the assignment step using a knapsack problem
formulation to satisfy the additional requirements.


We evaluate our proposed methods on four educational
datasets: UCI Student performance¹, PISA test scores²,
OULAD³ and MOOC⁴, containing the demographics, grades
and school-related attributes of students. Table 1 in Appendix A
summarizes the characteristics of the datasets.


We report on clustering quality (measured as clustering cost, 
see Eq. 1), cluster fairness (expressed as cluster balance [7], 
see Eq. 2 and Eq. 3) and cluster capacity (expressed as 
cluster cardinality). The parameters are set as follows: the
minimum threshold of balance t = 0.5, i.e., the minority group
is at least 50% of the size of the majority group in the resulting
clusters; the maximum capacity of clusters q = ⌈|X| · ε / k⌉, where
ε is set to 1.01 and 1.2 for the k-Medoids fair-capacitated and
hierarchical fair-capacitated methods, respectively.


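$$\mathcal{L}(X, S) = \sum_{s_j \in S} \sum_{x \in C_j} d(x, s_j) \qquad (1)$$

$$\text{balance}(C_i) = \min\left( \frac{|\{x \in C_i \mid \psi(x) = 0\}|}{|\{x \in C_i \mid \psi(x) = 1\}|}, \frac{|\{x \in C_i \mid \psi(x) = 1\}|}{|\{x \in C_i \mid \psi(x) = 0\}|} \right) \qquad (2)$$

$$\text{balance}(C) = \min_{i = 1, \dots, k} \text{balance}(C_i) \qquad (3)$$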
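
As an illustration of the clustering cost, the following sketch assumes that Eq. 1 sums the distance between every instance and the center (medoid) of its cluster, and uses the Euclidean distance as an example distance function. The function name and the toy points are hypothetical.

```python
import math


def clustering_cost(clusters, centers, dist=math.dist):
    """Sum of distances between each instance and its cluster center."""
    return sum(
        dist(x, center)
        for cluster, center in zip(clusters, centers)
        for x in cluster
    )


# Hypothetical toy data: two clusters of 2-D points with their medoids.
clusters = [[(0, 0), (0, 1), (0, 2)], [(5, 5), (8, 9)]]
medoids = [(0, 1), (5, 5)]

cost = clustering_cost(clusters, medoids)
```

Any other distance function with the same two-argument signature can be passed as `dist`, for example a distance defined on mixed categorical and numerical student attributes.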


The baselines include a traditional clustering algorithm and
well-known fairness-aware approaches. 1) k-Medoids [18]:
a traditional partitioning technique of
clustering that uses actual instances as centers (medoids),
divides the dataset into k clusters and minimizes the
clustering cost. 2) Vanilla fairlet [7]: a vanilla fairlet
decomposition that ensures fair clusters is generated, and then a
k-Center clustering algorithm [14] is applied to cluster those
fairlets into k clusters. 3) MCF fairlet [7]: an updated
version of the vanilla fairlet in which the fairlet decomposition
is transformed into a minimum cost flow (MCF) problem,
by which an optimized version of the fairlet decomposition in
terms of cost value is computed.


The preliminary results show that our approaches deliver
well-balanced clusters in terms of both fairness and cardinality
while maintaining a good clustering quality. In terms
of clustering cost (Figure 1-a, Appendix B), our approaches
outperform the vanilla fairlet and MCF fairlet methods, although
they are worse than the vanilla k-Medoids
clustering. This is expected because our methods
have to satisfy constraints on fairness and/or cardinality.
MCF fairlet hierarchical fair-capacitated shows the best performance
due to the optimization in the merging step. As
illustrated in Figure 1-b, our methods are comparable to the
competitors regarding fairness; the minimum
threshold of balance t is visualized as a dashed line and the
actual balance from the dataset is plotted as a dotted line.
In Figure 1-c, the maximum capacity thresholds q are presented
by the dashed and dotted lines. In terms of cardinality,
our approaches perform best, with a lower dispersion.
The boxplots of our methods appear as thick lines
because the variation of the capacity of the resulting clusters is
tiny in quite a few cases. MCF fairlet shows the worst performance,
followed by Vanilla fairlet and the vanilla k-Medoids
algorithm.


¹ https://archive.ics.uci.edu/ml/datasets/Student+Performance
² https://www.kaggle.com/econdata/pisa-test-scores
³ https://analyse.kmi.open.ac.uk/open_dataset
⁴ https://github.com/kanika-narang/MOOC_Data_Analysis


5. CONCLUSION AND OUTLOOK 


Investigating the fairness, capacity and explainability
requirements in the learning analytics domain is the main goal
of our research. In this paper, we present the challenges of
our work through 3 research questions. The preliminary results
on the fair-capacitated clustering problem show that our
approaches can satisfy multiple objectives, namely fairness,
capacity and clustering cost. As a next step, we want to
implement an explainable fair clustering algorithm in order to
clarify how instances are assigned to clusters in a fair
clustering method.

APPENDIX
A. DATASET 


Table 1: An overview of the datasets


Dataset                              #instances  #attributes  Protected attribute          Balance score
UCI student performance-Mathematics  395         33           Gender (F: 208; M: 187)      0.899
UCI student performance-Portuguese   649         33           Gender (F: 383; M: 266)      0.695
PISA test scores                     3,404       24           Male (1: 1,697; 0: 1,707)    0.994
OULAD                                4,000       12           Gender (F: 2,000; M: 2,000)  1
MOOC                                 4,000       21           Gender (F: 2,000; M: 2,000)  1


B. UCI STUDENT PERFORMANCE DATASET


a) Clustering quality (lower is better)


b) Clustering fairness (higher is better)

Figure 1: Performance of different methods on UCI student performance dataset - Mathematics subject

ABSTRACT


In this article, we examine digital social networks in
education. On the campus of Norbert ZONGO University, we observed
that the digital device set up to support the learning and
teaching process has not been adopted by users, who prefer social
networks adapted to their smartphones. Most students use digital
social networks, especially WhatsApp and Facebook, to discuss
their courses with their peers or teachers. Yet these
technologies are not designed for educational purposes. After a
survey of 318 students conducted to take into account the needs
of students and teachers, we propose to design an educational
social network. This social network will be integrated into a
distance learning platform under development as part of a
project. We end by presenting the software architecture of our
future educational social network.


Keywords 


Educational social networks, CEHL, West Africa, University 


1. INTRODUCTION 


A Computing Environment for Human Learning (CEHL) is a
computer environment whose purpose is to lead learners to
carry out one or more activities favorable to the achievement of
educational objectives [1]. Such environments are used to support
or encourage learners in learning. The collaborative learning
environment is an example of a CEHL, designed to promote certain
types of interactions, including argumentation, explanation and
conflict resolution. Many more examples of CEHL exist in the
scientific literature.


New research work on CEHL has emerged with the new
capabilities offered by the Internet and new information and
communication technologies [2]. The use of social networks in
education is part of this new work. The educational platforms
used to support teaching and learning are neglected in favor
of social networks such as Facebook and WhatsApp. Yet these
technologies are not designed for educational purposes.

[...] Capus "Towards a Conception and Integration of an
Educational Social Network into an Institutional Learning
Platform". 2021. In: Proceedings of The 14th International
Conference on Educational Data Mining (EDM21). International
Educational Data Mining Society, 852-855.
https://educationaldatamining.org/edm2021/

EDM ’21 June 29 - July 02 2021, Paris, France 


Social networks with an existing educational component often
require a monthly or annual subscription and are often not
designed within an institutional framework. Students feel the
need to use digital platforms in their learning process, which
pushes them towards these technologies that are less suited to
their context.


In view of this observation, we propose to design an educational
social network (ESN) better suited to the West African context.
This ESN could be integrated into the new system [3] proposed
by a group of researchers for the benefit of West African
universities.


In the rest of this paper, we present the context of our study,
then the needs analysis carried out for the implementation of a
new device, and we end with the architecture of our future
device.


2. CONTEXT AND STATE OF THE ART


In this section, we present the context of this research project. We
also give a brief overview of Facebook technology. We end
this part by connecting social networks and computing
environments for human learning.


2.1 Context 


Computing Environments for Human Learning (CEHL) are used
to stimulate and support learning among learners. Despite the
multitude of learning / teaching platforms that exist, universities
in African countries, particularly Norbert ZONGO University
(previously University of Koudougou), have difficulty setting up a
resource sharing platform suited to their context. Some
environments, such as the online learning platform Moodle, are
being deployed in some universities in West Africa to support
learning / teaching. This powerful platform is poorly used or
even abandoned by its first users. Teachers and students are
sometimes tempted to use other technologies such as
WhatsApp [4]-[6] for learning / teaching purposes. To take
their needs into account, researchers [3] propose to set up a
system better suited to the West African context while keeping
existing services.


Students find themselves using other means to share and interact
with their peers and teachers. These means include social
media such as WhatsApp and Facebook. Social media are used
by students as the preferred tools for sharing, communicating and
photographing parts of the course / tutorial sessions (TD). These
tools, easily accessible through their smartphones, are used
practically on a daily basis. Although these media have real
advantages, namely accessibility and flexibility of use [7], it
should be noted that they are not suitable for learning /
teaching. These technologies are used to create groups without
the knowledge of the pedagogy managers, which makes it difficult
to follow up on these groups and to assess their contribution to
improving the learning / teaching of students.


To have a CEHL adapted to the West African context, a project
[3] is already underway to set up a digital device. Our work
complements this project, but we focus more on the social
network aspect.


2.2 Current situation


2.2.1 Social networks 

First introduced by Australian anthropologist John A. Barnes [8],
a social network is defined as a set of social interactions that unite
a group of individuals. These social interactions can be
friendships, family ties, professional ties or specific ties. With the
advent of the Internet, the notion of social network took a new turn
and gave birth to the notion of digital social network. Boyd and
Ellison define digital social networks as "a web service allowing
individuals to build a profile or not created by a combination of
content and, on the other hand, to articulate this public profile
with others" [9]. The most famous social networks these days are
Facebook, Twitter, LinkedIn, MySpace, etc. These digital networks
have drawn the attention of researchers [4], [7], [8] to their
possible use in learning / teaching.


2.2.2 Educational social networks 

An educational social network is first and foremost a social
network. Unlike a general-purpose social network, however, the
individuals it connects are learners and teachers. It is a
network that enables teacher-student and student-student
interactions, whether one-to-one, one-to-many or many-to-many.
Social networks occupy an important place in society nowadays
and are increasingly used by young people. This trend has
prompted teachers to use these networks, including Facebook and
WhatsApp, in the classroom [4], [6], [9].


Some educational social networks have been developed to support 
and stimulate learning among pupils or students. We can cite: 


Learndia¹ is an educational social network and interactive
learning space dedicated to students. It provides students with
course content and a space to simulate assessments. In February
2017, the founders announced the release of the desktop version,
which does not require an Internet connection.


Freasyway² is an international educational social network for
students, institutions and independent teachers. Beyond offering
interactive teaching, this platform gives students the
possibility of obtaining information on the types of procedures to
be carried out with institutions.


Madabooky [10] is an educational social network targeting
terminale and troisième grade students. It was created by three
young Madagascans, Tsira Louis Venceslas, Dada Manacé Sylvano and
Haritiana Rabemanantsoa, after one of them failed the
Baccalauréat exams several times.


¹ https://learndia.com/
² https://freasyway.com/public/


These social networks all share the objective of
stimulating learning among pupils or students. However, these
networks raise two major problems:


▪ Cost: Access to these platforms is conditional on a
  flat-fee subscription. The cost of these platforms is a
  barrier for students. In addition, the cost of the
  Internet connection is an issue. The questionnaire
  found that 87% of students use the Internet connection
  of mobile operators (Orange, Telecel and Moov Africa).


▪ Institutional scope: These educational social networks
  have been developed outside the institutions in charge
  of education (universities, colleges and high schools,
  training centres, institutes, etc.). This causes two
  major problems: students do not always interact with
  their own teachers, and the syllabus may not be
  consistent with their own courses.


Social networks such as Facebook and WhatsApp are widely used by
students nowadays. These technologies, easily accessible via
smartphones, are now used by learners and teachers to learn/teach
[5], [6], [9]. Although used by students for consultation,
collaboration, sharing and production activities in their
learning processes, these technologies were not designed for
pedagogical purposes. Moreover, access to user data for
integration with existing educational platforms such as Moodle is
problematic [6]. Furthermore, these platforms cannot be linked to
the institutional systems set up to support learning. Finally,
the question of access to user data arises with these
applications, even though these data are useful for monitoring
learning and for educational research. In an article entitled
"How WhatsApp makes Educational Data Mining difficult in West
African universities" [6], the authors describe the difficulty
researchers in the field face in obtaining educational data for
their research.


Educational social networks designed to support learning are 
inspiring solutions but they do not address the concerns observed 
on the campus of the Norbert ZONGO University. The design of 
these educational social networks is not adapted to the system set 
up in the universities of Burkina Faso. The cost of these 
applications is a major problem. 


In view of this, we propose an appropriate solution. The
following section presents our proposal.


3. PROPOSED SOLUTION


This section of our paper deals with the analysis of students’ needs 
following a survey conducted on a sample of 318 students from 
public and private universities in Burkina Faso. We propose a 
new learning device and present its architecture. 


3.1 Description of the questionnaire

To understand student practices on campus, we conducted a 
survey with a student questionnaire. This study concerned 318 
students from public and private universities in Burkina Faso. 


The questionnaire consists of four parts: student identification,
access to information and communication technologies (ICT), use
of ICT and use of educational platforms.

The student identification part made it possible to collect the
socio-demographic information of the students (sex, age group,
field of study, university, etc.). The Access to ICT section
allowed us to collect information on the types of electronic
devices available to students and on Internet access at the
university and at home. The last two sections of our
questionnaire concerned the use students make of ICT and of
educational platforms.


3.2 Needs analysis and specifications 

To better understand student practices, we conducted a survey of 
318 students from public and private universities in Burkina Faso. 
This survey allowed us to capture the needs of the students. 


Users: Teachers and students are the future users of our solution. 
Indeed, the network should connect students of the same class and 
their teachers. Also, students should be able to send invitations to 
other students (their elders for example). As for teachers, they 
should be able to send and receive invitations from their 
colleagues and students. An administrator should ensure the 
proper functioning of the system. 


Analysis of student needs: The survey revealed the following
student needs: communication, sharing and mutual support.


Communication is the major problem for students. The number
of students per teacher is so high that students are not
satisfied with the explanations given in class. There is no way to
contact the teacher or ask questions outside the classroom. The
students noted that they have class WhatsApp groups to
communicate, but their teachers are not included in these groups.


Sharing files, lessons, tutorials and exercise solutions is
also a major concern for students. The WhatsApp group serves
as their sharing tool, but storage is a problem: the data of
these groups are quickly deleted when the storage disk is full.


Mutual aid appears important when an exercise or another part of
a course is misunderstood. Students currently rely on Web
searches.


As a result of this analysis, our solution should allow
students and teachers to perform the actions listed in the
following table.


Requirements specification


Student                      Teacher                Admin
▪ Manage an account          ▪ Manage an account    ▪ Manage accounts
▪ Send invitations           ▪ Send invitations
▪ Ask questions              ▪ Answer questions
▪ Answer questions           ▪ Share information
▪ Create discussion topic
▪ Share information


Table 1: Requirements specification
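
The requirements table could be encoded as a simple role-to-action mapping when the device is designed. The sketch below is only one possible representation; the role and action identifiers are hypothetical names chosen for illustration and do not come from the project.

```python
# Hypothetical encoding of the requirements table as role permissions.
PERMISSIONS = {
    "student": {
        "manage_own_account",
        "send_invitations",
        "ask_questions",
        "answer_questions",
        "create_discussion_topic",
        "share_information",
    },
    "teacher": {
        "manage_own_account",
        "send_invitations",
        "answer_questions",
        "share_information",
    },
    "admin": {"manage_accounts"},
}


def can(role, action):
    """Return True if the given role is allowed to perform the action."""
    return action in PERMISSIONS.get(role, set())
```

For instance, `can("student", "ask_questions")` holds, whereas `can("teacher", "ask_questions")` does not, mirroring the table.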


3.3 Functional architecture of the new device

Figure 1 shows the software architecture that we have chosen for
the development of the future device.


[Figure 1: Software architecture. Diagram labels: Social network; EIAH + Social network; University; Local network; Students; Teacher]


The new device complements existing platforms. Figure 1
presents an overview of our future system. Users will
have access to the social network via the Internet or via a local
network. To address the difficulties related to Internet access
on campus, students will be able to access the social network
through a local network. It will be accessible via smartphones,
tablets, laptops and desktops.


This device will be powered by text, audio, video, image and 
podcast data from students and teachers. 


4. CONCLUSION

In this paper, we have dealt with the establishment of a social
network for students from West Africa, particularly those of
Burkina Faso. We have presented the background and the
objectives of this research. To better understand and take into
account the needs of students, we conducted a questionnaire
survey of 318 students. The results of this questionnaire
allowed us to highlight the needs and expectations of students for
a better learning environment. We proposed to design an
educational social network more suited to the learning context of
students in West African universities. We presented the
architecture of our future device. Our goal in this research is to
create a free and open source platform that will bring together a
community of researchers around this theme.


The work already done and presented in this paper is a first step of
this research. The next step will concern the design of this new
device. We plan to evaluate our proposed learning system at two
levels: pedagogical and technical. To do this, we will add experts
in education science to our research team.

ABSTRACT


Automated essay scoring (AES), where natural language 
processing is applied to score written text, can underpin ed- 
ucational resources in blended and distance learning. AES 
performance has typically been reported in terms of correla- 
tion coefficients or agreement statistics calculated between 
a system and an expert human examiner. We describe the 
benefits of alternative methods to evaluate AES systems 
and, more importantly, facilitate comparison between AES 
systems and expert human examiners. We employ these 
methods, together with multi-marked test data labelled by 
5 expert human examiners, to guide machine learning model 
development and selection, resulting in models that outper- 
form expert human examiners. 


We extend previous work on a mature feature-based linear
ranking perceptron model and also develop a new multi-task
learning neural network model built on top of a pre-trained
language model, DistilBERT. Combining these two models' scores
results in further improvements in performance (compared to
that of each single model).


Keywords 
Student Assessment, Metrics, Evaluation, Automated Essay 
Scoring, Natural Language Processing, Deep Learning 


1. INTRODUCTION 


Automated essay scoring (AES) is the task of employing 
computer technology to score written text. Learning to 
write a foreign language well requires a considerable amount 
of practice and appropriate feedback. On the one hand, 


Øistein E. Andersen, Rebecca Watson, Zheng Yuan and Kevin Yet Fong
Cheung “Benefits of alternative evaluation methods for Automated Essay 
Scoring”. 2021. In: Proceedings of The 14th International Conference on 
Educational Data Mining (EDM21). International Educational Data Mining 
Society, 856-864. https://educationaldatamining.org/edm2021/ 

EDM ’21 June 29 - July 02 2021, Paris, France 


AES systems provide a learning environment in which for- 
eign language learners can practice and improve their writ- 
ing skills even when teachers are not available. On the other 
hand, AES reduces the workload of examiners and enables 
large-scale writing assessment. In fact, these technologies 
have already been deployed in standardised tests such as 
the TOEFL and GMAT [7, 6] as well as in a classroom set- 
ting [26]. 


As English is one of the world's most widely used languages,
and learners naturally outnumber teachers, AES systems
aimed at 'English as a Second or Other Language' (ESOL)
are in high demand. Consequently, there is a large body of
literature on AES systems for text produced by
ESOL learners [20, 3, 5, 28, 2, 30, 1, 23, 16], overviews of
which can be found in various studies [25, 22, 15].


AES systems exploit textual features in order to measure 
the overall quality and assign a score to a text. The earli- 
est systems used superficial features, such as essay length, 
as proxies for understanding the text. As multiple factors 
influence the quality of texts, later systems have used more 
sophisticated automated text processing techniques to ex- 
ploit a large range of textual features that correspond to 
different properties of text, such as grammar, vocabulary, 
style, topic relevance, and discourse coherence and cohesion. 
In addition to lexical and part-of-speech (PoS) n-grams, lin- 
guistically deeper features such as types of syntactic con- 
structions, grammatical relations and measures of sentence 
complexity are some of the properties that form an AES 
system’s internal marking criteria. The final representation 
of a text typically consists of a vector of features that have 
been manually selected and tuned to predict a score on a 
marking scale as accurately as possible, an approach which 
has involved extensive work on feature development and op- 
timisation. 


In contrast, the most recent AES systems are based on neural
networks that learn the feature representations automatically,
without the need for this kind of manual tuning [1,
23, 19, 16, 27]. Taking the sequence of (one-hot vectors of
the) words in an essay as input, Alikaniotis et al. [1] and
Taghipour et al. [23] studied a number of neural architectures
for the AES task and determined that a bidirectional
Long Short-Term Memory (LSTM) [14] network was the
best performing single architecture. With recent advances in
pre-trained bidirectional Transformer [24] language models
such as Bidirectional Encoder Representations from
Transformers (BERT) [11], pre-trained language models have been
applied for AES to achieve state-of-the-art performance [19,
16].


Figure 1: Data distributions (0-20 score on x-axis, count on y-axis). Left to right: Full training set (98,138 responses), u400 training set (14,966), test set (364).


The B2 First exam, formerly known as Cambridge English: 
First (FCE), is a Cambridge English Qualification that as- 
sesses English at an upper-intermediate level. We extend a 
mature state-of-the-art feature-based AES system [5, 28, 2], 
researched and developed over the last decade using Cam- 
bridge English’s FCE exam answers and their corresponding 
operational scores as training data. Further, we develop a 
new multi-task learning (MTL) neural network model built 
on top of a pre-trained masked language model — Distil- 
BERT [21]. 


Various evaluation metrics have been used to evaluate AES 
systems, including correlation metrics such as Pearson’s Cor- 
relation Coefficient (PCC) and Spearman’s Correlation Co- 
efficient (SCC), agreement metrics like quadratic weighted 
Kappa [8] (QWK) and quadratic agreement coefficient [13] 
(AC2), and error metrics such as Mean Absolute Error (MAE) 
and Mean Square Error (MSE). 


We introduce novel evaluation methods that employ multi- 
marked test data, where each test item has been labelled by 
more than one expert human examiner, to facilitate compar- 
ison of human and AES system performance. Our methods 
aim to recognise that the set of examiner scores per answer 
represent an acceptable range of scores and thence we aim to 
evaluate AES systems against this set of scores rather than 
against a single gold standard score or via inter-rater agree- 
ment metrics. This is an important distinction given that 
expert examiner performance represents the upper bound on 
the AES task. To the best of our knowledge, this is the first 
work to perform an in-depth comparison of feature-based 
and neural-based AES model performance. Further, we il- 
lustrate that these models can be considered complementary, 
and combined to improve performance. 
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2. DATA 


We employ a large training set, collected by Cambridge
Assessment,¹ comprising almost 50,000 FCE examination
scripts from 2016-20 with operational scores, as well as a
newly created multi-marked test set containing 182 scripts
labelled by 5 expert human examiners.² Each script consists
of two questions, and responses are scored using 4
fine-grained assessment scales: content, communicative
achievement, organisation and language. Each scale provides a
score between 0 and 5 inclusive, and the overall score is
calculated by summing over these 4 individual scales to
provide an answer score in the range 0-20. For this AES task,
we employ the overall 0-20 score to train and test models.³


The full training set contains almost 100,000 individual
responses to over 50 different prompts, all labelled with a score
in the range 0-20, but with an uneven distribution strongly
concentrated around 14 (the score expected for an average
learner having attained the B2 level for which the exam is
designed). In order for the multi-marked test set to include
as wide a range of responses as possible, 182 scripts (each
consisting of two answers) were sampled to provide a more
uniform distribution of script scores in the range 16-40 as well
as a certain number of lower scores (scripts with scores 0-15 are
rarely seen since they correspond to a level far below the one
required to pass the exam); the 364 individual answers show
a relatively uniform distribution of scores above 8. Similarly,
a more balanced training set of just under 15,000 answers (u400)
was extracted from the full training set by excluding
supernumerary scripts from the middle of the scale.⁴ The
resulting distributions can be seen in Figure 1.


3. METRICS 


3.1 Traditional Metrics 


Yannakoudakis & Cummins [29] investigated the appropriateness
and efficacy of evaluation metrics for AES, including
SCC, PCC, QWK and AC2, under different experimental
conditions. They recommend AC2 [13] for evaluation and
reporting SCC and PCC measures for error analysis and
system interpretation. Therefore, we report these three
evaluation metrics (AC2, SCC, PCC), as well as RMSE, which
we consider operationally desirable because it penalises larger
errors more than smaller errors.


¹ https://www.cambridgeassessment.org.uk/

² The operational score, combined with 5 examiner scores,
results in 6 scores per answer in the test data. In contrast,
the training data contains a single operational score.

³ Previously, Yannakoudakis et al. [28] worked at the script
level (i.e. across two answers) and therefore used scores in
the range 0-40.

⁴ Note: u400 was selected to be uniformly distributed at the
script level, with a maximum of 400 randomly selected scripts
for each script score level 0-40.


Ke & Ng [15] provide a survey of AES system research and
popular public corpora employed in evaluation. Most public
corpora contain a single human annotator score, and evaluation
is limited to treating this score as the gold standard.
Such evaluation aids comparison of AES systems, but it
is not possible to determine a reasonable upper bound on
the task.


The CLC-FCE dataset [28] and the Automated Student Assessment
Prize (ASAP) corpus, released as part of a Kaggle
competition,⁵ include scores assigned by four and two human
annotators, respectively. For these, multi-marked corpus
evaluation can be performed against a single reference
score by taking an average of the scores [1, 16].⁶ Alternatively,
agreement between the AES system and (each) human
expert can be compared to inter-rater agreement performance
(which represents the upper bound on the task) [28,
19]. Yannakoudakis et al. [28] calculate the average pair-wise
agreement across all markers (human examiners and AES
system) to produce a single (comparable) metric for SCC
and PCC. We perform inter-rater and rater-to-AES pair-wise
evaluations for SCC, PCC, AC2 and RMSE in our
experimentation, and determine the average performance across
the 5 expert human examiners.
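
The pair-wise evaluation described above can be sketched as follows for two of the metrics, PCC and RMSE. This is an illustrative sketch rather than the authors' code: the helper names are hypothetical, scores are assumed to be aligned lists (one entry per test answer), and a marker is never compared with itself.

```python
from math import sqrt


def pearson(a, b):
    """Pearson's correlation coefficient between two aligned score lists."""
    n = len(a)
    mean_a, mean_b = sum(a) / n, sum(b) / n
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    var_a = sum((x - mean_a) ** 2 for x in a)
    var_b = sum((y - mean_b) ** 2 for y in b)
    return cov / sqrt(var_a * var_b)


def rmse(a, b):
    """Root mean squared error between two aligned score lists."""
    return sqrt(sum((x - y) ** 2 for x, y in zip(a, b)) / len(a))


def average_against_examiners(name, scores, examiners, metric):
    """Average a pair-wise metric between one marker and every other examiner.

    `examiners` maps examiner names to score lists; the marker called `name`
    (an examiner or an AES system) is skipped if it appears in the mapping.
    """
    values = [metric(scores, other)
              for other_name, other in examiners.items()
              if other_name != name]
    return sum(values) / len(values)
```

Calling `average_against_examiners` once per examiner gives the inter-rater figures, and calling it with an AES system's predictions gives the rater-to-AES figures, so both are computed in the same way and remain comparable.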


3.2 Multi-marked Metrics


We also employ a novel evaluation method whereby scores 
are only considered to be erroneous if they fall outside the 
acceptable range of scores, as defined by the set of expert 
human examiner scores considered. We consider two score 
ranges: i) the range of 5 expert examiner scores (ALL) and 
ii) a narrower range (MID3) where we remove the top and 
bottom scores (for each test item). In addition, we report 
performance achieved for each of these ranges after removing 
a single examiner’s score from the range, in turn, so that 
we can compare the performance of each expert examiner 
against the AES models. 


Given a score range, we report the accuracy (percentage of
scores that fall within the range) and a novel RMSE variant,
RMSE^R, which considers the size of the error as equal to the
distance between the score and the range. For example, if
a score falls above the range we calculate the error as the
difference between the score and the highest score in the
range.
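
A minimal sketch of these range-based measures is given below. It assumes that each test item comes with a list of examiner scores and that the narrower range is obtained by dropping the single highest and lowest score; all function names are hypothetical.

```python
from math import sqrt


def score_range(examiner_scores, trim=False):
    """Return (lowest, highest) acceptable score for one test item.

    With trim=True the top and bottom scores are removed first, which
    corresponds to the narrower range when 5 examiner scores are given.
    """
    ordered = sorted(examiner_scores)
    if trim:
        ordered = ordered[1:-1]
    return ordered[0], ordered[-1]


def range_error(score, low, high):
    """Distance between a score and the acceptable range (0 if inside it)."""
    if score < low:
        return low - score
    if score > high:
        return score - high
    return 0


def range_accuracy_and_rmse(predictions, examiner_scores, trim=False):
    """Accuracy (in percent) and range-based RMSE over a set of test items."""
    errors = [range_error(p, *score_range(s, trim))
              for p, s in zip(predictions, examiner_scores)]
    accuracy = 100 * sum(e == 0 for e in errors) / len(errors)
    rmse_r = sqrt(sum(e * e for e in errors) / len(errors))
    return accuracy, rmse_r
```

To compare an individual examiner against the models in the same way, that examiner's score would be removed from each item's list before the range is built and then passed in as the prediction.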


3.3 RMSE_c Graphs


Operationally, the best performing model may not necessarily
be one that achieves the highest performance value based
on a single metric such as AC2. Rather, a model that performs
well across the assessment scale is preferable. Further, it is
possible for models to achieve similar (single) metric
performance but exhibit very different performance distributions
across the scale (cf. uniform vs non-uniform distributions
with the same average).


⁵ https://www.kaggle.com/c/asap-aes

⁶ For ASAP, the resolved score is often employed, which is
calculated as the average between the two human examiner
scores (if the scores are close), or is determined by a third
examiner (if the scores are far apart).


Baccianella et al. [4] argued that macro-averaged metrics,
including macro-averaged root mean squared error (RMSE^M),
are more suitable for ordinal regression tasks. RMSE^M is
calculated by averaging over RMSE_c (RMSE determined for
each score c on the assessment scale). That is, RMSE_c is
RMSE calculated over the subset of test items that are
labelled c. They argue that macro-averaged metrics are more
robust to test set distribution, given that the average results
in equally weighting the error rate for each label in the
assessment scale. Therefore, we report the RMSE^M metric.
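
The per-label and macro-averaged measures can be sketched as below. This is an illustration under one stated assumption: the macro-average is taken over the labels that actually occur in the test set, since labels with no test items have no defined per-label RMSE. The function names are hypothetical.

```python
from collections import defaultdict
from math import sqrt


def rmse_per_label(references, predictions):
    """RMSE computed separately over the test items sharing each reference score c."""
    squared = defaultdict(list)
    for ref, pred in zip(references, predictions):
        squared[ref].append((pred - ref) ** 2)
    return {c: sqrt(sum(v) / len(v)) for c, v in sorted(squared.items())}


def macro_rmse(references, predictions):
    """Macro-averaged RMSE: unweighted mean of the per-label RMSE values."""
    per_label = rmse_per_label(references, predictions)
    return sum(per_label.values()) / len(per_label)
```

With references `[10, 10, 14]` and predictions `[12, 12, 14]`, the per-label values are 2.0 for score 10 and 0.0 for score 14, so the macro-average is 1.0, whereas the ordinary RMSE over the three items is about 1.63 because the more frequent label dominates. Plotting the per-label dictionary against the score gives the kind of per-score view that a single aggregate number hides.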


We also want to explicitly analyse how a model performs
across the assessment scale. Therefore, we employ individual
RMSE_c measures, for each reference score c (0-20), and
produce novel graphs, RMSE_c graphs, where the score (c) is
plotted on the x-axis and the RMSE_c value is plotted on the
y-axis. We also produce RMSE_c^R graphs, where we calculate
RMSE_c values based on our novel RMSE^R variant.


4. AES MODELS 
4.1 Feature-based 


In this work, we extend a mature feature-based AES model [5,
28, 2]: a ranking timed aggregate perceptron (TAP) model
trained on a set of features shown to encode the information
required to distinguish between texts exhibiting different
levels of language proficiency attained by upper-intermediate
learners. Features include ones that can be extracted
directly from the text (word and character n-grams) or a
parsed representation (PoS n-grams and parse rule names),
as well as various statistics (PoS categories, lengths,
readability scores, use of cohesive devices, etc.) and error
estimations (rule-based and corpus-based). We also include
features that measure congruence between question and
answer (similarity between embeddings for different parts), but
that is not the focus of this paper.


Unlike for models used in previous work, the n-gram features 
have been filtered to exclude ones that encode punctuation 
without context; this forces the model to focus on other, pos- 
sibly more relevant, aspects of the text and at the same time 
removes the possibility of artificially inflating model scores 
by adding superfluous punctuation characters. The models 
trained on the full and u400 training sets will be referred to
as TAP and TAP_u, respectively, in the following.
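One possible reading of the n-gram filter described above is sketched below. It assumes that "punctuation without context" means an n-gram made up entirely of punctuation characters; the paper does not give the exact rule, and the function names and example bigrams are hypothetical.

```python
import string

PUNCTUATION = set(string.punctuation)


def is_punctuation_only(ngram):
    """True if every token in the n-gram consists solely of punctuation."""
    return all(all(ch in PUNCTUATION for ch in token) for token in ngram)


def filter_ngrams(ngrams):
    """Keep only n-grams that contain at least one non-punctuation token."""
    return [ng for ng in ngrams if not is_punctuation_only(ng)]


# Hypothetical word bigrams extracted from a learner text.
word_bigrams = [("well", ","), (",", ","), ("!", "!"), ("the", "end")]
kept = filter_ngrams(word_bigrams)
print(kept)
```

Under this reading, a comma that follows a word is still visible to the model, while runs of bare punctuation no longer produce features that could be exploited to inflate a score.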


4.2 Neural Network 


In recent years, fine-tuning pre-trained masked language 
models like BERT via supervised learning has become the 
key to achieving state-of-the-art performance in various nat- 
ural language processing (NLP) tasks. These models often 
consist of over 100 million parameters across multiple layers 
and have been pre-trained on large amounts of existing text 
data to capture context-sensitive meaning of, and relations 
between, words. Following [19, 16], our neural approach 
builds upon this, where we use pre-trained DistilBERT as 
the basis for our neural network model and add additional
layers on top to perform supervised tasks. We choose DistilBERT
for practical reasons: it retains 97% of the language
understanding capabilities of BERT, while reducing parameter
size by 40% and decreasing model inference time by
60% [21].


We treat AES as a sequence regression problem and construct
the input by adding a special start token ([CLS]) to
the full text:


[CLS], w_1, w_2, \ldots, w_t, \ldots, w_n    (1)


This representation is then used as input to the output layer
to perform regression.


Table 1: Average inter-rater and rater-to-AES performance (Ex1-Ex5).

        Op   | Ex1   Ex2   Ex3   Ex4   Ex5  | TAP   TAP_u  NN    TAP+NN  TAP_u+NN
SCC     0.74 | 0.77  0.72  0.75  0.74  0.77 | 0.75  0.74   0.78  0.79    0.78
PCC     0.73 | 0.76  0.69  0.76  0.75  0.76 | 0.74  0.73   0.78  0.78    0.77
AC2     0.90 | 0.92  0.92  0.94  0.94  0.94 | 0.94  0.93   0.94  0.94    0.94
RMSE    2.74 | 2.41  2.44  2.19  2.19  2.25 | 2.20  2.21   2.09  2.08    2.05


Table 2: RMSE using average examiner (Ex1-Ex5) scores (ExAvg).

         TAP   TAP_u  NN    TAP+NN  TAP_u+NN
RMSE     1.70  1.72   1.58  1.56    1.52
RMSE^M   1.70  1.34   1.55  1.54    1.33


Table 3: Accuracy for ALL range.

                 | -Ex1  -Ex2  -Ex3  -Ex4  -Ex5
Op         61.3  | 54.1  55.5  56.0  56.0  59.3
Ex1        *     | 73.4  *     *     *     *
Ex2        *     | *     69.0  *     *     *
Ex3        *     | *     *     76.4  *     *
Ex4        *     | *     *     *     73.6  *
Ex5        *     | *     *     *     *     [illegible]
TAP        78.8  | 71.4  72.8  73.9  74.7  76.1
TAP_u      [row illegible]
NN         81.0  | 75.0  76.9  76.1  76.4  78.0
TAP+NN     84.9  | 78.8  79.1  79.9  81.0  82.4
TAP_u+NN   85.4  | 77.5  80.8  80.5  80.5  82.1


Table 4: Accuracy for MID3 range.

                 | -Ex1  -Ex2  -Ex3  -Ex4  -Ex5
Op         36.0  | 25.5  27.2  26.4  28.3  26.9
Ex1        *     | 46.2  *     *     *     *
Ex2        *     | *     43.1  *     *     *
Ex3        *     | *     *     42.9  *     *
Ex4        *     | *     *     *     40.9  *
Ex5        *     | *     *     *     *     50.0
TAP        59.9  | 46.4  49.5  45.1  49.2  46.7
TAP_u      53.6  | 44.5  45.6  41.5  41.5  42.6
NN         58.2  | 43.4  45.9  42.6  43.7  43.7
TAP+NN     61.8  | 47.0  49.2  46.7  47.3  48.1
TAP_u+NN   [row illegible]


Compared with feature-based models, for neural network
models to be effective, they need to be trained on a large
amount of annotated data. MTL allows models to learn from
multiple objectives via shared representations, using infor- 
mation from related tasks to boost performance on tasks for 
which there is limited target data [18, 10, 31, 9]. Instead of 
only predicting the score of an essay, we extended the model 
to incorporate auxiliary objectives. The information from 
these auxiliary objectives is propagated into the weights of 
the model during training, without requiring the extra la- 
bels at testing time. Inspired by the linguistic features used 
in the feature-based AES systems, we experimented with a 
number of linguistic auxiliary tasks, and identified
dependency parsing as the most effective one.


The neural AES model is developed as an MTL neural network
model trained jointly to perform AES and Grammatical
Relation (GR) prediction. Model weights are shared
among these two training objectives. The final layer for 
the AES objective is a fully connected layer that performs 
regression (i.e. scoring head), while another linear layer is 
introduced to perform token-level classification to predict 
the type of the GR in which the current token is a depen- 
dent (i.e. classification head). The overall loss function is a 
weighted sum of the essay scoring loss (measured as MSE) 
and the dependency parsing loss (as cross-entropy): 


Loss = \lambda Loss_{AES} + (1 - \lambda) Loss_{GR}    (2)


During training the whole model is optimised in an end-to- 
end manner. We refer to the neural MTL model trained on 
the full training set as the NN model in Section 5. 
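The weighted sum in Equation 2 can be illustrated with a small, framework-free sketch. It only shows how an MSE scoring loss and a cross-entropy GR loss are combined; the weight value 0.8, the toy predictions and all function names are assumptions made for illustration, and a real model would compute both losses on tensors inside its training loop.

```python
import math


def mse(predicted, gold):
    """Mean squared error between predicted and gold essay scores."""
    return sum((p - g) ** 2 for p, g in zip(predicted, gold)) / len(gold)


def cross_entropy(prob_rows, gold_labels):
    """Mean negative log-probability assigned to the gold GR label of each token."""
    total = -sum(math.log(row[label]) for row, label in zip(prob_rows, gold_labels))
    return total / len(gold_labels)


def mtl_loss(loss_aes, loss_gr, weight):
    """Weighted sum of the two objectives; weight plays the role of the coefficient in Equation 2."""
    if not 0.0 <= weight <= 1.0:
        raise ValueError("weight must lie in [0, 1]")
    return weight * loss_aes + (1.0 - weight) * loss_gr


# Hypothetical batch: two essays, and two tokens with three candidate GR labels.
loss_aes = mse([12.0, 15.0], [13.0, 15.0])
loss_gr = cross_entropy([[0.5, 0.25, 0.25], [0.25, 0.5, 0.25]], [0, 1])
total = mtl_loss(loss_aes, loss_gr, weight=0.8)  # 0.8 is an illustrative value only
print(loss_aes, loss_gr, total)
```

A weight close to 1 makes the auxiliary parsing objective act mainly as a regulariser, whereas a weight close to 0 would let it dominate training.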


5. EVALUATION 


To facilitate comparison between AES systems and human 
examiners, we employed traditional evaluation metrics as de- 
scribed in §3.1. Table 1 shows average inter-rater or rater-to- 
AES performance in terms of SCC, PCC, AC2 and RMSE 
calculated between 1) operational scores (Op), scores as- 
signed by an expert (Ex1—Ex5) or scores predicted by an 
AES system, and 2) each of the experts’ scores (excluding 
the expert being evaluated, if any).^7 For instance:


SCC(Ex3) = \frac{1}{4} \sum_{i \neq 3} SCC(Ex3, Ex_i)    (3)
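The averaging scheme can be checked against the pair-wise values in the Appendix. The sketch below applies it to PCC, using the Ex3 and NN entries of Table 8; the function name and data layout are hypothetical. An examiner is averaged over the other four examiners, while an AES system is averaged over all five.

```python
EXPERTS = ["Ex1", "Ex2", "Ex3", "Ex4", "Ex5"]

# Pair-wise PCC values copied from Table 8 (only the pairs needed here).
PCC = {
    frozenset(("Ex3", "Ex1")): 0.79,
    frozenset(("Ex3", "Ex2")): 0.71,
    frozenset(("Ex3", "Ex4")): 0.76,
    frozenset(("Ex3", "Ex5")): 0.79,
    frozenset(("NN", "Ex1")): 0.83,
    frozenset(("NN", "Ex2")): 0.69,
    frozenset(("NN", "Ex3")): 0.80,
    frozenset(("NN", "Ex4")): 0.76,
    frozenset(("NN", "Ex5")): 0.81,
}


def average_agreement(pairwise, rater, experts=EXPERTS):
    """Mean agreement between `rater` and every expert other than `rater` itself."""
    others = [e for e in experts if e != rater]
    return sum(pairwise[frozenset((rater, e))] for e in others) / len(others)


print(round(average_agreement(PCC, "Ex3"), 2))  # averaged over four examiners
print(round(average_agreement(PCC, "NN"), 2))   # averaged over five examiners
```

Both results agree with the PCC row of Table 1 after rounding to two decimal places.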


For each metric (row) in Table 1, we have highlighted the
best performance in bold. AC2 gives 7 of the 10 examiners
and models the same (top) score of 0.94 and hence, in our
experimentation, does not aid in system comparison. Apart
from AC2, these traditional evaluation metrics indicate that
the NN model outperforms all examiners and feature-based (TAP)
models. Both TAP models perform comparably to the
individual examiners, that is, they fall in the performance
range achieved by examiners (Ex1-Ex5). Performance of the
combined TAP and NN models (the average score) is shown in
the last two columns of Table 1. Based on these traditional
metrics, it is unclear whether combining models improves
performance. PCC and AC2 indicate no improvement is
made over the single NN model, while SCC and RMSE
indicate that TAP+NN and TAP_u+NN are best, respectively.


^7 For interested readers, we have included pair-wise results
for SCC, PCC, AC2 and RMSE metrics in the Appendix.


Table 5: RMSE^R for ALL range.

                 | -Ex1  -Ex2  -Ex3  -Ex4  -Ex5
Op         1.35  | 1.48  1.46  1.49  1.46  1.43
Ex1        *     | [illegible]  *  *  *  *
Ex2        *     | *     1.16  *     *     *
Ex3        *     | *     *     0.77  *     *
Ex4        *     | *     *     *     0.78  *
Ex5        *     | *     *     *     *     0.93
TAP        0.74  | 0.90  0.92  0.82  0.84  0.79
TAP_u      0.71  | 0.87  0.85  0.83  0.83  0.81
NN         0.64  | 0.81  0.74  0.76  0.77  0.70
TAP+NN     0.62  | 0.79  0.76  0.73  0.74  0.68
TAP_u+NN   0.58  | 0.74  0.68  0.68  0.68  0.65


Table 6: RMSE^R for MID3 range.

                 | -Ex1  -Ex2  -Ex3  -Ex4  -Ex5
Op         1.84  | 2.11  2.03  2.12  2.04  2.04
Ex1        *     | [illegible]  *  *  *  *
Ex2        *     | *     1.77  *     *     *
Ex3        *     | *     *     1.42  *     *
Ex4        *     | *     *     *     1.41  *
Ex5        *     | *     *     *     *     1.48
TAP        [illegible] | [illegible]  1.49  1.42  1.55  1.42
TAP_u      1.21  | 1.51  1.44  1.43  1.52  1.46
NN         1.09  | 1.38  1.31  1.33  1.40  1.31
TAP+NN     1.08  | 1.32  1.32  1.30  1.41  1.28
TAP_u+NN   1.01  | 1.31  1.23  1.25  1.34  1.25


Table 2 compares the AES systems using RMSE and RMSE^M
calculated using the average examiner scores (ExAvg) as the
single reference score. The combined TAP_u+NN achieves
the best RMSE and RMSE^M performance (in line with
average examiner RMSE performance in Table 1). RMSE^M is
the only metric that illustrates a large performance
difference between the TAP and TAP_u models. In fact, TAP_u
significantly outperforms the NN model as well for this metric,
indicating that this model performs better across the
assessment scale than the other AES models. RMSE and RMSE^M,
over ExAvg scores, suggest that there are some small
performance gains made by combining models.


In addition to traditional evaluation methods, we employed
novel multi-marked metrics, as described in §3.2. Tables 3
and 4 illustrate the accuracy (percentage of scores that fall
in range) over the ALL and MID3 ranges, respectively.
Tables 5 and 6 show the corresponding RMSE^R performance
for these ranges. For all four tables, performance is
directly comparable within each column, with the
best performance highlighted in bold.^8 The most important
evaluation relates to the first column for the ALL range in
Tables 3 and 5, as these results compare the performance
of the AES models evaluated against the range of all 5
examiner scores. Other columns in these tables (-ExN) facilitate
comparison between the AES systems and each human examiner
(N).


^8 Note that the asterisk symbol in these four tables indicates
that the score is part of the acceptable range.


Accuracy and RMSE^R metrics are complementary, as
accuracy represents the proportion of scores that are correct
while RMSE^R evaluates the degree to which scores fall
outside the range of human examiner scores. Operationally, we
consider RMSE^R more important than accuracy, given AES
systems should be consistent and errors, when they do
occur, should be penalised to a greater degree as the scores
fall further outside the range of human examiner scores.
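The two range-based metrics can be sketched as follows. The sketch assumes, in line with the description in the Conclusions, that a score inside the examiners' range counts as correct with zero error, and that a score outside it is penalised by its distance to the nearest bound. Function names and toy scores are hypothetical; which examiner scores define the range (for example ALL or MID3) is decided by the caller.

```python
import math


def range_error(score, examiner_scores):
    """Distance from `score` to the nearest bound of the examiners' range (0 if inside)."""
    low, high = min(examiner_scores), max(examiner_scores)
    if score < low:
        return float(low - score)
    if score > high:
        return float(score - high)
    return 0.0


def range_accuracy(scores, examiner_score_sets):
    """Percentage of scores that fall inside the range of examiner scores."""
    hits = sum(1 for s, ex in zip(scores, examiner_score_sets) if range_error(s, ex) == 0.0)
    return 100.0 * hits / len(scores)


def rmse_r(scores, examiner_score_sets):
    """RMSE^R: RMSE computed over the range errors instead of point errors."""
    squared = [range_error(s, ex) ** 2 for s, ex in zip(scores, examiner_score_sets)]
    return math.sqrt(sum(squared) / len(squared))


# Hypothetical system scores and five examiner scores per essay.
system_scores = [12, 15, 9, 18]
examiner_sets = [
    [11, 12, 13, 12, 14],
    [10, 11, 12, 12, 13],
    [9, 10, 10, 11, 12],
    [13, 14, 14, 15, 16],
]
print(range_accuracy(system_scores, examiner_sets))
print(rmse_r(system_scores, examiner_sets))
```

Two systems with the same accuracy can differ in RMSE^R when one of them misses the range by a wider margin, which is why the two metrics are reported side by side.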


Tables 5 and 6 suggest that NN outperforms both TAP
models and all human examiners, while both TAP models
perform comparably to the individual examiners, in
line with evaluation based on traditional metrics in Table 1.
However, in contrast to the metrics discussed thus far, the
RMSE^R metric indicates combined models outperform their
corresponding individual models. This improvement is more
evident for TAP_u+NN, which outperforms all human
examiners and AES models across both ranges.


As described in §3.3, we produced novel RMSE_c graphs
to compare model performance across the assessment scale.
RMSE_c (and RMSE^R_c) graphs for the single and combined
AES models are shown in Figure 2. The Op and ExAvg
graphs plot RMSE_c calculated against the operational and
average examiner scores (i.e. c on the x-axis), respectively.
The bottom graph, an RMSE^R_c graph, plots the RMSE^R
performance for the ALL range where the c score (x-axis) is
the average examiner score in the ALL range (i.e. using the
same distribution of test items as the ExAvg RMSE_c graph).


Comparing the AES models across the assessment scale, we
can see that all AES models follow a similar pattern; they
perform better in the mid ranges and worse in the lower and
upper score ranges. This finding is not unexpected, given we
have ample training data in the mid ranges and very little
training data in the upper and lower ranges of the assessment
scale (see Figure 1). The TAP_u model, trained over a more
uniformly distributed training set, trades small declines in
performance in the middle of the scale for more consistent
results across the scale, in line with the RMSE^M evaluation
metric. The NN model achieves better performance in the
upper and lower scores compared to TAP, suggesting that it
is more robust over skewed training datasets. However, as
evident in these RMSE_c graphs, the TAP and NN models
tend to perform better in particular ranges of the scale;
hence these models are complementary, and combined models
benefit from the relative strengths of individual models
across the scale.


6. CONCLUSIONS 


We deployed two types of AES systems: feature-based and 
neural network. We found that the NN model is more ro- 
bust over skewed datasets as it achieves better performance 
in the upper and lower scores. However, the feature-based 
models are more interpretable, require significantly less com- 
putational overhead to train and can be trained over much 
smaller datasets than neural-based models. The TAP_u model,
trained over a more uniform subset of the training data,
performed more consistently than NN across the assessment
scale. We illustrated that feature-based TAP and NN models
are complementary, and combined models benefit from
the relative strengths of individual models across the scale,
outperforming human examiners. In operational deployment,
the best performing TAP_u+NN model can make effective
use of the constantly growing set of training data by
retraining TAP_u frequently to incorporate any new information
available and only retraining the NN models over the
full training set from time to time.


We presented novel approaches to evaluating AES that make 
use of multi-marked/annotated data. These approaches have 
advantages over traditional evaluation methods and also demon- 
strate the value of using resources to repeatedly annotate 
essays for the AES context. Building on the recommenda- 
tions made by Yannakoudakis & Cummins [29], we make the 
following observations and suggestions for those working on 
AES: 


• In addition to RMSE^M, we recommend calculating RMSE_c
  and plotting RMSE_c graphs to explicitly analyse how
  system performance varies across an assessment scale.


• We recommend that, where feasible, a proportion of
  texts in evaluation sets should be annotated by multiple
  examiners to allow different forms of evaluation
  that account for rating variability exhibited by human
  examiners.


• Where multiple human-derived scores are available,
  system performance should be evaluated using methods
  that incorporate the range of scores given for each
  text. We recommend using a novel RMSE variant,
  RMSE^R, that considers the size of the error as equal
  to the distance between the score and the upper or
  lower bound of the range.


• Where multiple human-derived scores are available, we
  also recommend that the accuracy of a system is
  calculated by treating texts scored within the range of
  scores provided by humans as correct classifications.


Further work is needed to explore the evaluation approaches
proposed here to establish how they vary in different
contexts, to inform how they should be interpreted. For
example, we expect these evaluation metrics to behave
differently according to the granularity of the reporting scale, the
distribution of evaluation sets and the inter-rater reliability
observed between human examiners. Therefore, these measures
should be investigated systematically in terms of their
robustness to trait prevalence, robustness to marginal
homogeneity and robustness to scale scores, in a similar vein
to simulations reported by Yannakoudakis & Cummins [29].


We have demonstrated the value of producing multi-marked
data to support evaluation. However, our proposed metrics
can be refined further to allow for more sophisticated uses
of multi-marked data, by incorporating methods commonly
used for psychometric evaluation and quality assurance, such
as Many-Facet Rasch Measurement [17, 12]. Further work
should explore how these methods can account for examiner
reliability issues when making use of multi-marked data.


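[Figure 2 image: three line graphs (Op, ExAvg and ALL range), each
with one series per AES model.]

Figure 2: RMSE_c graphs for operational score (Op) and
average examiner score (ExAvg). RMSE^R_c graph for the
ALL range.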
APPENDIX
A. FULL PAIR-WISE RESULTS


We include, in the Appendix, individual pair-wise inter-rater and rater-to-AES performance, across the 5 examiners, for
operational scores (Op), each human examiner (Ex1-Ex5) and the AES models for SCC, PCC, AC2 and RMSE. Results in
the last row in each table, the average of the Ex1-Ex5 scores in each column, can be seen in Table 1.


Table 7: SCC (best score per row shown in bold).

               Op   | Ex1   Ex2   Ex3   Ex4   Ex5  | TAP   TAP_u  NN    TAP+NN  TAP_u+NN
Op             *    | 0.76  0.69  0.76  0.72  0.75 | 0.73  0.73   0.79  0.77    0.77
Avg (Ex1-Ex5)  0.74 | 0.77  0.72  0.75  0.74  0.77 | 0.75  0.74   0.78  0.79    0.78


Table 8: PCC (best score per row shown in bold).

               Op   | Ex1   Ex2   Ex3   Ex4   Ex5  | TAP   TAP_u  NN    TAP+NN  TAP_u+NN
Op             *    | 0.75  0.68  0.76  0.73  0.72 | 0.73  0.74   0.77  0.77    0.77
Ex1            0.75 | *     0.66  0.79  0.76  0.82 | 0.79  0.79   0.83  0.83    0.83
Ex2            0.68 | 0.66  *     0.71  0.70  0.68 | 0.68  0.65   0.69  0.70    0.69
Ex3            0.76 | 0.79  0.71  *     0.76  0.79 | 0.75  0.73   0.80  0.80    0.79
Ex4            0.73 | 0.76  0.70  0.76  *     0.77 | 0.73  0.74   0.76  0.77    0.77
Ex5            0.72 | 0.82  0.68  0.79  0.77  *    | 0.76  0.76   0.81  0.81    0.80
Avg (Ex1-Ex5)  0.73 | 0.76  0.69  0.76  0.75  0.76 | 0.74  0.73   0.78  0.78    0.77


Table 9: AC2 (best score per row shown in bold).

               Op   | Ex1   Ex2   Ex3   Ex4   Ex5  | TAP   TAP_u  NN    TAP+NN  TAP_u+NN
Op             *    | 0.90  0.88  0.91  0.89  0.90 | 0.88  0.89   0.89  0.89    0.90
Ex1            0.90 | *     0.90  0.93  0.93  0.94 | 0.93  0.94   0.94  0.94    0.95
Ex2            0.88 | 0.90  *     0.94  0.92  0.93 | 0.92  0.90   0.92  0.92    0.92
Ex3            0.91 | 0.93  0.94  *     0.95  0.95 | 0.94  0.94   0.95  0.95    0.95
Ex4            0.89 | 0.93  0.92  0.95  *     0.95 | 0.94  0.93   0.94  0.94    0.94
Ex5            0.90 | 0.94  0.93  0.95  0.95  *    | 0.94  0.94   0.95  0.95    0.95
Avg (Ex1-Ex5)  0.90 | 0.92  0.92  0.94  0.94  0.94 | 0.94  0.93   0.94  0.94    0.94


Table 10: RMSE (best score per row shown in bold).

               Op   | Ex1   Ex2   Ex3   Ex4   Ex5  | TAP   TAP_u  NN    TAP+NN  TAP_u+NN
Op             *    | 2.72  2.92  2.58  2.72  2.78 | 2.93  2.71   2.74  2.79    2.64
Ex1            2.72 | *     2.77  2.30  2.30  2.28 | 2.29  2.15   2.05  2.10    1.99
Ex2            2.92 | 2.77  *     2.30  2.20  2.48 | 2.22  2.40   2.26  2.17    2.24
Ex3            2.58 | 2.30  2.30  *     2.08  2.07 | 2.20  2.24   2.06  2.07    2.05
Ex4            2.72 | 2.30  2.20  2.08  *     2.15 | 1.95  2.01   1.90  1.85    1.84
Ex5            2.78 | 2.28  2.48  2.07  2.15  *    | 2.34  2.25   2.20  2.21    2.13
Avg (Ex1-Ex5)  2.74 | 2.41  2.44  2.19  2.19  2.25 | 2.20  2.21   2.09  2.08    2.05
ABSTRACT


Students using self-directed learning platforms, such as Duolingo, 
cannot be adequately assessed relying solely on responses to 
standard learning exercises due to a lack of control over learners’ 
choices in how to utilize the platform: for example, how learners 
choose to sequence their studying and how much they choose to 
revisit old material. To provide accurate and well-controlled 
measurement of learner achievement, Duolingo developed two 
methods for injecting test items into the platform, which,
combined with Educational Data Mining techniques, yield insights
important for product development and curriculum design. We 
briefly discuss the unique characteristics and advantages of these 
two systems - Checkpoint Quiz and Review Exercises. We then 
present a case study investigating how different study approaches 
on Duolingo relate to learning outcomes as measured by these 
assessments. We demonstrate some of the unique benefits of these 
systems and show how educational data mining approaches are 
central to making use of this assessment data. 


Keywords 


online learning; language learning; assessment; regression 


1. INTRODUCTION 


Online learning platforms have at their disposal large volumes of 
data about how students engage with learning material, how they 
navigate educational software, and how the learning process 
unfolds over time. Using a variety of methods - machine learning, 
statistics, psychometrics, etc. - Educational Data Mining (EDM) 
and Learning Analytics (LA) researchers identify students at risk 
of dropout from a course [e.g., 13], detect changes in study 
behavior [e.g., 11], predict exam performance [e.g., 1, 4, 12], and 
characterize the different learning strategies that learners adopt 
[e.g., 1, 12]. 


Duolingo is a learning platform that provides free language
education through mobile apps and a website. With around 40
million users active on the platform each month, Duolingo may
well possess the largest language learning dataset of any company
or research institution. Researchers at Duolingo leverage
EDM/LA methodologies to mine datasets - including internal
assessment and log data - for insights that inform improvements to
the learning experience, help identify opportunities for changes to
curriculum design, and fuel research on second language (L2)
learning more generally.


Lucy Portnoff, Erin Gustafson, Joseph Rollinson and Klinton Bicknell
“Methods for Language Learning Assessment at Scale: Duolingo Case
Study”. 2021. In: Proceedings of The 14th International Conference on
Educational Data Mining (EDM21). International Educational Data Mining
Society, 865-871. https://educationaldatamining.org/edm2021/

EDM ’21 June 29 - July 02 2021, Paris, France


Due to the self-directed nature of the Duolingo learning platform 
and the desire for holistic learner assessment, we have developed 
two assessment systems - the Checkpoint Quiz and Review 
Exercises - that allow for carefully controlled measurement of 
learner achievement. These two assessments were designed with
the challenges of gamified platforms in mind,
including ensuring the learning experience remains motivating
and maintaining a scalable content creation process.


The utility of the Checkpoint Quiz and Review Exercises for 
assessing learner achievement depends, at least in part, on the 
high volume of data collected from Duolingo learners and the 
EDM methodologies that can be applied to that data. By 
leveraging predictive modeling and natural language processing 
(NLP) methods, we are able to control for the various ways that 
learners choose to navigate through the platform. Further, these 
methods allow us to uncover useful insights into how this 
variation in user navigation relates to learning outcomes - insights 
that we can leverage for product development and curriculum 
design. In this paper, we present two of our assessment systems 
and a case study highlighting the importance of applying EDM 
methodologies to derive insights from Duolingo assessment and 
log data. 


2. RELATED WORK 


Most EDM/LA applications at Duolingo focus on pedagogy- 
oriented issues [10] or computer-supported predictive analytics 
[2]. Most relevant to the current work are studies focused on 
predicting performance on upcoming course exercises [9] and 
predicting performance on an assessment [1, 4, 12].


Rather than relying on assessment data, some systems discussed 
in other studies instead model student interaction with and 
performance on individual course exercises. Knowledge tracing 
[7] is a popular approach for maintaining a model of whether 
students have learned specific concepts in a course. One system 
[9] compared the performance of a Bayesian Knowledge Tracing 
(BKT) model with a Deep Knowledge Tracing (DKT) model 
using Long Short-Term Memory (LSTM) to better capture
longer-term learning. These models predicted future performance on
exercise x_{t+1} given the previous performance record for a student
(x_1, ..., x_t). This system treats every student interaction as an
opportunity for assessment and the model output was used for 
developing student-facing modules for progress tracking and 
content recommendation. However, knowledge tracing 
approaches primarily focus on characterizing mastery of specific 
concepts rather than providing a holistic assessment of knowledge 
or achievement. 
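For readers unfamiliar with knowledge tracing, the textbook Bayesian Knowledge Tracing update is sketched below. It is not the system described in [9]: the four parameter values are hypothetical, and the code only shows how the estimated probability that a concept is mastered is revised after each observed response and then used to predict the next one.

```python
def predict_correct(p_known, p_slip, p_guess):
    """Probability that the next response is correct under the current mastery estimate."""
    return p_known * (1.0 - p_slip) + (1.0 - p_known) * p_guess


def bkt_update(p_known, correct, p_slip, p_guess, p_learn):
    """Revise the mastery estimate after observing one response."""
    if correct:
        evidence = p_known * (1.0 - p_slip)
        total = evidence + (1.0 - p_known) * p_guess
    else:
        evidence = p_known * p_slip
        total = evidence + (1.0 - p_known) * (1.0 - p_guess)
    posterior = evidence / total
    # The learner may also acquire the concept during this practice opportunity.
    return posterior + (1.0 - posterior) * p_learn


# Hypothetical parameters for a single concept.
P_SLIP, P_GUESS, P_LEARN = 0.1, 0.2, 0.1
p_known = 0.5
print(predict_correct(p_known, P_SLIP, P_GUESS))
for response in [True, True, False]:
    p_known = bkt_update(p_known, response, P_SLIP, P_GUESS, P_LEARN)
    print(response, round(p_known, 3))
```

Each concept carries its own mastery estimate, which is why this family of models characterises mastery of specific concepts rather than overall achievement.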


Other studies rely both on knowledge tracing and assessment data 
to analyze course effectiveness and provide this more holistic 
view. This approach is especially useful in more self-directed 
learning platforms. One study [4] used BKT to characterize 
learning using a digital game and used outputs from these models 
to predict post-test scores following a period of learning with the 
game. They found that mastery scores for two knowledge 
components (output from BKT models) had positive and 
significant association with post-test scores. Insights from the 
BKT model itself were also useful for identifying concepts that 
are difficult for students to master, which highlights opportunities 
for improving course effectiveness. This study also found 
evidence that learners have poor meta-cognition about their 
mastery of key concepts; when left to use the learning platform 
freely, many students continue to practice concepts the BKT 
model predicts they have mastered rather than moving on to new 
material. 


Knowledge tracing is not the only approach used for 
characterizing student behavior using clickstream or log data. To 
make log data useful for predictive modeling, many researchers 
turn to methods from NLP to aggregate events [1, 12]. Simple 
methods include calculating n-grams for particular event types. 
For example, unigrams can capture the number of times a student 
completes a particular learning module and bigrams can capture 
the number of times students complete two modules in sequence 
[12]. Such data can be used as inputs into predictive models either 
relying solely on raw n-gram counts [12] or by processing the data 
further using unsupervised machine learning methods - such as 
hierarchical clustering - to identify common sequence patterns [1]. 
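The n-gram aggregation of log events can be sketched with the standard library. The event names below are hypothetical module completions; a real pipeline would read them from a learner's time-ordered activity log.

```python
from collections import Counter


def event_ngrams(events, n):
    """Count every run of n consecutive events in a time-ordered log."""
    return Counter(tuple(events[i:i + n]) for i in range(len(events) - n + 1))


# Hypothetical sequence of completed modules for one learner.
log = ["Basics 1", "Phrases", "Basics 1", "Phrases", "Travel"]

unigrams = event_ngrams(log, 1)  # how often each module was completed
bigrams = event_ngrams(log, 2)   # how often two modules were completed in sequence
print(unigrams)
print(bigrams)
```

The resulting counts can be used directly as model features, or passed to a clustering step to group learners with similar sequence patterns.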


3. DUOLINGO ASSESSMENT SYSTEMS 


3.1 Duolingo Course Structure 

Duolingo courses are organized into a series of units, each of 
which concludes with a Checkpoint. Courses used by the majority 
of learners have the following structure: 25-30 skills per unit with 
five difficulty levels per skill and 5-6 lessons per level. Skills are
designed around a particular theme (e.g., Travel). The vocabulary 
taught in the skill is aligned around that theme (e.g., hotel, airport, 
passport) and grammatical topics tend to be consistent across 
lessons within a skill. Lessons typically consist of 12-15 exercises 
designed to teach some vocabulary and/or grammatical concept. 
Duolingo curriculum designers incorporate aspects of spiral 
curriculum [5] to revisit familiar concepts in more complex 
contexts in future skills. See Figure 1 for an example of the 
typical Duolingo course structure. 


The five levels for each skill provide a scaffolded learning
experience, where learners review the same vocabulary or
grammatical concepts in increasingly difficult contexts. All skills
start with a foundational Level 0 and as learners “level up” a skill
they see the same sequence of lessons teaching the same content
but using different exercise types. Early levels include exercises
that focus on passive recognition, such as matching a second
language (L2) word/picture pair with the corresponding word in
the first language (L1; see Figure 2). Exercises in later levels are
more difficult, as they require recall and production in the L2
(e.g., translating an L1 sentence into L2; see Figure 2). The level
achieved for a given skill is indicated in the user interface with a
number inside a crown icon (see Figure 1).


[Figure 1 image: course skill tree in which Checkpoint 1 hosts the
Unit 1 Quiz (Unit 1 post-test and Unit 2 pre-test) and Checkpoint 2
hosts the Unit 2 Quiz (Unit 2 post-test and Unit 3 pre-test).]


Figure 1. Duolingo course and Checkpoint Quiz design.


[Figure 2 image: four example exercise screens, including a
picture-matching prompt (Which of these is “the cheese”? with the
options la fresa, el queso, el pescado and la carne), a “Translate
this sentence” prompt and two “Tap the translation” prompts.]


Figure 2. Example exercise types. Top left: passive
recognition; top right: recall and production. Bottom left:
recall L2>L1; bottom right: recall L1>L2.
When learners begin a Duolingo course, not all skills in the first
unit are immediately available; a row unlocks once Level 0 is
complete for all skills in the prior row. For example, only the
Basics 1 skill is available at first and the next set of skills in the
row below Basics 1 (e.g., Phrases and Travel; see Figure 1) will
only unlock once Basics 1 reaches Level 1. Once skills are
unlocked, learners are free to return to them to practice previously 
studied material and “level up” the skill. Duolingo learners are, 
therefore, given agency to choose their learning path. Some 
learners prefer to attempt only the foundational level in a skill 
(Level 0) before moving on to new material, while others prefer to 
level up all skills. Leveling up is entirely optional and learners are 
required to complete only the foundational level for each skill 
before they can move on to the next unit of content. This self- 
directed nature of the learning platform provides challenges for 
assessing learner achievement. 


Other modes of learning are available to users outside of course 
skills. Learners can build reading and listening proficiency 
through the Stories feature, which reinforces unit content through 
interactive dialogues with exercises to check comprehension. 
Learners can also complete generalized practice sessions, which 
drill users on content they have already studied from throughout 
the course. Further, after learners have leveled a skill up all the 
way, they can return for skill practice to reinforce their 
knowledge. If learners find skill material too easy, they also have 
the option to “test out” of a level and jump to harder exercises at 
the next level. 


We use a variety of methods to assess learner achievement and 
proficiency throughout a Duolingo course. In the sections below, 
we describe two of the core assessments in use today: Checkpoint 
Quiz and Review Exercises. 


3.2 Checkpoint Quiz 

For a subset of Duolingo’s courses, learners must complete a 
custom-built assessment once they finish a unit and reach a 
Checkpoint. The Checkpoint Quiz is an achievement test that 
measures the extent to which our learners have achieved the 
objectives for each unit of a course. Checkpoint Quiz items 
are independent from the items used in course skills and users are 
only exposed to the quiz items during the assessment. This 
ensures that learners do not have the opportunity to learn the items 
in the assessment while studying course content and is important 
for test validity. Checkpoint Quiz items were designed by 
curriculum experts and Duolingo assessment scientists have 
conducted analyses to ensure their quality. 


Learners do not receive corrective feedback or a final grade for 
the assessment and may only take the quiz once. At each 
Checkpoint, learners complete a randomly generated quiz 
consisting of 15 items (sampled from a larger pool of items). 
Seven items are pre-test items that test the next unit of the course 
that the learner is about to start and another seven are post-test 
items that test the unit the learner just completed (critically, the 
same seven items the learner saw in the previous quiz as a pre- 
test). This pre-test / post-test design allows us to establish a 
baseline level of performance so we can later assess gain in 
accuracy from pre-test to post-test. The final item is a self- 
directed writing item designed to assess the current unit (with no 
pre-test). See Figure 1 for an illustration of Checkpoint Quiz 
design. 
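The quiz composition described above can be sketched as follows. This is an illustrative sketch, not Duolingo's implementation: the function and item names are hypothetical, the sketch assumes that the seven post-test items are passed in from the learner's previous quiz, and it does not model item ordering or the first Checkpoint, where no earlier pre-test exists.

```python
import random


def build_checkpoint_quiz(next_unit_pool, previous_pretest_items, writing_pool, rng):
    """Assemble a 15-item quiz: 7 pre-test items, 7 post-test items and 1 writing item."""
    return {
        # Pre-test: items on the unit the learner is about to start.
        "pre_test": rng.sample(next_unit_pool, 7),
        # Post-test: the same seven items the learner saw as a pre-test last time.
        "post_test": list(previous_pretest_items),
        # Writing: one free-response item on the unit just completed.
        "writing": rng.choice(writing_pool),
    }


def accuracy_gain(pre_correct, post_correct):
    """Gain in accuracy from pre-test to post-test on the same items."""
    return sum(post_correct) / len(post_correct) - sum(pre_correct) / len(pre_correct)


# Hypothetical item identifiers.
next_pool = [f"unit3-item-{i}" for i in range(1, 31)]
prev_pretest = [f"unit2-item-{i}" for i in range(1, 8)]
writing_pool = ["unit2-writing-1", "unit2-writing-2"]

quiz = build_checkpoint_quiz(next_pool, prev_pretest, writing_pool, random.Random(0))
gain = accuracy_gain([1, 0, 0, 1, 0, 0, 0], [1, 1, 1, 1, 0, 1, 0])
print(quiz["pre_test"], quiz["writing"], gain)
```

Because the post-test reuses the items from the earlier pre-test, the gain is measured on identical content rather than on items of possibly different difficulty.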


The assessment tests knowledge of vocabulary, grammar,
listening comprehension, reading comprehension, and free-form
writing using separate items designed to test one of these language
skills and components. Vocabulary and grammar items are a
combination of multiple choice and fill-in-the-blank questions
(i.e., learners type the missing word), listening and reading are
exclusively multiple choice, and writing questions are
free-response. Each item is accompanied by a set of curated tags for
grammatical concepts and communicative components.


3.3 Review Exercises

Review Exercises prompt learners to review content from a skill 
earlier in their course. A single Review Exercise is inserted into 
randomly selected lessons in the foundational level of a skill (only 
for skills beyond the first five in the course). These exercises are 
randomly and uniformly sampled from the pool of available 
exercises from either three skills or five skills earlier in the course. 
For example, randomly selected exercises from the Animals skill 
are injected into Level 0 lessons seen by learners studying the 
Places skill (see Figure 3). These exercises are inserted into the 
lesson in a random position, as long as it is not among the first 
two or last two exercises. Therefore, lessons with Review 
Exercises will be one exercise longer than a standard lesson. 
Review Exercises come in two forms: assisted recall and 
translation from L1-to-L2 or vice versa (see bottom row of Figure 
2). 


[Figure 3: a course path through the skills Basics 1, Phrases, Animals, Food, Family, Transport, Travel, and Places. An exercise sampled from the Animals skill is injected into a lesson in the Places skill.] 

Figure 3. Review Exercise design for testing five skills earlier 
in course. 
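
The insertion rules described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not Duolingo's implementation: the function names, the 0-based skill indexing, and the choice to draw the three-or-five offset at random for each lesson are all hypothetical.

```python
import random

REVIEW_OFFSETS = (3, 5)  # source skill is three or five skills earlier
MIN_SKILL_INDEX = 5      # 0-based index: the first five skills get no Review Exercise


def pick_source_skill(skill_index, rng):
    """Return the 0-based index of the skill to review, or None if ineligible."""
    if skill_index < MIN_SKILL_INDEX:
        return None
    # Assumption: the offset is drawn uniformly from the two allowed values.
    return skill_index - rng.choice(REVIEW_OFFSETS)


def insert_review_exercise(lesson, source_pool, rng):
    """Return a copy of `lesson` with one exercise from `source_pool` inserted.

    The inserted exercise never lands among the first two or last two
    positions, so the returned lesson is exactly one exercise longer.
    """
    if len(lesson) < 4:
        raise ValueError("lesson too short to keep two exercises on each side")
    review = rng.choice(source_pool)            # uniform over the available pool
    position = rng.randint(2, len(lesson) - 2)  # randint bounds are inclusive
    return lesson[:position] + [review] + lesson[position:]
```

With the skill order of Figure 3, Places has index 7, so `pick_source_skill(7, rng)` returns either 2 (Animals, five skills earlier) or 4 (Family, three skills earlier).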


Review Exercises as a form of assessment have a number of 
advantages over the Checkpoint Quiz: 1) Review Exercises are 
available in all courses; 2) they allow us to measure learning at 
every skill in a course, rather than just at unit-terminal 
Checkpoints; and 3) they provide an order of magnitude more data 
than Checkpoint Quizzes. 


However, Review Exercises have a few disadvantages compared with the 
Checkpoint Quiz. One key disadvantage is that the items used for 
Review Exercises overlap with the items used in regular skill lessons; 
therefore, we sacrifice some test validity in order to be able to use 
the assessment at scale across all courses and all skills in a given 
course. Further, the sentences used as Review Exercises have 
not been assessed for their quality as measures of learning. 
Another disadvantage is that the data is not tagged for 
grammatical concepts or communicative components, which 
limits the insights this assessment can provide for informing 
curriculum design. 


Table 1. Key differences between the Checkpoint Quiz and 
Review Exercises. 

Checkpoint Quiz                              | Review Exercises 
---------------------------------------------+------------------------------------------------- 
Slow data collection (only at Checkpoints)   | Fast data collection (at every skill in a course) 
Tagged and calibrated by curriculum experts  | Not tagged or calibrated 
Items siloed from course                     | Items sampled from course 
Only certain courses                         | All courses 


4. CASE STUDY 


Learners on Duolingo use the platform in a variety of 
ways. In this case study, we investigate how learning decisions 
impact outcomes, so that we can “nudge” learners to use the app 
more effectively. 


This case study demonstrates how EDM methodologies allow us 
to investigate the various ways that learners choose to navigate 
through the platform - focusing on differences in “leveling up” 
behavior - and how course navigation relates to learning 
outcomes. We show a correlation between leveling up and higher 
accuracy on the Checkpoint Quiz. Complementary modeling with 
Review Exercise data establishes a causal link between 
completing sessions in higher levels and accuracy on assessments. 


4.1 Checkpoint Quiz 
4.1.1 Data 


Our work uses four months of Checkpoint Quiz data. For every 
learner completing at least two consecutive Checkpoint Quizzes 
within this timeframe, we collected the pre-test / post-test item 
response pairs (e.g., the pre-test responses collected at Checkpoint 
1 and the corresponding post-test responses collected at 
Checkpoint 2) as well as summary statistics on learners’ studying 
behavior in the unit the items assess (e.g., number of lessons 
completed at each level across skills in Unit 2, number of Stories 
completed between pre-test and post-test). Responses to free-form 
writing items were not included in this analysis. 


4.1.2 Methods 

To isolate the impact of lessons completed at each level on 
Checkpoint Quiz outcomes, we built a logistic regression model to 
predict post-test scores for items that were answered incorrectly in 
the pre-test (a measure of learning gain). Primary variables of 
interest capture the number of lessons learners completed at a 
given level for each skill in the unit of interest (frequency counts 
for Level 1 through Level 4; e.g., a learner completed 20 Level 1 
lessons, 15 Level 2 lessons, etc.). Although Duolingo has five 
levels for all skills (starting with Level 0), we exclude counts for 
the foundational level because all learners must complete the 
same number of Level 0 lessons to finish a unit. The model 
controls for item and user covariates: language component of the 
item (e.g., vocabulary), unit (e.g., Unit 2), course (e.g., French for 
English Speakers), number of sessions completed for other types 
of study material (e.g., Stories, generalized practice, test-outs), 
self-reported prior proficiency (0-10), and subscriber status¹ (non- 
paying or paying learner). 
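
As a rough illustration of how the outcome variable for such a model could be assembled, the sketch below pairs pre-test and post-test responses and keeps only the items that were missed on the pre-test. The data layout (dictionaries keyed by learner and item) and the function name are assumptions made for this example; the covariates listed above would be joined onto these rows before fitting the regression.

```python
def learning_gain_rows(pre_responses, post_responses):
    """Pair pre-/post-test responses and keep items missed on the pre-test.

    Both arguments map (learner_id, item_id) -> bool, where True means the
    response was correct. Returns (learner_id, item_id, post_correct) rows;
    post_correct is the 0/1 outcome a logistic regression would predict.
    """
    rows = []
    for key, pre_correct in pre_responses.items():
        if pre_correct or key not in post_responses:
            continue  # keep only items answered incorrectly at pre-test
        learner_id, item_id = key
        rows.append((learner_id, item_id, int(post_responses[key])))
    return rows
```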


4.1.3 Results 

We found that average post-test item accuracy increases linearly 
with every skill-level completed (Figure 4). In other words, each 
additional level completed across all skills increases the odds of 
answering a Checkpoint Quiz item correctly by the end of the 
unit. 


[Figure 4: plot of the Proportion of Correct Post-Test Item Responses (y-axis) against the Number of Skill-Levels Completed (x-axis, 25 to 125), with one series per Pre-Test Item Response (Correct, Incorrect).] 

Figure 4. Average post-test accuracy by the number of skill- 
levels completed as a function of pre-test accuracy. 


This finding was supported by the results of our logistic 
regression model (summarized in Figure 5). We observed that the 
probability of answering a post-test item correctly increases with 
every additional lesson in Levels 1, 2, and 4. Level 3 has a 
negative coefficient, but this is likely an artifact of variable 
suppression². 
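
To make the scale of such coefficients concrete, the following sketch converts a per-lesson log-odds coefficient into an odds multiplier and a shifted probability. The coefficient value used in the usage note is hypothetical, chosen only because it lies within the axis range of Figure 5; it is not an estimate reported by the model.

```python
import math


def odds_multiplier(coefficient, extra_lessons=1):
    """Factor by which the odds of a correct answer change.

    In a logistic regression, each unit increase in a predictor adds
    `coefficient` to the log-odds, i.e. multiplies the odds by exp(coefficient).
    """
    return math.exp(coefficient * extra_lessons)


def shifted_probability(baseline_probability, coefficient, extra_lessons=1):
    """Probability of a correct answer after `extra_lessons` more lessons."""
    odds = baseline_probability / (1.0 - baseline_probability)
    new_odds = odds * odds_multiplier(coefficient, extra_lessons)
    return new_odds / (1.0 + new_odds)
```

For instance, with a hypothetical coefficient of 0.005 and a baseline probability of 0.5, ten additional lessons multiply the odds by exp(0.05), which moves the probability to roughly 0.51.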


[Figure 5: coefficient plot with one row per session type (Level 1, Level 2, Level 3, Level 4, Skill Practice, Generalized Practice, Stories) and an x-axis ranging from -0.010 to 0.010.] 

Figure 5. Checkpoint Quiz logistic regression model output. 
Coefficients of the number of times a user completed seven 
different session types in a model including other user and 
item covariates (see Section 4.1.2). 


¹ Duolingo offers a paid subscription that removes ads, allows 
offline access, and includes additional features and learning 
modes. All learners have access to the same course content. 


² Because learners tend to complete the same number of lessons in 
Levels 3 and 4, we attributed the negative coefficient to the 
statistical consequence of highly collinear relationships existing in 
the correlation matrix, which can cause variable suppression and 
model instability [8]. To verify that this multicollinearity did not 
result in model instability, we repeatedly fit the model on 
bootstrapped samples of the original data. We found that small 
changes to the data do not cause any erratic changes in the 
coefficients, so we concluded that our model estimates are stable. 


We also compared the magnitudes of the leveling up effects with 
those of other types of learning modes, specifically Stories 
(interactive dialogues to practice reading and listening skills), skill 
practice, and generalized practice (see Section 2 for more details 
about these learning modes). Coefficients capturing leveling up 
behavior show dominant effects in the model; one additional skill- 
level has a greater impact on Checkpoint Quiz scores than one 
additional Story, skill practice, or generalized practice. 


The Checkpoint Quiz findings show that providing learners with 
multiple difficulty levels to practice study material improves 
learning outcomes. Further, we found evidence that completing 
lessons at Levels 1, 2, and 4 is not only positively associated with 
learning outcomes, but is more positively associated than any 
other activity. However, the Checkpoint Quiz analysis is not 
necessarily causal. The findings could also be due to self-selection 
biases, wherein the type of learner that is motivated to complete 
additional (non-required) levels is likely to perform better in 
general. A complementary analysis is required to establish a 
causal link. 


4.2 Review Exercises 

We utilized Review Exercise data to establish a causal link 
between leveling up and better learning outcomes. Review 
Exercises are better suited to this complementary analysis than the 
Checkpoint Quiz because each Review Exercise targets material 
from a single source lesson. This design allows us to compare 
learners who exhibit the same studying behavior except for the 
completion of one additional level for that lesson. Isolating the 
change in accuracy from one additional level means that we have 
controlled for self-selection biases and can interpret the change as 
causal. 


4.2.1 Data 

For the Review Exercise analysis, we collected all Review 
Exercises completed over the course of approximately two 
months. Data comes from all Duolingo courses. Along with 
Review Exercise response accuracy, we collect important control 
variables: whether the exercise came from 3 or 5 skills earlier in 
the course, exercise type, and the skill the exercise was sampled 
from (see Figure 3 for Review Exercise design). 


4.2.2 Methods 

Using logistic regression and a regression discontinuity design 
(RDD) [3, 6], we are able to model the impact of completing 
higher levels on Review Exercise accuracy while controlling for 
self-selection bias that may occur for learners who choose to level 
up vs. those who do not. An RDD is a quasi-experimental 
approach where a synthetic treatment condition is assigned to 
observations that fall above or below a certain “cut-off” point. We 
achieve this by first identifying learners who have completed any 
lessons at a given level for the skill a Review Exercise was 
sampled from (e.g., learners who have completed at least one 
Level 1 lesson). Among those learners, we define a cut-off point 
to compare those who have completed that level for the Review 
Exercise source lesson (e.g., Level 1) to those who have 
completed that level for the lesson that immediately precedes the 
source lesson but who have not yet completed that level for the 
source lesson itself (e.g., preceding lesson to Level 1, but source 
lesson to Level 0). This approach controls for most potential self- 
selection bias in deciding to level up (all comparisons include 
learners who have chosen to level up the skill) and can provide 
stronger evidence for a causal relationship between leveling up 
and Review Exercise accuracy. 


We created a variable with eight levels for use in the regression 
model to capture 1) the highest level a learner has leveled up the 
Review Exercise source lesson to and 2) whether the learner 
studied the source lesson to the same level as the preceding lesson 
(e.g., both at Level 1) or studied the source lesson one time less 
than the preceding lesson (e.g., preceding lesson at Level 2 but 
source lesson at Level 1). For example, this scheme yields 
coefficients of the form Level 1:Same Level, indicating 
learners for whom both the source lesson and preceding lesson 
were at Level 1, or Level 1:Lower Level, indicating 
learners for whom the source lesson was at Level 1 and the 
preceding lesson was at Level 2. This coding scheme required 
excluding certain observations. Cases where the learner had 
completed the highest level possible for the Review Exercise 
source lesson (i.e., Level 4) are not included because it is 
impossible for the lesson preceding the source lesson to be leveled 
up any higher. We also exclude observations where the source 
lesson is the first lesson of a skill because there will be no 
preceding lesson to serve as a control comparison. 
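
The coding scheme can be summarized as a small mapping from a pair of levels to one of the eight categories. The sketch below is an illustrative reading of the description above; the function name is hypothetical, and treating any other level gap (for example, a preceding lesson two levels ahead) as excluded is an assumption, since the text does not say how such cases were handled.

```python
MAX_LEVEL = 4  # levels run from 0 (foundational) to 4


def rdd_category(source_level, preceding_level):
    """Return the regression category label, or None if the case is excluded.

    `preceding_level` is None when the source lesson is the first lesson
    of its skill, so there is no preceding lesson to compare against.
    """
    if preceding_level is None:
        return None  # no preceding lesson to serve as the control comparison
    if source_level >= MAX_LEVEL:
        return None  # the preceding lesson cannot be leveled up any higher
    if preceding_level == source_level:
        return f"Level {source_level}:Same Level"
    if preceding_level == source_level + 1:
        return f"Level {source_level}:Lower Level"
    return None  # assumption: other gaps fall outside the coding scheme
```

Enumerating all level pairs from 0 to 4 yields exactly eight distinct labels, Level 0 through Level 3 crossed with Same Level and Lower Level.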


In addition to this main variable, we also control for other factors 
that influence Review Exercise accuracy: the number of skills 
away from the source skill (three or five), the exercise type of 
the Review Exercise (L1-to-L2 translation or vice versa), and the 
difficulty of the source skill. We defined difficulty of source skills 
by computing the log-odds of answering a Review Exercise 
correctly in each skill in the data overall³. This allows us to 
control for the fact that, all else being equal, accuracy is likely to 
be lower overall for Review Exercises sampled from more 
difficult skills, which increases the power of the analysis. 
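
The smoothed log-odds used as the difficulty control is simple to compute. The sketch below follows the formula given in the footnote, log((correct + 1) / (incorrect + 1)); the input format of (skill, is_correct) pairs and the function names are assumptions for illustration. Note that under this definition, lower values correspond to more difficult skills.

```python
import math
from collections import defaultdict


def empirical_log_odds(correct, incorrect):
    """Add-one smoothed log-odds of a correct response."""
    return math.log((correct + 1) / (incorrect + 1))


def skill_log_odds(responses):
    """Map each skill to its empirical log-odds of a correct response.

    `responses` is an iterable of (skill, is_correct) pairs.
    """
    counts = defaultdict(lambda: [0, 0])  # skill -> [correct, incorrect]
    for skill, is_correct in responses:
        counts[skill][0 if is_correct else 1] += 1
    return {skill: empirical_log_odds(c, i) for skill, (c, i) in counts.items()}
```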


4.2.3 Results 

If leveling up causes higher Review Exercise accuracy, we 
would expect the Level N:Same Level (source lesson 
and preceding lesson to Level N) coefficients to be significantly 
larger than the Level N-1:Lower Level (source lesson one 
level lower than preceding lesson; Levels N-1 and N, 
respectively) coefficients. Such an effect would indicate that - 
controlling for leveling up behavior overall - completing higher 
levels of the lesson a Review Exercise came from yields 
significant improvements in Review Exercise accuracy. 


Figure 6 summarizes the results of our logistic regression model. 
We can see that Level 1:Same Level is significantly higher 
than Level 0:Lower Level. This effect indicates that 
learners who have studied a Review Exercise source lesson twice 
(at Level 0 and Level 1) are more likely to provide a correct 
response on their Review Exercise than learners who have studied 
a Review Exercise source lesson once (only at Level 0) but 
already had studied the previous lesson twice (at Level 0 and 
Level 1). This result provides evidence for a causal relationship 
between leveling up study material and assessment performance, 
at least for the first time learners level up. The model shows 
similar trends for leveling up beyond Level 1 (e.g., Level 
2:Same Level is numerically higher than Level 1:Lower 
Level), suggesting this relationship continues to exist as users 
study the Review Exercise source lesson additional times 
(although perhaps with diminishing returns). 


³ Empirical log odds defined as log((correct + 1) / (incorrect + 1)). 


The regression results also show significant differences between 
Level 0:Same Level / Level 0:Lower Level and 
Level 1:Same Level / Level 1:Lower Level. 
Although the learners captured in the Lower Level coefficients 
had not leveled up the source lesson to Level 1, we see clear 
improvements in Review Exercise accuracy stemming from 
leveling up any lessons preceding the source lesson. These 
learners will not have had additional opportunity to study the 
exact exercise used for the Review Exercise, but the content and 
concepts in other lessons in the skill will have been related. 
Therefore, the benefit of studying in one lesson transfers to other 
lessons. 


[Figure 6: coefficient plot with one row per category (Level 0:Same Level, Level 0:Lower Level, Level 1:Same Level, Level 1:Lower Level, Level 2:Same Level, Level 2:Lower Level, Level 3:Same Level, Level 3:Lower Level) and an x-axis ranging from -0.8 to 0.0.] 

Figure 6. Review Exercises model output. Coefficients of 
leveling up behavior in a model including other item 
covariates (see Section 4.2.2). 


5. CONCLUSIONS 


In a case study of the levels mechanic, wherein learners study 
content in increasingly difficult contexts by “leveling up”, 
complementary analyses of the Checkpoint Quiz and Review 
Exercises showed that completing sessions in higher levels leads 
to stronger performance on assessments. Analyzing accuracy rates 
on the Checkpoint Quiz by the number of skill-levels completed 
in the course unit revealed a strong positive trend. Because 
variation in how learners navigate the platform may introduce 
self-selection bias and complicate interpretation of these results, 
we conducted an additional analysis of Review Exercises that 
controlled for this bias. The Review Exercises analysis supports a 
causal link between leveling up and improved assessment 
performance, showing that completing additional levels for a skill 
(beyond the foundational level) has measurable learning value. 


Together, these results directly motivated the implementation of a 
number of interventions that encourage learners to reach higher 
levels. For example, because learner awareness of the existence 
and purpose of levels was relatively low, we added design 
elements that give learners a visual stand-in for how the levels 
system works. Learners also now receive a pop-up with a redirect 
button upon finishing a level prompting them to start the next 
level in the skill. Randomized controlled experiments (i.e., A/B 
tests) introducing these changes showed >10% increases in the 
number of lessons completed in each level beyond the required 
foundational level and significantly more studying activity on the 
app overall. These interventions exemplify how insights from the 
Checkpoint Quiz and Review Exercises have lasting impact on the 
Duolingo learning experience. 


This study focused on one type of variation in how learners 
choose to navigate the Duolingo learning platform, namely 
leveling up. Learners can additionally choose their own study 
sequence for the skills (e.g., completing all the levels in a skill 
before starting the next skill, completing the entire course unit one 
level at a time, leveling up clusters of skills within a unit), as well 
as which types of learning material to study (e.g., course skills, 
generalized practice, Stories). Future iterations of this work will 
aim to capture such variation, thereby improving model fit and 
deepening our understanding of how other types of navigational 
choices relate to learning outcomes. Previous EDM studies [1, 9] 
provide methodologies that can be used to characterize this 
variation. 


Future work will also continue to explore the utility and 
limitations of the Review Exercise assessment system. For 
example, data from Review Exercises show promise as a method 
for measuring learning improvements over the course of an A/B 
test due to the high volume of daily data generated, highly 
localized measurement (i.e., testing learning of content from 
specific course skills), and the distributed nature of the assessment 
(i.e., testing learning in all course skills). Future work could also 
consider whether Review Exercise accuracy can be predicted 
based on engagement with (and accuracy on) source lessons in the 
past. 


Self-directed learning platforms such as Duolingo require accurate 
and well-controlled assessments to measure learner achievement. 
Because learners exercise a high degree of agency in how they 
navigate the courses, achievement cannot be adequately assessed 
by analyzing exercise responses alone. Duolingo developed two 
forms of assessment - the Checkpoint Quiz and Review Exercises 
- to capture insights about how different study approaches relate 
to learning outcomes. Applying EDM techniques to these 
assessments yields useful insights that inform our understanding 
of how the navigation of course content relates to learning 
outcomes and how we can leverage these insights to improve the 
learning experience on the platform. 


ABSTRACT 


Today, there is a vast amount of online material for learners. 
However, because the prerequisites needed to master this material 
are rarely stated, learners spend a lot of time identifying the right 
learning content for mastering a given concept. A system that 
captures underlying prerequisites needed for learning differ- 
ent concepts can help improve the quality of learning and can 
save time for the learners as well. In this work, we propose an 
unsupervised approach, UPreG, for automatically inferring 
prerequisite relationships between different concepts using 
NLP techniques. Our approach involves extracting the concepts 
from unstructured text in MOOC (Massive Open 
Online Course) descriptions, measuring semantic relatedness 
between the concepts, and statistically inferring the 
prerequisite relationships between related concepts. We con- 
ducted both qualitative and quantitative studies to validate 
the effectiveness of our proposed approach. As there are no 
ground truth labels for these prerequisite relations, we con- 
ducted a user study for the evaluation of the prerequisite 
relations. We build the concept graph using these prerequisite 
relations and demonstrate a few examples of the learning maps 
generated from the graph. The learning maps provide prerequisite 
information and learning paths for different concepts. 


Keywords 


Prerequisite relation, Text mining, Learning path 


1. INTRODUCTION 


In today’s fast-paced world, skill development and a strong 
foundation in fundamental concepts are becoming crucial 
for career growth. MOOCs, offering a wide variety of 
courses online, are becoming ubiquitous among learners 
interested in acquiring knowledge and becoming competent 
in their field of interest. In this journey, learners need 
to know the order in which they must learn different con- 
cepts to attain a good level of mastery in a specific topic. 
Knowing the prerequisites when learning a topic improves 
the learning experience of learners and is influential to the 
learner’s achievements [20]. Prerequisite concepts define the 
concepts one must know or understand first before attempt- 
ing to learn or understand something new. 


With the increasing amount of educational data available, 
automatic discovery of concept prerequisite relations has be- 
come both an emerging research opportunity and an open 
challenge. There is a growing interest today in researching 
different techniques for automatically inferring the prerequisite 
relations between concepts [17][20]. Various solutions, 
such as curriculum planning [23], learning assistants [10], and 
automated reading list generation [9], have been developed 
based on such techniques. 


Prerequisites at the course-level have been manually curated 
by experts and this helps find prerequisite relations between 
the concepts covered within the courses. For example, con- 
cepts in a course on Optimization are prerequisites to con- 
cepts in a course on Deep Learning. An example in this sce- 
nario would be the Gradient Descent algorithm being a pre- 
requisite for understanding the Backpropagation algorithm 
used in Deep Neural Networks. Such relations created man- 
ually will not scale in real-world online applications. Mod- 
ern applications today support learning content from a wide 
variety of domains and cater to learners from multiple edu- 
cational backgrounds. Manual processes for creating prereq- 
uisite relations in such applications are expensive and time- 
consuming. Hence, it is necessary to develop solutions that 
can infer prerequisite relations using automated approaches. 


In this work, we propose an unsupervised approach, UPreG, 
for automatically inferring prerequisite relationships 
between different concepts using NLP techniques. We built 
a concepts graph capturing the concepts and the prerequisite 
relation between them. Concepts here refer to technologies, 
programming languages, tools, and topics in the Software 
and Computer Science domain. The concepts graph can be 
leveraged to find the right content for the learner, including 
the prerequisite content. We conducted both qualitative 
and quantitative studies to validate the effectiveness of our 
proposed approach. As there are no ground truth labels for 
these prerequisite relations, we conducted a user study for 
the evaluation of the prerequisite relations. We observed 
that our approach is effectively able to infer the prerequisite 
relations between concepts. The approach can be extended 
to other domains as well. 


This paper is structured as follows. We present the related 
work in Section 2. In Section 3, we describe our approach 
for concept graph generation, followed by the evaluation 
methodology and results in Section 4. In Section 5, 
we discuss the challenges we encountered while building the 
concepts graph. Finally, Section 6 concludes with future 
work. 


2. RELATED WORK 


Pan et al. [17] propose a learning-based method for latent 
representations of course concepts. They defined various fea- 
tures and trained a classifier that can identify prerequisite 
relations among concepts. Roy et al. [20] proposed PRE- 
REQ, a supervised learning method for inferring concept 
prerequisite relations. The approach uses latent representa- 
tions of concepts obtained from the Pairwise Latent Dirichlet 
Allocation model, and a neural network based architecture. 
They assumed that concept prerequisites are available to 
train a supervised model. Yu et al. [24] present an improved 
version, PREREQ-S, which introduces students’ video watch order 
to enhance the video dependency network. They sorted 
the watched videos of each student by time and used these 
sequences to replace the video sequences. They apply 
two simple DNN models, which first encode the embeddings 
of the concept pairs and then train an MLP to classify the 
prerequisite ones. Alzetta et al. [3] applied a deep learning- 
based approach for prerequisite relation extraction between 
educational concepts of a textbook. Lu et al. [13] proposed 
an iterative prerequisite relation learning framework, iPRL, 
which combines a learning based model and recovery based 
model to leverage both concept pair features and dependen- 
cies among learning materials. Liang et al. [12] addressed 
the problem of recovering concept prerequisite relations from 
university course dependencies. They [11] further applied 
active learning to the concept prerequisite learning problem. 
Pal et al. [16] proposed an approach to find the order of 
concepts from textbooks using a rule-based method. Prior 
work assumes that prerequisite relationship pairs are available as 
ground truth and applies a supervised learning approach. However, 
acquiring labeled prerequisite pairs is time-consuming 
and expensive. Moreover, a major drawback of supervised 
learning is that it does not perform well across domains 
[16]. To the best of our knowledge, we are the first to apply 
an unsupervised approach to extract prerequisite relationships 
for the software domain. 


3. APPROACH 


In this section, we discuss our approach to build the concepts 
graph. It is a directed graph where nodes represent the 
concepts and the edges between nodes represent the prereq- 
uisite relationship between them. Our approach in building 
the concepts graph involves concept representation, measur- 
ing semantic similarity between the concepts and identifica- 
tion of the prerequisite relationship between them. 
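
A minimal sketch of such a graph, and of reading a learning path from it, is shown below. The class and method names are hypothetical and the depth-first traversal is just one plausible way to order prerequisites; the paper does not specify how its learning maps are derived from the graph. The example edge (Gradient Descent before Backpropagation) is the one given in the introduction.

```python
from collections import defaultdict


class ConceptGraph:
    """Directed graph in which an edge points from a prerequisite to a concept."""

    def __init__(self):
        self._prereqs = defaultdict(set)  # concept -> set of direct prerequisites

    def add_prerequisite(self, prerequisite, concept):
        self._prereqs[concept].add(prerequisite)
        self._prereqs.setdefault(prerequisite, set())  # register the node itself

    def learning_path(self, target):
        """Return all prerequisites of `target` in study order, ending with `target`."""
        path, seen = [], set()

        def visit(concept):
            if concept in seen:
                return  # already placed; also guards against cycles
            seen.add(concept)
            for prerequisite in sorted(self._prereqs.get(concept, ())):
                visit(prerequisite)
            path.append(concept)  # appended only after its prerequisites

        visit(target)
        return path
```

After `graph.add_prerequisite("Gradient Descent", "Backpropagation")`, calling `graph.learning_path("Backpropagation")` lists Gradient Descent first and Backpropagation second.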


3.1 Concept Representation 

The descriptions of the courses in MOOCs contain rich infor- 
mation about the concepts that will be taught to the learn- 
ers. Many courses do not have annotated course tags to rep- 
resent the concepts taught in the course. It is very expensive 
and time-consuming to manually create course tags from the 
course content [13]. Hence, the concepts must be extracted 
from the course content using text mining approaches. We 
collected course metadata from different MOOCs (Udemy 
and edX) and our internal Learning Management System. 
We apply Latent Dirichlet Allocation (LDA) [4], a topic 
modeling algorithm on each course description to extract 
the concepts. The algorithm generates a topical distribu- 
tion for each course description. To determine the most 
relevant topic that represents the concepts a course covers, 
the topic with the highest probability from the distribution is 
selected. After performing several iterations, we found that 
setting k=5 (number of topics to be extracted) gave the best 
results. We extract a total of 9750 unique concepts. 
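
The selection step that follows topic modeling can be illustrated without fitting a model. The sketch below assumes that LDA has already produced, for one course description, a probability for each of the k topics and a ranked term list per topic; the function name, the `top_n` cut-off, and the data layout are all hypothetical.

```python
def dominant_topic_concepts(topic_distribution, topic_terms, top_n=10):
    """Pick the most probable topic and return its leading terms as concepts.

    `topic_distribution[k]` is the probability of topic k for one course
    description, and `topic_terms[k]` is that topic's terms, ranked from
    most to least relevant.
    """
    if len(topic_distribution) != len(topic_terms):
        raise ValueError("need exactly one term list per topic")
    best_topic = max(range(len(topic_distribution)),
                     key=lambda k: topic_distribution[k])
    return best_topic, topic_terms[best_topic][:top_n]
```

With k=5, `topic_distribution` holds five probabilities, and the terms of the winning topic become the course's candidate concepts.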


3.2 Semantic Similarity Between Concepts 

The semantic similarity score between two concepts measures 
how semantically related they are. Concepts that appear in the 
same context or appear together very often have higher semantic 
similarity scores. Semantic similarity computation eliminates 
noise present in the results of the topic modeling algorithm and 
reduces the possibility of weak relations in the concepts graph. 
It is also useful in prerequisite relation identification, as 
concepts appearing in similar contexts are more likely to be 
identified as having a prerequisite relation. This 
improves the selection of candidates in the concepts graph. 


Course title: JavaScript: Understanding the Weird Parts 

Course description: 

JavaScript is the language that modern developers need to know, and know well. truly 
knowing JavaScript will get you a job, and enable you to build quality web and server 
applications. note: this course includes information on ECMAScript 6 (es6) the next 
version of JavaScript! 

in this course you will gain a deep understanding of JavaScript, learn how JavaScript 
works under the hood, and how that knowledge helps you avoid common pitfalls and 
drastically improve your ability to debug problems. you will find clarity in the parts that 
others, even experienced coders, may find weird, odd, and at times incomprehensible. 
you'll learn the beauty and deceptive power of this language that is at the forefront of 
modern software development today. This course will cover such advanced concepts as 
objects and object literals, function expressions, prototypical inheritance, functional 
programming, scope chains, function constructors (plus new es6 features), immediately 
invoked function expressions (iifes), call, apply, bind, and more. we'll take a deep dive 
into the source code of popular frameworks such as jQuery and underscore to see how 
you can use your understanding of JavaScript to learn (and borrow) from other's good 
code. finally, you'll learn the foundations of how to build your own JavaScript framework 
or library. What you'll learn in this course will make you a better JavaScript developer, 
and improve your abilities in AngularJS, NodeJS, jQuery, react, ember, mongo dB, and all 
other JavaScript-based technologies! learn to love JavaScript, and code in it well. note: in 
this course you'll also get downloadable source code. you will often be provided with 
'starter' code, giving you the base for you to start writing your code, and 'finished' code 
to compare your code to ......... 

Objectives: 

"Grasp how JavaScript works and it's fundamental concepts" 
"Write solid, good JavaScript code" 
"Understand advanced concepts such as closures, prototypal inheritance,..." 
"Drastically improve your ability to debug problems in JavaScript." 
"Avoid common pitfalls and mistakes other JavaScript coders make" 
"Understand the source code of popular JavaScript frameworks" 
"Build your own JavaScript framework or library" 

Concepts extracted from topic modelling: 
JavaScript, jQuery, array, string, dom, event, library, ajax, object, loop 

Course labels provided by Udemy: 
JavaScript 


Figure 2: Concepts generated for a JavaScript course 


[Figure 3: screenshot of two Stack Overflow questions, "What is the inverse of regularization strength in Logistic Regression? How should it affect my code?" and "scikit-learn return value of LogisticRegression.predict_proba", each shown with its tags (python, machine-learning, scikit-learn, logistic-regression; the second question is also tagged probability).] 

Figure 3: Stack Overflow questions and tags 


To measure semantic similarity between the concepts we 
compute Pointwise Mutual Information (PMI) and Word2Vec 
cosine similarity scores. The semantic similarity scores between 
the concepts are computed as the weighted average of 
the two scores. 
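
The combination step can be sketched as follows. The weight is not reported in the text, so the default of 0.5 below is purely an assumption, as are the function names; the cosine similarity is the standard definition applied to two embedding vectors.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0  # convention for this sketch: a zero vector matches nothing
    return dot / (norm_u * norm_v)


def semantic_similarity(pmi_score, cosine_score, pmi_weight=0.5):
    """Weighted average of the PMI score and the embedding cosine score."""
    if not 0.0 <= pmi_weight <= 1.0:
        raise ValueError("pmi_weight must lie in [0, 1]")
    return pmi_weight * pmi_score + (1.0 - pmi_weight) * cosine_score
```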


3.2.1 Pointwise Mutual Information 

PMI gives a measure of concept association used in informa- 
tion theory [6]. It gives a measure of how likely two concepts 
would occur together when compared to their independent 
occurrences in the data. For computing the PMI of con- 
cept pairs, tags of Stack Overflow questions obtained from 
Stack Overflow data dumps were used. The author posting 
a question on Stack Overflow is asked to provide tags as- 
sociated with the posted question (as shown in Figure 3). 
Tags that appear often together across all the questions are 
likely to be strongly related. The higher the score between
two concepts, the more similar they are. We assume that
concepts occurring together have some correlation over
a large set of pairs. To compute the PMI scores, we leverage
a Stack Overflow dump consisting of 1,000,000 Stack
Overflow questions along with their tags [21]. The PMI score
between any two concepts $c_1$ and $c_2$ is defined as:

\[
\mathrm{PMI}(c_1, c_2) = \max\left(0,\ \frac{\log\left[p(c_1) \cdot p(c_2)\right]}{\log p(c_1, c_2)} - 1\right) \tag{1}
\]

Here $p(c_1, c_2)$ is the probability of co-occurrence of concepts
$c_1$ and $c_2$. It is the fraction of Stack Overflow questions in which
concepts $c_1$ and $c_2$ co-occur as tags. $p(c_1)$ and $p(c_2)$ are the
probabilities of the independent occurrence of concepts $c_1$ and
$c_2$ as tags across all Stack Overflow questions. The score
obtained is a normalized score that takes values between 0
and 1. This ensures PMI and Word2Vec similarity scores
have the same scale when taking their weighted average.


3.2.2 Word2Vec Embeddings 


Raw word frequency is not a great measure of association
between words. One problem is that raw frequency is very
skewed and not very discriminative. It also does not capture
the kinds of contexts shared between the words, which word
embedding techniques capture [2]. We apply the Word2Vec
approach to learn semantic relatedness between concepts. The
Word2Vec model is based on the intuition that words which
are similar in context appear closer in the word embedding
space. The Word2Vec algorithm uses a neural network model
to learn word associations from a large corpus of text. We
use the skip-gram model [15] to learn word embeddings, which
are low-dimensional vector representations of the extracted
concepts. The network is trained on a text corpus of course
descriptions and objectives covering 64,150 courses, and it
generates 300-dimensional word embeddings. We train the
model using the Python library gensim [18] with default
parameters. Some of the Word2Vec similarity scores between
concepts are captured in Table 2. The Word2Vec (W2V)
similarity score between two concepts is computed as the
cosine similarity between their word embeddings.
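\[
\mathrm{W2V}(c_1, c_2) = \frac{\mathbf{c}_1 \cdot \mathbf{c}_2}{\lVert \mathbf{c}_1 \rVert \, \lVert \mathbf{c}_2 \rVert} \tag{2}
\]
Here $\mathbf{c}_1$ and $\mathbf{c}_2$ represent the 300-dimensional embedding vectors
of concepts $c_1$ and $c_2$.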


Finally, we compute the similarity score as a weighted av- 
erage of the above two scores. For simplicity, we set the 
weights to 0.5. 


\[
\mathrm{Sim}(c_1, c_2) = w_1 \cdot \mathrm{W2V}(c_1, c_2) + w_2 \cdot \mathrm{PMI}(c_1, c_2) \tag{3}
\]
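The scoring pipeline of this section can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the PMI term is taken to be the normalized form clipped at zero (which yields values in [0, 1] as the text requires), the function names are hypothetical, and the embedding vectors are plain lists rather than gensim objects.

```python
import math
from collections import Counter
from itertools import combinations


def npmi_scores(tagged_questions):
    """Normalized PMI, clipped at 0, for every tag pair that co-occurs.

    `tagged_questions` is a list of tag lists, one per question.
    Assumption: PMI = max(0, log[p(a) p(b)] / log p(a, b) - 1).
    """
    n = len(tagged_questions)
    single, pair = Counter(), Counter()
    for tags in tagged_questions:
        unique = sorted(set(tags))
        single.update(unique)
        pair.update(combinations(unique, 2))

    scores = {}
    for (a, b), count in pair.items():
        p_ab = count / n
        if p_ab >= 1.0:
            # Degenerate case: both tags appear on every question, so
            # log p(a, b) = 0. Treated as uninformative in this sketch.
            scores[(a, b)] = 0.0
            continue
        p_a, p_b = single[a] / n, single[b] / n
        scores[(a, b)] = max(0.0, math.log(p_a * p_b) / math.log(p_ab) - 1.0)
    return scores


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(u, v))
    norm = math.sqrt(sum(x * x for x in u)) * math.sqrt(sum(y * y for y in v))
    return dot / norm if norm else 0.0


def similarity(pmi, w2v, w1=0.5, w2=0.5):
    """Weighted average of the Word2Vec and PMI scores."""
    return w1 * w2v + w2 * pmi
```

With equal weights, a pair with a PMI score of 1.0 and a Word2Vec score of 0.5 receives a combined score of 0.75.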


We observed that extracted concepts can appear with dif- 
ferent representations in the Stack Overflow question tags. 
Examples include synonymous pairs such as node.js and 
nodejs, javascript and js, mvc and model view controller, 
etc. To identify such instances, we use the Stack Overflow
synonym tag API [22] and identify the matching or synonymous
concepts in the Stack Overflow tags. We also filter
out irrelevant concepts having no occurrence or synonyms 
in the Stack Overflow tags. After this process, we end up 
with 5200 concepts. During the computation of probabili- 
ties for PMI scores, we also consider the occurrence count 
of the synonyms. For example, when computing PMI be- 
tween javascript and any other concept, we compute the 
independent and co-occurrence probabilities by counting oc- 
currences of both javascript and js tags in the Stack Overflow 
questions. 
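A minimal sketch of the synonym-aware counting described above follows. The synonym map here is a small hand-written stand-in (an assumption for illustration); in the paper the synonyms come from the Stack Overflow synonym tag API, whose response format is not reproduced here.

```python
from collections import Counter
from itertools import combinations

# Hypothetical synonym map (synonym -> canonical tag), for illustration only.
SYNONYMS = {"js": "javascript", "nodejs": "node.js"}


def canonical_tags(tags, synonyms=SYNONYMS):
    """Map every tag to its canonical form and drop duplicates."""
    return {synonyms.get(tag, tag) for tag in tags}


def tag_counts(tagged_questions, synonyms=SYNONYMS):
    """Count single-tag and tag-pair occurrences after merging synonyms."""
    single, pair = Counter(), Counter()
    for tags in tagged_questions:
        unique = sorted(canonical_tags(tags, synonyms))
        single.update(unique)
        pair.update(combinations(unique, 2))
    return single, pair
```

A question tagged with both javascript and js counts once for javascript, so merging synonyms does not inflate the occurrence probabilities.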


3.3 Identifying Concept Relation 

In this section, we explain the process of identifying the 
prerequisite relationship between different concepts. We 
only consider the concept pairs with high semantic similar- 
ity scores. It is very likely that concept pairs that have very 
low semantic similarity scores are not related at all and we 
can ignore such pairs. For example, it is not useful to learn 
the relationship between Neural Network and PHP which 
are not related and occur in different domains (deep learn- 
ing and web development respectively). However, it would 
be interesting to study the concept pairs Gradient Descent 
and Backpropagation which are algorithms used in machine 
learning and share high semantic similarity scores. Inferring 
the relation that Gradient Descent is a prerequisite of
Backpropagation and not vice-versa would be useful. To infer
such relations, we make use of Wikipedia articles. For each
pair of concepts with high semantic similarity (threshold of
0.5), we compute the concept relevancy scores. For concepts
$c_1$ and $c_2$, we measure how often the concept $c_1$ is referred to in
the Wikipedia article of concept $c_2$ and vice-versa. Based on
the concept relevancy scores, we can infer the prerequisite 
relation. For example, we know that Java is a prerequisite of 
Spring Boot. So, it is quite possible that in an explanation 
for Spring Boot (a Java Web framework), the concept Java 
would be mentioned more often when compared to the con- 
cept Spring Boot being mentioned in an explanation about 
Java. Algorithm 1 captures the steps to identify the prereq- 
uisite relation between concepts. 


Algorithm 1 Prerequisite relation inference between concepts

Input: Pair of concepts $c_i$ and $c_j$ which are strongly related,
and Wikipedia knowledge articles.
Output: Relationship between concept pairs (prerequisite
relationship), i.e., $c_i$ is a prerequisite of $c_j$ or vice-versa
1: Tokenize the knowledge articles for all the concepts
   ($C_k$), where $C_k$ is the set of concepts
2: for ordered pair of concepts ($c_i$, $c_j$) do
3:    Compute Concept Relevancy Scores (CRS) for ordered
      pairs ($c_i$, $c_j$) as

      \[ \mathrm{CRS}(c_i, c_j) = \frac{TF(c_j \in D_i)}{V(D_i, D_j)} \]

      \[ \mathrm{CRS}(c_j, c_i) = \frac{TF(c_i \in D_j)}{V(D_i, D_j)} \]

      where $c_i$ and $c_j$ are the concepts for which CRS is
      computed, $D_i$ and $D_j$ are the Wikipedia articles for
      concepts $c_i$ and $c_j$ respectively, $TF(c_i \in D_j)$ captures
      the term frequency for concept $c_i$ in Wikipedia article
      $D_j$, $TF(c_j \in D_i)$ captures the term frequency for
      concept $c_j$ in Wikipedia article $D_i$, and $V(D_i, D_j)$ is the
      normalization term that captures the total vocabulary
      in articles $D_i$ and $D_j$.
4:    If $\mathrm{CRS}(c_i, c_j) > \mathrm{CRS}(c_j, c_i)$, then $c_j$ is a prerequisite
      of $c_i$, and vice-versa
5: end for
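The core of Algorithm 1 can be sketched as follows. This is an illustrative reading rather than the authors' implementation: the tokenizer, the phrase-matching rule, and the function names are assumptions, and the normalization term is taken to be the size of the combined vocabulary of the two articles. The decision rule follows the Java and Spring Boot example in the text: the concept that is mentioned more often in the other concept's article is taken to be the prerequisite.

```python
import re


def tokenize(text):
    """Lowercase word tokenizer (an assumption; the paper does not specify one)."""
    return re.findall(r"[a-z0-9+#]+", text.lower())


def count_mentions(concept, tokens):
    """Count occurrences of a possibly multi-word concept in a token list."""
    target = tokenize(concept)
    if not target:
        return 0
    k = len(target)
    return sum(1 for i in range(len(tokens) - k + 1) if tokens[i:i + k] == target)


def infer_prerequisite(concept_a, article_a, concept_b, article_b):
    """Return (prerequisite, dependent), or None when the scores tie."""
    tokens_a, tokens_b = tokenize(article_a), tokenize(article_b)
    vocabulary = len(set(tokens_a) | set(tokens_b))  # assumed form of V
    if vocabulary == 0:
        return None
    a_in_b = count_mentions(concept_a, tokens_b) / vocabulary
    b_in_a = count_mentions(concept_b, tokens_a) / vocabulary
    if a_in_b > b_in_a:
        return (concept_a, concept_b)  # a is referenced more in b's article
    if b_in_a > a_in_b:
        return (concept_b, concept_a)
    return None
```

Because both scores share the same normalization term, the comparison reduces to comparing raw mention counts; the normalization only matters if scores are compared across different pairs.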


Table 1: Data collected from online learning platforms

Platform     | # Courses | Categories
Udemy        | 13601     | Software development, and design
Edx          | 1072      | Software development
Internal LMS | 49202     | Software development, and design

3.4 Learning Maps 

The identified prerequisite relation pairs were used to build
the concept graph. The concept graph has 1325 concepts
and 1868 edges. We use the networkx [8] Python library to build
the concept graph. We pass the adjacency list created from
the identified concept-prerequisite pairs as an input to the
library. The edges in the graph are directed from the
concept node to the prerequisite node. The learning maps
are built for each concept in the graph using the depth-first
search (DFS) algorithm. They are represented as the DFS trees
generated by the algorithm. To visualize the learning maps
we use the d3.js force layout [5]. In visualizing the learning maps
we reverse the edge direction, i.e., from prerequisite node to
concept node. This makes the prerequisites easier to identify
in the learning maps.
The learning maps for the concepts Blockchain and Java
Spring framework are shown in Figure 4. The root node,
colored in blue, represents the main concept, and all nodes
below the root node, colored in orange, represent the
concepts that are prerequisites for the main concept. The child
nodes represent the prerequisite concepts for their parent
node.

Table 2: Semantic similarity scores from Word2Vec Embeddings and PMI

C1               | C2              | PMI Scores | W2V Scores
Hadoop           | Hive            | 0.67       | 0.43
MongoDB          | NoSQL           | 0.64       | 0.72
JavaScript       | jQuery          | 0.44       | 0.68
JavaScript       | NodeJS          | 0.57       | 0.61
Neural Network   | Backpropagation | 0.74       | 0.61
Blockchain       | Cryptocurrency  | 0.34       | 0.73
Inheritance      | Polymorphism    | 0.54       | 0.62
ASP.NET          | Java            | 0.17       | 0.08
NodeJS           | Promise         | 0.54       | 0.21
ASP.NET          | C#              | 0.20       | 0.42
Hadoop           | Java            | 0.1        | 0.23
SVM              | Classification  | 0.62       | 0.43
RDBMS            | SQL             | 0.19       | 0.37
Machine Learning | Linear Algebra  | 0.18       | 0.49
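The construction of a learning map as a DFS tree over the concept graph can be sketched without any third-party dependency. The paper builds the graph with networkx; the standard-library version below is a stand-in with a hypothetical function name and a toy adjacency list, in which each concept maps to its prerequisites.

```python
def learning_map(graph, root):
    """Return the DFS tree rooted at `root` as {node: [children]}.

    `graph` maps each concept to a list of its prerequisite concepts
    (edges point from concept to prerequisite).
    """
    tree = {root: []}
    visited = {root}
    stack = [(root, iter(graph.get(root, ())))]
    while stack:
        node, neighbours = stack[-1]
        for nxt in neighbours:
            if nxt not in visited:
                visited.add(nxt)
                tree[node].append(nxt)   # tree edge: node -> prerequisite
                tree[nxt] = []
                stack.append((nxt, iter(graph.get(nxt, ()))))
                break
        else:
            stack.pop()                  # all prerequisites explored
    return tree


# Toy concept graph (illustrative data, not taken from the paper).
concept_graph = {
    "Spring Boot": ["Java", "MVC"],
    "Java": ["OOP"],
    "MVC": ["OOP"],
    "OOP": [],
}
spring_map = learning_map(concept_graph, "Spring Boot")
```

Since a DFS tree visits each node once, a prerequisite shared by several concepts (OOP in the toy graph) appears under only the first parent that reaches it.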


Table 3: Extracted prerequisite relation between concepts

C1 (prerequisite)   | C2
Distributed systems | Mapreduce
Probability         | Logistic Regression
Encryption          | Cryptography
Smart Contract      | Ethereum
Backpropagation     | Neural Networks
Regression          | Neural Networks
JavaScript          | NodeJS

4. EVALUATION AND RESULTS 
4.1 Datasets 


We collected metadata about various courses from MOOC 
platforms and our internal Learning Management System 
(LMS) using REST APIs. We fetched data from categories 
relevant to Software Development and Design. The distri- 
bution of the number of courses fetched from different plat- 
forms is shown in Table 1. There are 13,600 courses from 
Udemy, 1,050 courses from edX, and 49,500 courses from 
our LMS in the Software Development and Design category. 
The output from the REST APIs was in JSON format and 
each had a different schema. Hence, we selected MongoDB, 
a NoSQL database to store the retrieved data. 


We apply text pre-processing on course metadata. Specif- 
ically, the course descriptions from Udemy contain HTML 
tags. We parse the HTML tags in course descriptions us- 
ing Beautiful Soup [19]. We remove stopwords and apply 
Lemmatization and Stemming to reduce words to their base 


forms. We also create custom stopwords manually by ana- 
lyzing the topic modeling output. We stored pre-processed 
data in MongoDB for further processing and evaluation. 


4.2 Evaluating extracted concepts 

We apply Latent Dirichlet Allocation (LDA), a topic mod- 
eling algorithm to infer topics from the course descriptions. 
We extract five topics from each course description. Each 
topic is a vector representation that not only indicates the 
words belonging to the topic but also the probability of the 
words belonging to the topic. From the topical distribution 
for the course description, the words from the topic with 
maximum probability were considered and stored against 
each course metadata as tags in the database. Figure 2 
shows the description and the tags obtained for a Javascript 
course in Udemy. 


To evaluate the concepts extracted from the course description,
we apply the overlap coefficient to measure the similarity
between the concepts extracted from the course description
and the concepts tagged by Udemy. The overlap coefficient,
or Szymkiewicz–Simpson coefficient, is a similarity
measure that measures the overlap between two finite sets
[1]. It is related to the Jaccard index and is defined as the
size of the intersection divided by the smaller of the sizes of
the two sets. Mathematically, we define the concept overlap
coefficient as


\[
\mathrm{concept\_overlap}(X, Y) = \frac{1}{N} \sum \frac{|X \cap Y|}{\min(|X|, |Y|)} \tag{4}
\]


where concept_overlap(X, Y) captures the average concept
overlap between two sets X and Y, X is the set of concepts
extracted from topic modeling, Y is the set of concepts tagged in
course descriptions of the Udemy dataset, and N is the number
of course descriptions in the dataset. We observed the
average concept overlap coefficient to be 0.97. This shows
that the concepts extracted by the topic modeling algorithm
capture the relevant concepts covered in the
course quite well. Udemy's course description contains a maximum of
two tagged concepts. We further analyzed how well our approach
is able to identify the other concepts from course
descriptions that are not captured in Udemy's concept tags. We
performed a quantitative analysis with 20 Subject Matter
Experts (SMEs). The SMEs have between 5 and 10 years of
experience and have worked on different technologies in
IT companies. We randomly sampled 100 courses offered on
Udemy and provided five courses to each SME along with
the inferred concepts for each course. The SMEs were asked to
indicate whether these inferred concepts
are relevant for the course or not. We computed the accuracy
considering the SMEs' responses as true labels. We observed
the accuracy of inferred concepts to be 0.73.

Figure 4: Learning maps for Blockchain and Java Spring framework
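The overlap coefficient used in this evaluation is straightforward to compute. The sketch below uses hypothetical function names and toy concept sets; it averages the per-course coefficient over all courses, which is how the average concept overlap is described in the text.

```python
def overlap_coefficient(x, y):
    """Szymkiewicz-Simpson overlap: |X ∩ Y| / min(|X|, |Y|)."""
    x, y = set(x), set(y)
    if not x or not y:
        return 0.0  # convention chosen for this sketch when a set is empty
    return len(x & y) / min(len(x), len(y))


def mean_overlap(pairs):
    """Average overlap over (extracted, tagged) concept-set pairs."""
    return sum(overlap_coefficient(x, y) for x, y in pairs) / len(pairs)


# Toy data: (concepts from topic modeling, concepts tagged by the platform).
courses = [
    ({"javascript", "jquery", "dom"}, {"javascript"}),
    ({"python", "numpy"}, {"python", "pandas"}),
]
average = mean_overlap(courses)
```

Because the denominator is the size of the smaller set, a course with one or two platform tags scores 1.0 as soon as those tags appear among the extracted concepts, regardless of how many additional concepts were extracted. This is why a separate expert evaluation of the additional concepts is informative.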


4.3 Evaluating concept Prerequisite Relations 
There are no ground truth labels available for inferred pre- 
requisite relationships. To assess the effectiveness of pre- 
requisite pairs generated by our approach, we conducted a 
quantitative analysis with 25 SMEs to identify if a concept
$c_1$ is a prerequisite for another concept $c_2$. We created five
groups with 5 SMEs in each group. We randomly sampled
250 concept prerequisite pairs. Each group is provided with 
50 concept prerequisite pairs. We used the Majority voting 
approach to aggregate their responses. We computed the 
accuracy of these pairs considering the SME’s response as 
ground truth labels. We observed the accuracy of concept 
prerequisite pairs to be 0.82. We also measure inter-rater 
agreement amongst experts using Fleiss’ Kappa [14]. Fleiss’ 
Kappa is a statistical measure for assessing the reliability of 
agreement between a fixed number of raters when assigning 
categorical ratings to a number of items or classifying items. 
If the raters are in complete agreement then $\kappa = 1$. If there
is no agreement among the raters (other than what would
be expected by chance) then $\kappa \le 0$. We observed the $\kappa$
coefficient to be 0.74, which indicates a level of strong agreement
among the raters. We believe some level of disagreement
may be due to the fact that prerequisites can be subjective 
[12] i.e. it is difficult to get consensus for some pairs of con- 
cepts. Different individuals may have different experiences 
of acquiring knowledge on specific topics, and this may lead 
to different opinions of the prerequisite requirement for a 
topic. Some of the extracted prerequisite relationships are 
shown in Table 3. 
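The two aggregation steps used in this evaluation, majority voting and Fleiss' Kappa, can be sketched as below. The function names and the toy ratings are illustrative assumptions; the Kappa computation follows the standard definition (mean per-item agreement corrected by chance agreement) and assumes every item is rated by the same number of raters.

```python
from collections import Counter


def majority_vote(labels):
    """Most frequent label among the raters of one item."""
    return Counter(labels).most_common(1)[0][0]


def fleiss_kappa(ratings):
    """Fleiss' Kappa for a list of per-item label lists of equal length."""
    n_items = len(ratings)
    n_raters = len(ratings[0])
    categories = sorted({label for item in ratings for label in item})
    counts = [[Counter(item)[c] for c in categories] for item in ratings]

    # Observed agreement: share of agreeing rater pairs, averaged over items.
    per_item = [
        (sum(c * c for c in row) - n_raters) / (n_raters * (n_raters - 1))
        for row in counts
    ]
    p_observed = sum(per_item) / n_items

    # Chance agreement from the overall category proportions.
    proportions = [
        sum(row[j] for row in counts) / (n_items * n_raters)
        for j in range(len(categories))
    ]
    p_chance = sum(p * p for p in proportions)

    if p_chance == 1.0:
        return 1.0  # every rating falls in one category; treated as full agreement
    return (p_observed - p_chance) / (1.0 - p_chance)
```

With five raters and a binary label (prerequisite or not), as in the study design, a majority always exists, so no tie-breaking rule is needed.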


5. CHALLENGES 


We faced the following challenges while building the
concept graph.


1. For some concepts extracted from the course description,
we had disambiguation issues when looking them up in
Wikipedia. For example, Java can refer to a programming
language or an island in Indonesia. To deal with
this issue, we pass the extracted concepts to the Google
Search API [7] and fetch the Wikipedia article that is
ranked highest in the search results. Due to the popularity
of these software concepts, we observe that relevant
results were returned by picking the highest-ranked
Wikipedia article from the search queries.


2. Our inference of prerequisite relationships is based on
reference scores computed from the Wikipedia articles of
the concepts. These scores may not always provide accurate
results. It is possible that the articles for some
concepts have high reference scores for concepts
that are derived from them, and not vice-versa.


6. CONCLUSIONS AND FUTURE WORK 


In this paper, we proposed an approach to infer prerequisite
relations between concepts and build the concept graph.
The proposed method does not require manually annotated
data, which is the major drawback of supervised learning
approaches. We use relevant data sources in different steps
to incorporate rich semantic information and infer
prerequisite relations accurately. To validate our results,
we performed both quantitative and qualitative evaluations.
The identified concept prerequisite pairs were evaluated by
subject matter experts. We observed an accuracy of 0.82
for the inferred prerequisite relations. We built the concept
graph from the prerequisite relation pairs and demonstrated
a few examples of the learning maps generated from the
concept graph. Learning maps can be used in many applications,
ranging from content-based recommendation systems
to more sophisticated online tutoring systems. As future
work, we plan to extend our research by creating a personalized
curriculum planner system that captures the concepts
learners currently know and what they want to learn. By
leveraging this information, the system will create a personalized
learning plan for them using their input information
and prerequisite relations. Although our approach is not
limited to the software domain, we plan to carry out further
studies and experimentation to measure the system's
generalization to other domains.
ABSTRACT


We propose an adaptation of the Glicko-2 rating system in 
a K-12 math learning software setting, where variable time 
intervals between solution attempts and the stratification 
of student-item pairings by grade levels necessitate modifi- 
cation of the original model. The discrete-time stochastic 
process underlying the original system has been modified 
into a continuous-time process to account for the irregular- 
ity of intervals between solution attempts. Also, concep- 
tual prerequisite relationships between items were used to 
provide initial rating estimates that allow for rating values 
to be meaningfully compared across grade levels. Fitting 
the model using real student learning data results in rating 
value distributions successfully exhibiting a gradation with 
the increase of grade level. A potential area of application 
in a personalized education setting is also briefly discussed. 
1. INTRODUCTION


We consider the problem of assigning appropriate curricu- 
lum levels in a large-scale K-12 math learning software to 
students who are substantially ahead or behind their peers. 
Previous studies have suggested the importance of matching 
learning content difficulty to a student’s ability for positive 
student learning outcomes [3, 10, 16]. In light of this, stu- 
dents who are much farther ahead (e.g., gifted students) or 
behind their peers (e.g., students with learning disabilities) 
can benefit much from receiving a more tailored educational 
feedback, based on learner and skill models that can model 
their differences more effectively. 


With the recent advances in computing devices, various
approaches have been sought to harness the power of computing
to model learners more accurately in educational
contexts, as comprehensively overviewed in [2]. In one particular
line of approach [11, 13, 12, 15], dynamic paired comparison
models were used to quickly estimate student abilities
and item difficulties in a scalable manner. In these adaptations,
the players consist of students ("users") and units of
learning task (e.g., problem items, assignments), and each
solution attempt is conceptualized as a match between a student
and a learning task, in which the winner earns 1 point
and the loser earns 0 points (with no draw). The primary
advantage of such models over traditional IRT methodologies
is in their ability to compute ability estimates "on the
fly" [11] while retaining a similar mathematical structure to
IRT.


The problem occurs, however, when the dataset is strati- 
fied—i.e. when student-problem pairings can be grouped 
into distinct (or largely nonoverlapping) groups such that a 
problem’s rating cannot be adequately adjusted by a student 
outside the group to which it belongs. In a K-12 math learn- 
ing software, because students are only exposed to prob- 
lems appropriate for their grade level, grade levels serve as 
strata. Consequently, we cannot adequately tell how a stu- 
dent would perform outside of their regular grade level just 
by looking at the student’s rating value. See Fig. 1 for an 
illustration. 


Ideally, we would not have this problem by gathering enough 
learning data from a large number of students for 12+ years, 
during which they would work through all curricula offered 
by the product in sequence. However, in a commercial edu- 
cational software context where a user is not bound to use 
products from just one vendor, this is highly impractical. 


Hence we raise a question: is there a way to enforce rating 
values to reflect the relative positions of the strata, despite 
the absence of sufficient overlaps in students/items among 
them? One possible strategy is to initialize the ratings differ- 
ently for each stratum according to their relative positions, 
e.g., to initialize first-grade rating values to 100, second- 
grade rating values to 200, etc., and then let the dynamic 
paired comparison algorithm do the calibration within each 
grade level. But then how could we justify that the initial 
estimation done for all curricula is properly reflective of their 
actual difficulties relative to one another? 


Figure 1: An illustration of the impact of data stratification on the rating interpretability. As a result of
stratification, the distributions of rating values can overlap unreasonably much with each other, and the
corresponding mean rating values may not align with the actual order of grade levels.


Here, the key insight is that the partial ordering of mathematical
concepts due to prerequisite relationships provides
a basis for the division of concepts into grade-level curricula,
which then in turn stratifies the learning data. In the
K-12 math learning software used in our study, each problem
item is conceptualized as a particular instantiation of
a mathematical concept ("knowledge unit," or just "unit")
with specific values. These mathematical concepts have
prerequisite relationships defined among them, the collection of
which can be represented as a directed graph. We attempt
to employ these relationships to obtain statistically
interpretable and contextually appropriate estimations.


Specifically, our contribution is twofold: 1) modification of a 
dynamic paired comparison rating system model to account 
for imbalance in rating update frequencies between students 
and items, and 2) use of prerequisite relationships between 
concepts for rating initialization to achieve rating compara- 
bility between curriculum levels. We aim to yield, from a 
stratified dataset, a set of ratings that can be meaningfully 
compared across grade levels: where students and items in 
a lower grade level would generally have lower ratings than 
those in a higher grade level. 


The remainder of this paper is organized as follows.
Section 2 presents our particular adaptation of a dynamic paired
comparison model, including the details for incorporating 
the conceptual prerequisite information into rating initial- 
ization. Section 3 describes the dataset used for evaluating 
our model and presents our results. Section 4 discusses the 
potential for applying our model to assign grade levels for 
students far ahead or behind their peers, lists some of the 
limitations of our work, and suggests a few possible direc- 
tions for further research. 


2. MODEL 


The Glicko-2 rating system [7] falls under the family of dy- 
namic paired comparison models, along with the Glicko rat- 
ing system [6] (its predecessor) and the Elo rating system [4] 
(of which the two Glicko systems are extensions). Improv- 
ing upon its predecessor, the Glicko-2 rating system models 
the change in variance of player strength as another stochas- 
tic process, thereby accounting for the possibility of sudden 
changes in strength. More specifically, the algorithm models 
the change in player strength per unit time with a normal 
distribution with variance equal to the square of the rating 
volatility, whose logarithmic change per unit time is itself
normally distributed.


2.1 Continuous-time Glicko-2 Model 

The original Glicko-2 system presented in [7] assumes the 
underlying stochastic processes to be discrete-time, where 
the overall measurement period is discretized into time in- 
crements called “rating periods.” Within each rating period, 
the matches are assumed to occur simultaneously. However, 
because there is too much imbalance in the average number 
of matches between users and items, [7]’s recommendation 
of having 5-10 matches per rating period for every player is 
not feasible to implement in our application context. [15] 
has successfully worked around this limitation by constrain- 
ing each rating period to contain only one match, but the 
workaround did not account for an increase in rating un- 
certainty due to the passage of time, which is a key feature 
of the Glicko rating system family. Here, we take the ap- 
proach of modifying the Glicko-2 model under a continuous- 
time stochastic process framework, so that the model can 
account for rating uncertainty increase due to the passage 
of time without discretizing the measurement period. 


Let $\theta_s(t)$ denote the ability estimate of user $s$ at time $t$, and
let $\beta_i(t)$ denote the difficulty estimate of unit $i$ at time $t$.
Then as a result of using the continuous-time stochastic process
framework, the model equations for latent trait parameters
become

\[ \theta_s(t) \sim N\left(\mu_s(t), \phi_s^2(t)\right) \tag{1} \]
\[ \theta_s(t + \Delta t) \mid \theta_s(t), \sigma_s^2(t + \Delta t) \sim N\left(\theta_s(t), \Delta t \, \sigma_s^2(t + \Delta t)\right) \tag{2} \]
\[ \log \sigma_s^2(t + \Delta t) \mid \log \sigma_s^2(t), \tau^2 \sim N\left(\log \sigma_s^2(t), \tau^2\right) \tag{3} \]

for user ability estimates, and

\[ \beta_i(t) \sim N\left(\mu_i(t), \phi_i^2(t)\right) \tag{4} \]

for unit difficulty estimates. Here, as in [8], $\mu$ denotes rating,
$\phi$ denotes rating deviation (RD), and $\sigma$ denotes rating
volatility. Note that the difficulty of a mathematical concept
is expected to remain constant over time, so we do not
impose any stochastic volatility assumption on $\beta_i(t)$.


As for the correctness probability (i.e., the probability of
user $s$ correctly answering an instantiation of unit $i$ at time
$t$), the Glicko rating system family differs from the Elo rating
system in its incorporation of rating uncertainty to calculate
this quantity. We are generally interested in the correctness
probability before the user $s$ actually attempts unit $i$. However,
the time elapsed between the user's last attempt and
the current attempt can vary throughout the user's activity
history, which also varies the amount of inflation to apply
each time on the user's rating uncertainty, $\phi_s$. Hence we
apply equation (2) prior to calculating the correctness probability.
Let $t_s$ and $t_i$ denote the last time user and unit latent
trait estimates, respectively, were updated. Let $Y_{s,i}(t)$ be a
Bernoulli random variable denoting user response correctness.
Then the correctness probability is given by:

\[ \Pr\left(Y_{s,i}(t) = 1\right) = E\left(\mu_s(t), \mu_i(t), \phi_s^2(t) + \phi_i^2(t)\right) \tag{5} \]

where $E(\mu_1, \mu_2, \phi^2) = \left[1 + \exp\left(-g(\phi^2)(\mu_1 - \mu_2)\right)\right]^{-1}$ is the
expected score function that accounts for rating uncertainty
[7], and 


Here, we use $\sigma_s^2(t_s)$ in place of $\sigma_s^2(t)$ to estimate $\phi_s^2(t)$,
although their equivalence only holds in expectation.
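The correctness-probability calculation can be sketched as follows. Several pieces here are assumptions rather than statements from the text: the function g is taken to be the standard Glicko-2 form on the internal rating scale, the user's rating variance is assumed to grow linearly with elapsed time at the last known volatility (one reading of applying equation (2)), and the cap on the variance reflects the constraint that a rating deviation never exceeds that of a brand new user. All function and parameter names are hypothetical.

```python
import math


def g(phi_sq):
    """Glicko-2 attenuation factor for a rating variance (assumed standard form)."""
    return 1.0 / math.sqrt(1.0 + 3.0 * phi_sq / math.pi ** 2)


def expected_score(mu_a, mu_b, phi_sq):
    """Expected score of A against B, discounted by rating uncertainty."""
    return 1.0 / (1.0 + math.exp(-g(phi_sq) * (mu_a - mu_b)))


def inflate_variance(phi_sq_last, sigma_sq_last, elapsed, phi_sq_max):
    """Assumed inflation: variance grows by elapsed * volatility, up to a cap."""
    return min(phi_sq_last + elapsed * sigma_sq_last, phi_sq_max)


def correctness_probability(mu_s, phi_sq_s, sigma_sq_s, elapsed,
                            mu_i, phi_sq_i, phi_sq_max):
    """Probability that user s answers unit i correctly after `elapsed` time."""
    phi_sq_now = inflate_variance(phi_sq_s, sigma_sq_s, elapsed, phi_sq_max)
    return expected_score(mu_s, mu_i, phi_sq_now + phi_sq_i)
```

Under these assumptions, a longer gap since the user's last attempt inflates the variance and pulls the predicted probability toward 0.5, which is the intended effect of accounting for the passage of time.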


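After user $s$ finishes a solution attempt for unit $i$ with result
$y_{s,i} \in \{0, 1\}$, the update equations for latent trait estimates
are given as below, following [7]'s derivation of corresponding
equations under the continuous-time framework:

\[ \sigma_s^2(t) = \exp\left(\arg\max_{a(t)} \, p\left(a(t) \mid y_{s,i}\right)\right) \tag{6} \]

\[ \mu_s(t) = \mu_s(t_s) + \phi_s^2(t) \cdot g\left(\phi_i^2(t)\right) \cdot \left(y_{s,i}(t) - E_s(t)\right) \tag{9} \]
\[ \mu_i(t) = \mu_i(t_i) + \phi_i^2(t) \cdot g\left(\phi_s^2(t)\right) \cdot \left(\left(1 - y_{s,i}(t)\right) - E_i(t)\right) \tag{10} \]

In these equations, we have

• $E_s(t) = E\left(\mu_s(t), \mu_i(t), \phi_i^2(t)\right)$,

• $E_i(t) = E\left(\mu_i(t), \mu_s(t), \phi_s^2(t)\right)$,

• $v_s(t) = \left[g\left(\phi_i^2(t)\right)^2 E_s(t)\left(1 - E_s(t)\right)\right]^{-1}$, and

• $v_i(t) = \left[g\left(\phi_s^2(t)\right)^2 E_i(t)\left(1 - E_i(t)\right)\right]^{-1}$.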


Also, in equation (6), $p(a(t) \mid y_{s,i})$ is the marginal posterior
density function for $a(t) = \log \sigma_s^2(t)$, approximated using the
product of the following two normal density functions (here,
$\varphi(z; m, s^2)$ denotes the normal density function with mean
$m$ and variance $s^2$):


1. $\varphi\left(a(t); a(t_s), \tau^2\right)$, which comes from equation (3), and


2. $\varphi\left(\hat{\theta}_s(t); \mu_s(t_s), \phi_s^2(t_s) + (t - t_s)e^{a(t)} + v_s(t)\right)$, which is
the normal approximation of the marginal likelihood
distribution of $\theta_s(t)$, whose mode is denoted with $\hat{\theta}_s(t)$.


The latter normal density function features the quantity
$\left(\hat{\theta}_s(t) - \mu_s(t_s)\right)$, which is approximated in [6] using
first-order Taylor expansion.


Finally, note that to prevent a rating deviation from becom- 
ing arbitrarily large, the quantity is constrained in equa- 
tions (7) and (8) to never exceed the value for a brand new 
user/unit, just like how it was done in [5]. 


2.2 Initial Parameter Estimation 

To address the stratification issue mentioned in the intro- 
duction, the user and unit ratings are differentially initial- 
ized based on their respective curricula. Instead of setting 
each curriculum’s initial rating value arbitrarily, we want 
the values to reflect more closely our prior knowledge of the 
distributions of concepts within each curriculum. 


We find this prior knowledge in our proprietary conceptual 
precedence graph, where units are represented as nodes (ver- 
tices) in a directed graph. Each edge (u,v) in the graph is 
interpreted as: “An instance of unit u is being used as a step 
in solving an instance of unit v.” Hence unit u corresponds 
to a prerequisite concept that a user must have mastered 
before being able to successfully master unit v. 


The key idea in our usage of the graph is that a question 
item (corresponding to a specific knowledge unit) that in- 
volves one or more steps to solve must in general be harder 
than any of the steps themselves. Hence we assign each unit 
with a non-negative integer value, which we call “depth,” in 
such a way that for every edge, the tail node is assigned 
with a lower depth value than the head node. This way, a 
concept appearing in a higher grade level would in general 
correspond to a higher depth value (since they would gener- 
ally incorporate lower-level curriculum concepts as prereq- 
uisites), making the depth values roughly signify how “in- 
depth” the corresponding concepts are. See Fig. 2 for an 
illustration. 


We also seek to differentiate among units with no parents 
(i.e., concepts with no prerequisites) by imposing that the
depth difference between a unit and its successor be as small 
in magnitude as possible, while still ensuring that every unit 
has a strictly greater depth value than any of its parents. 


Figure 2: Illustration of assigning depth values to
knowledge units in a simple conceptual precedence
graph. Knowledge units are represented as nodes
(gray ovals). On the right of each oval, a red circle
shows the corresponding depth value assigned.


Figure 3: Three instances of simple cycles in the
conceptual precedence graph used in our study, which
all belong to one strongly connected component. We
found that cycles exist mostly due to the presence of
"gateway units" (shown in cyan ovals), whose main
role is to select which concept to apply from multiple
related concepts.


From a graph theory perspective, the problem of assigning
depth values can be formulated as a variant of the layer
assignment problem on a directed acyclic graph $G = (V(G), E(G))$
with minimal dummy vertices, formally stated as the
following integer linear program (ILP):

\[
\begin{aligned}
\min \quad & \sum_{(u,v) \in E(G)} d(v) - d(u) \\
\text{s.t.} \quad & d(v) - d(u) \ge 1 \quad \forall (u,v) \in E(G) \\
& d(v) \in \mathbb{Z}_{\ge 0} \quad \forall v \in V(G)
\end{aligned}
\tag{11}
\]

(here, $d(v)$ denotes the depth value assigned to node $v$). For
a general overview of the layer assignment problem and its
variations, readers are referred to Section 13.3 of [9].
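To make the constraints of the depth assignment concrete, the sketch below computes a feasible assignment with longest-path layering. This is a deliberately simpler technique than solving the ILP: it guarantees non-negative integer depths and a span of at least one on every edge, but it does not in general minimize the total edge span, so it should be read as a baseline rather than the method used in the paper. Function names are hypothetical.

```python
from collections import deque


def longest_path_depths(nodes, edges):
    """Feasible depths on a DAG: depth(v) = longest path from any source to v."""
    successors = {v: [] for v in nodes}
    indegree = {v: 0 for v in nodes}
    for u, v in edges:
        successors[u].append(v)
        indegree[v] += 1

    depth = {v: 0 for v in nodes}
    queue = deque(v for v in nodes if indegree[v] == 0)
    processed = 0
    while queue:                      # Kahn's topological traversal
        u = queue.popleft()
        processed += 1
        for v in successors[u]:
            depth[v] = max(depth[v], depth[u] + 1)
            indegree[v] -= 1
            if indegree[v] == 0:
                queue.append(v)
    if processed != len(nodes):
        raise ValueError("graph contains a cycle; condense it first")
    return depth


def total_edge_span(depth, edges):
    """Objective value of the ILP for a given depth assignment."""
    return sum(depth[v] - depth[u] for u, v in edges)


# Toy graph: a -> b -> c and d -> c.
toy_nodes = ["a", "b", "c", "d"]
toy_edges = [("a", "b"), ("b", "c"), ("d", "c")]
toy_depth = longest_path_depths(toy_nodes, toy_edges)
```

On the toy graph, longest-path layering places the parentless unit d at depth 0 for a total span of 4, whereas raising d to depth 1 remains feasible and lowers the span to 3. This is the kind of differentiation among parentless units that the minimization objective provides.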


Two challenges arise in initializing rating values through 
solving the depth assignment problem. The first challenge is 
that our conceptual precedence graph could contain cycles, 
such as the ones shown in Fig. 3. To address this challenge, 
we assign the same depth value to all units in the same 
strongly connected component (SCC), noting that any directed 
cycle is strongly connected. In implementation terms, this 
corresponds to solving the ILP given in (11) on the conceptual 
precedence graph’s condensation, which is a directed acyclic 
graph formed by contracting each SCC into one node. 
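
The condensation step can be sketched as follows. This is a minimal standard-library version using Kosaraju's two-pass algorithm, which is an assumption on our part (the text does not say which SCC algorithm was used). The function name and the toy graph are hypothetical.

```python
from collections import defaultdict

def condensation(nodes, edges):
    """Return (scc_of, dag_edges): an SCC index per node and the edges between SCCs."""
    succ, pred = defaultdict(list), defaultdict(list)
    for u, v in edges:
        succ[u].append(v)
        pred[v].append(u)

    # Pass 1: record nodes in order of DFS completion on the original graph.
    order, seen = [], set()
    for root in nodes:
        if root in seen:
            continue
        seen.add(root)
        stack = [(root, iter(succ[root]))]
        while stack:
            node, it = stack[-1]
            for nxt in it:
                if nxt not in seen:
                    seen.add(nxt)
                    stack.append((nxt, iter(succ[nxt])))
                    break
            else:  # all successors explored
                order.append(node)
                stack.pop()

    # Pass 2: sweep the reversed graph in decreasing completion order.
    scc_of, count = {}, 0
    for root in reversed(order):
        if root in scc_of:
            continue
        scc_of[root] = count
        stack = [root]
        while stack:
            node = stack.pop()
            for prev in pred[node]:
                if prev not in scc_of:
                    scc_of[prev] = count
                    stack.append(prev)
        count += 1

    # Contract each SCC to one node; edges inside an SCC disappear.
    dag_edges = {(scc_of[u], scc_of[v]) for u, v in edges if scc_of[u] != scc_of[v]}
    return scc_of, dag_edges

# Hypothetical toy graph in which q and r form a cycle.
nodes = ["p", "q", "r", "s"]
edges = [("p", "q"), ("q", "r"), ("r", "q"), ("r", "s")]
scc_of, dag_edges = condensation(nodes, edges)
```

Here q and r receive the same SCC index, so they would share one depth value, and the resulting three-node graph is acyclic.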


The second challenge in assigning depths to nodes on the 
conceptual precedence graph is that the graph (and thus also 
its condensation) may consist of multiple weakly connected 
components (WCCs), which are subgraphs whose underly- 
ing undirected graphs are connected. The above ILP assigns 
depth values relative only to other SCCs in the same WCC, 
so additional steps must be taken to equate the depth value 
distributions for each curriculum across all WCCs. In par- 
ticular, we label each SCC with the lowest-level curriculum 
that features at least one of its constituent units. Next, we 
take the smallest number of WCCs that together contain all 
curriculum labels. We call this collection of WCCs reference 
WCCs. Afterward, we offset the depth value for each SCC in 
every non-reference WCC to be at least the minimum depth 
value of all SCCs in the reference WCCs that are labeled 
with the same curriculum. 


Once the adjusted depth values for all SCCs (and thereby 
all units) are thus computed, each curriculum’s depth value 
is set to be the average depth value of all units in the cur- 
riculum. 


Below is a summary of the procedure for assigning a depth 
$d(k)$ to each curriculum $k \in X = \{1, \ldots, K\}$: 


1. Let $G = (V(G), E(G))$ be our conceptual precedence 
graph, which is a directed graph such that each node 
$v \in V(G)$ is associated with a curriculum $\chi(v) \in X$. 


2. Condense $G$ to yield a directed acyclic graph 
$C = (V(C), E(C))$. 


3. Let $W_1, \ldots, W_n$ be the WCCs of $C$, ordered from 
largest to smallest. 


4. For each $W_i = (V(W_i), E(W_i))$, solve the ILP given in 
(11) to yield pre-adjustment depth values $d_{\text{init}}(S)$ for 
each SCC $S$. 


5. Label each SCC $S$ with a curriculum 

$$\chi_{\min}(S) = \min_{v \in S} \chi(v).$$


6. Let $A = \{W_1, \ldots, W_r\}$ be the reference WCCs (defined 
above), such that $r$ is minimized; i.e., choose no more 
WCCs than necessary. 


7. For each curriculum $k \in X$, let 

$$d_{\min}(k) = \min \Bigl\{ d(S) \Bigm| \chi_{\min}(S) = k,\ S \in \bigcup_{i=1}^{r} V(W_i) \Bigr\}.$$


8. For each $W_i = W_{r+1}, \ldots, W_n$, adjust the depth value 
$d(S)$ of each SCC $S \in W_i$ to be at least $d_{\min}(\chi_{\min}(S))$. 
However, do so in a way that the adjusted depth values 
still satisfy the constraints of the ILP given in (11). 


9. We now have the adjusted depth values for every SCC 
$S \in V(C)$. For each SCC $S$, let $d(v) = d(S)$ for all 
$v \in S$. 


10. For each $k \in X$, let 

$$d(k) = \operatorname{mean} \{ d(v) \mid v \in V(G),\ \chi(v) = k \}.$$
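
Steps 3, 7, and 8 above can be sketched as follows. Two assumptions are ours and not stated in the text: the number of reference WCCs is passed in rather than computed, and each non-reference WCC is shifted upward by a single constant, which is one simple way to keep the constraints of (11) satisfied. The names `d_init` and `chi_min` and the toy data are hypothetical.

```python
def weakly_connected_components(nodes, edges):
    """Group nodes into WCCs with union-find, largest component first."""
    parent = {v: v for v in nodes}

    def find(v):
        while parent[v] != v:
            parent[v] = parent[parent[v]]  # path halving
            v = parent[v]
        return v

    for u, v in edges:  # edge direction is ignored
        parent[find(u)] = find(v)
    groups = {}
    for v in nodes:
        groups.setdefault(find(v), []).append(v)
    return sorted(groups.values(), key=len, reverse=True)

def adjust_depths(wccs, num_reference, d_init, chi_min):
    """Shift every non-reference WCC upward by one constant offset."""
    # Step 7: minimum depth per curriculum label over the reference WCCs.
    d_min = {}
    for wcc in wccs[:num_reference]:
        for s in wcc:
            k = chi_min[s]
            d_min[k] = min(d_min.get(k, d_init[s]), d_init[s])
    # Step 8: a uniform shift leaves every difference d(v) - d(u) unchanged,
    # so the edge constraints in (11) remain satisfied.
    adjusted = dict(d_init)
    for wcc in wccs[num_reference:]:
        gaps = [d_min[chi_min[s]] - d_init[s] for s in wcc if chi_min[s] in d_min]
        shift = max([0] + gaps)
        for s in wcc:
            adjusted[s] = d_init[s] + shift
    return adjusted

# Hypothetical condensation with two WCCs: 0 -> 1 -> 2 and 3 -> 4.
scc_nodes = [0, 1, 2, 3, 4]
scc_edges = [(0, 1), (1, 2), (3, 4)]
d_init = {0: 0, 1: 1, 2: 2, 3: 0, 4: 1}   # per-WCC ILP solutions
chi_min = {0: 1, 1: 2, 2: 3, 3: 2, 4: 3}  # lowest curriculum per SCC
wccs = weakly_connected_components(scc_nodes, scc_edges)
adjusted = adjust_depths(wccs, 1, d_init, chi_min)
```

In this toy case the smaller WCC starts at curriculum 2, so it is shifted up by 1 to line up with the curriculum-2 depth in the reference WCC.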


We now set the initial rating of each user $s$ or unit $i$ 
associated with curriculum $k$ as follows: 

$$\mu_s(0) = \mu_{\min} + \alpha \, d(k) \tag{12}$$
$$\mu_i(0) = \mu_{\min} + \alpha \, d(k) \tag{13}$$

where the quantities $\mu_{\min}$ and $\alpha$ are hyperparameters 
to be optimized. 
3. EVALUATION 


We evaluate our model using a dataset consisting of stu- 
dent practice records from January 2016 to December 2019 
through our adaptive software used in math learning cen- 
ters located throughout the United States. Students are 
given problems to practice based on their current grade level 
and the content areas where they struggle. The data con- 
sists of 5,179,493 records of 10,194 users’ combined attempts 
for problems associated with 7,513 knowledge units, ranging 
from Grade 2 concepts to Algebra 2 concepts. When a stu- 
dent gets a problem wrong in the first attempt, the student 
gets to make a second attempt for the same problem after 
being walked through the steps; in our analysis, however, 
only the first attempt’s result was considered. 


For the Glicko-2 model hyperparameters, we used the values 
suggested in [8]: 350.0 for the initial RD (in Glicko-1 scale; 
[8] shows how to convert between the two scales) and 0.06 for 
the initial user volatility. In the case of $\tau$, for which a 
range of values is suggested, we used 0.5. The time elapsed from 
one attempt to the next, used in rating uncertainty inflation, 
is measured in days. Finally, through extensive simulations, 
we chose $\alpha \approx 0.2303$ and $\mu_{\min} \approx -2.8782$, which, in the 
Glicko-1 scale (on which the values were originally set), are 
exactly 40.0 and 1000.0, respectively. 
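
The two hyperparameter values reported above can be checked against the standard Glicko-2 scale conversion, in which a Glicko-1 rating r maps to (r - 1500) / 173.7178 and a rating deviation maps to RD / 173.7178. The sketch below is only a sanity check; the helper names are hypothetical, and the depth value used at the end is an arbitrary illustration.

```python
GLICKO2_SCALE = 173.7178    # scaling constant in the Glicko-2 description
GLICKO1_BASELINE = 1500.0   # Glicko-1 rating that maps to 0 on the Glicko-2 scale

def rating_to_mu(rating):
    """Convert a Glicko-1 rating to the Glicko-2 scale."""
    return (rating - GLICKO1_BASELINE) / GLICKO2_SCALE

def mu_to_rating(mu):
    """Convert a Glicko-2 rating back to the Glicko-1 scale."""
    return GLICKO2_SCALE * mu + GLICKO1_BASELINE

def rd_to_phi(rd):
    """Convert a Glicko-1 rating deviation to the Glicko-2 scale."""
    return rd / GLICKO2_SCALE

mu_min = rating_to_mu(1000.0)   # minimum initial rating on the Glicko-2 scale
alpha = 40.0 / GLICKO2_SCALE    # a rating difference, so no baseline shift
phi_0 = rd_to_phi(350.0)        # initial rating deviation

def initial_rating(depth):
    """Initial rating for a curriculum depth, following Eqs. (12) and (13)."""
    return mu_min + alpha * depth

# Arbitrary illustrative depth of 5: 1000 + 40 * 5 = 1200 on the Glicko-1 scale.
example_rating = mu_to_rating(initial_rating(5))
```

Rounded to four decimals, `mu_min` and `alpha` come out as -2.8782 and 0.2303, matching the values quoted in the text.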


Each unit’s associated curriculum was based on the infor- 
mation provided in our content management system. For 
units appearing in multiple curricula, the earliest curricu- 
lum in the sequence was used. For users, due to the lack 
of availability of exact registration dates for all users at the 
time of the study, each user’s curriculum was set as the cur- 
riculum associated with the first unit attempted by the user. 
The initial parameters for both users and units were then 
set following the procedure described previously. 


3.1 Predictive Performance 

To assess the predictive performance of our adaptation of 
the Glicko-2 rating system, we plotted the change in RMSE 
values for every 1,000 records over time (for the rationale 
behind the metric choice, see [14]). As the latent trait 
estimates are calibrated based on student practice records, we 
expect the RMSE across the entire system to decay over 
time. We see that this is exactly the case in Fig. 4, where 
the calibration curve for our model is also reported along 
with the reliability and resolution values. 
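
For readers who want to reproduce this kind of diagnostic, the sketch below computes a cumulative RMSE sampled at fixed intervals, together with reliability and resolution terms. It assumes that the reported values follow the usual binned decomposition of the Brier score and that the RMSE is cumulative over all records seen so far. Neither detail is spelled out in the text, and the function names, bin count, and toy data are hypothetical.

```python
import math

def windowed_cumulative_rmse(preds, outcomes, every=1000):
    """Cumulative RMSE recorded after every `every` records."""
    points, squared_error = [], 0.0
    for n, (p, y) in enumerate(zip(preds, outcomes), start=1):
        squared_error += (p - y) ** 2
        if n % every == 0:
            points.append(math.sqrt(squared_error / n))
    return points

def reliability_resolution(preds, outcomes, bins=10):
    """Reliability and resolution terms using equal-width probability bins."""
    n = len(preds)
    base_rate = sum(outcomes) / n
    stats = [[0, 0.0, 0.0] for _ in range(bins)]  # count, sum of preds, sum of outcomes
    for p, y in zip(preds, outcomes):
        b = min(int(p * bins), bins - 1)
        stats[b][0] += 1
        stats[b][1] += p
        stats[b][2] += y
    reliability = resolution = 0.0
    for count, sum_p, sum_y in stats:
        if count == 0:
            continue
        mean_p, mean_y = sum_p / count, sum_y / count
        reliability += count * (mean_p - mean_y) ** 2   # lower is better
        resolution += count * (mean_y - base_rate) ** 2  # higher is better
    return reliability / n, resolution / n

# Hypothetical predictions of first-attempt correctness and observed outcomes.
preds = [0.2, 0.2, 0.8, 0.8]
outcomes = [0, 0, 1, 1]
rmse_points = windowed_cumulative_rmse(preds, outcomes, every=2)
reliability, resolution = reliability_resolution(preds, outcomes)
```
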


We also report a convergent pattern in unit rating values 
and dynamically adjusting user rating values, analogous to 
the results obtained in [15], in Fig. 5. 


3.2 Gradation of Unit Rating Distributions 


We also plot the distributions of the final unit rating values 
for each curriculum. We expect that using a conceptual 
precedence graph to initialize rating values would cause the 
central tendencies of the rating distributions to show an 
upward trend as the curriculum level increases. As shown 
in Fig. 6, the final ratings computed without the graph- 
based rating initialization fail to show an upward trend in 
the mean rating values, whereas they do with the graph- 
based rating initialization. Also noteworthy is the complete 
disappearance of overlap in IQR between two curricula far 


[Figure 4 panels. Top: cumulative RMSE (overall RMSE = 0.42448833497119265), plotted at every 1,000 records against the cumulative number of logs processed. Bottom: reliability diagram of observed versus predicted relative frequency (reliability = 0.00149, resolution = 0.04707), shown against the perfectly calibrated line.] 


Figure 4: Top: Cumulative RMSE values calculated 
at every 1,000 records. For effective visualization, 
only results from the first 500,000 records were plot- 
ted. Bottom: Reliability diagram with sharpness 
graph inserted in the lower right. 


Figure 5: Rating values as a function of time for the 5 
most frequently attempted units (top) and for the 
user with the largest number of attempts (bottom). 


Figure 6: Final distributions of knowledge unit ratings. Orange bars indicate medians, green dots indicate 
means. Note that the rating values on the vertical axis are on the Glicko-1 scale. 


apart from each other, such as Grade 2 and Algebra 2, upon 
using a conceptual precedence graph to initialize ratings. 


4. DISCUSSION 


We have used conceptual prerequisite relationships to give 
our model a better prior distribution, one that better 
reflects the stratified nature of student practice data. The 
depth values used to calculate the initial rating values, 
however, are still quite coarse estimates; for example, the 
difference in difficulty between a unit and one of its 
prerequisite units may not be uniform across the conceptual 
precedence graph. Nevertheless, we see that the distribution of 
the lowest-level curriculum (Grade 2 in our study) and that 
of the highest-level one (Algebra 2 in our study) show 
substantially less overlap than when we used the 
initialization method of the original Glicko-2 system, which 
suggests that there was still a nontrivial improvement. Note 
that the unit rating distributions of two adjacent curricula 
(for example, Grade 2 and Grade 3) are not well separated. 
This is expected, as we would not expect a huge jump in 
curriculum difficulty from one school year to the next. 


One interesting area of application of this framework is 
determining the appropriate grade level for students whose 
mathematical achievement levels are substantially ahead of or 
behind their grade levels. With estimates of item difficulties 
that account for grade-level hierarchy, we can have a 
data-based justification that would allow gifted students to be 
placed in a higher-level curriculum that is neither too hard 
nor too easy for them. Likewise, we could allow students 
lagging behind their peers to be placed in a lower-level 
curriculum, where they could ensure that their foundational 
understanding of lower-level mathematical concepts is firm 
before moving on to the next grade level. For this 
application, a separate round of validation with external 
measurements, e.g., standardized test scores, must first take place. 


A well-known limitation of using the Glicko rating system 
family for educational applications is its inability to model 
multiple-choice item correctness probabilities. This is be- 
cause the correctness probability of such an item has an 
infimum strictly greater than 0, making the corresponding 
probability distribution improper. Hence a natural future 
direction would be to address this limitation, e.g., by incor- 
porating the particle-based method presented in [12]. 


Another potential threat to the validity of using the Glicko-2 
model for student ability measurement is its 
unidimensionality assumption. Part of the challenge of verifying 
whether the student response data can be modeled with a 
one-dimensional construct in a learning setting is that un- 
like in IRT settings, a student’s ability is expected to change 
throughout the data collection period. An interesting future 
direction would be to investigate whether there is sufficient 
evidence to suggest that students’ mathematical ability is 
multidimensional, and if so, how a model like the Glicko-2 
rating system can be extended to reflect the multidimen- 
sionality; the degree to which the extension presented in [1] 
can be applied also remains to be seen. 


Also, when assigning each curriculum a depth value, we took 
the average depth value of all its constituent units. 
In practice, however, as a learning software product 
continues to expand, units can be added or removed, or their 
edge connections may change. Our current choice of taking 
an average makes the algorithm sensitive to changes in the 
conceptual precedence graph’s internal connectivity 
structure. The median may be a more robust, and thus more 
practical, choice, though this may come at the risk of decreased 
differentiability across consecutive curricula. 
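
The trade-off can be seen on a small made-up example. The depth values below are hypothetical and chosen only to show how a single added unit with a long prerequisite chain moves the mean more than the median.

```python
from statistics import mean, median

# Hypothetical unit depths within one curriculum.
unit_depths = [3, 4, 4, 5, 5]
# The same curriculum after adding one unit with an unusually large depth.
with_new_unit = unit_depths + [15]

mean_shift = mean(with_new_unit) - mean(unit_depths)        # 6 - 4.2
median_shift = median(with_new_unit) - median(unit_depths)  # 4.5 - 4
```

The mean moves by about 1.8 depth levels while the median moves by 0.5. The flip side is that medians of neighboring curricula are more likely to coincide, which is the reduced differentiability mentioned above.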


5. CONCLUSION 


We have presented an adaptation of the Glicko-2 rating 
system in a K-12 math learning software context. The stratified 
nature of student-item pairings has made effective 
discrimination of students and problems across grade levels 
challenging. We have shown evidence that by using the 
prerequisite relationships between concepts to initialize rating 
values, we can allow for the gradation of rating distributions 
from lower-level to higher-level curricula 
while ensuring that the prediction error for student response 
correctness still decreases over time. A potential area of 
application is determining the grade level appropriate for 
students substantially ahead of or behind their peers. 